skip to main content
research-article
Public Access

Mean-Field Analysis of Coding Versus Replication in Large Data Storage Systems

Published:13 February 2018Publication History
Skip Abstract Section

Abstract

We study cloud storage systems with a very large number of files stored in a very large number of servers. In such systems, files are either replicated or coded to ensure reliability, i.e., to guarantee file recovery from server failures. This redundancy in storage can further be exploited to improve system performance (mean file-access delay) through appropriate load-balancing (routing) schemes. However, it is unclear whether coding or replication is better from a system performance perspective since the corresponding queueing analysis of such systems is, in general, quite difficult except for the trivial case when the system load asymptotically tends to zero. Here, we study the more difficult case where the system load is not asymptotically zero. Using the fact that the system size is large, we obtain a mean-field limit for the steady-state distribution of the number of file access requests waiting at each server. We then use the mean-field limit to show that, for a given storage capacity per file, coding strictly outperforms replication at all traffic loads while improving reliability. Further, the factor by which the performance improves in the heavy traffic is at least as large as in the light-traffic case. Finally, we validate these results through extensive simulations.

References

  1. G. Ananthanarayanan, A. Ghodsi, S. Shenker, and I. Stoica. 2012. Why let resources idle? Aggressive cloning of jobs with Dolly. In Proceedings of the USENIX Conference on Hot Topics in Cloud Computing (HotCloud’12). Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. M. Bramson, Y. Lu, and B. Prabhakar. 2010. Randomized load balancing with general service time distributions. In Proceedings of the ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRIC’10). Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. S. Chen, Y. Sun, U. C. Kozat, L. Huang, P. Sinha, G. Liang, X. Liu, and N. B. Shroff. 2014. When queueing meets coding: Optimal-latency data retrieving scheme in storage clouds. In Proceedings of the IEEE International Conference on Computer Communications (INFOCOM’14).Google ScholarGoogle Scholar
  4. K. Gardner, M. Harchol-Balter, M. Velednitsky A. Scheller-Wolf, and S. Zbarsky. 2017. Redundancy-d: The power of d choices for redundancy. Operations Research 65, 4 (2017), 1078--1094.Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. K. Gardner, S. Zbarsky, S. Doroudi, M. Harchol-Balter, E. Hyytia, and A. Scheller-Wolf.2015. Reducing latency via redundant requests: Exact analysis. In Proceedings of the ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS’15). Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. B. Hajek. 2006. Notes for ECE 467: Communication Network Analysis. University of Illinois at Urbana-Champaign.Google ScholarGoogle Scholar
  7. L. Huang, S. Pawar, H. Zhang, and K. Ramchandran. 2012. Codes can reduce queueing delay in data centers. In Proceedings of the IEEE International Symposium on Information Theory (ISIT’12).Google ScholarGoogle Scholar
  8. S. Jain, M. Demmer, R. Patra, and K. Fall. 2005. Using redundancy to cope with failures in a delay tolerant network. In Proceedings of the ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications (SIGCOMM’05). Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. G. Joshi, Y. Liu, and E. Soljanin. 2012. Coding for fast content download. In Proceedings of the 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 326--333.Google ScholarGoogle Scholar
  10. G. Joshi, Y. Liu, and E. Soljanin. 2014. On the delay-storage trade-off in content download from coded distributed storage systems. IEEE J. Sel. Areas Commun. 32, 5 (2014), 989--997.Google ScholarGoogle ScholarCross RefCross Ref
  11. G. Joshi, E. Soljanin, and G. Wornell. 2015a. Efficient replication of queued tasks to reduce latency in cloud systems. In Proceedings of the 53rd Annual Allerton Conference on Communication, Control, and Computing.Google ScholarGoogle Scholar
  12. G. Joshi, E. Soljanin, and G. Wornell. 2015b. Queues with redundancy: Latency-cost analysis. ACM SIGMETRICS Perform. Eval. Rev. 43, 2 (2015), 54--56. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. S. Kadhe, E. Soljanin, and A. Sprintson. 2015. Analyzing the download time of availability codes. In Proceedings of the 2015 IEEE International Symposium on Information Theory (ISIT’15). IEEE, 1467--1471.Google ScholarGoogle Scholar
  14. D. Knuth, R. Graham, O. Patashnik, and others. 1989. Concrete Mathematics.Adison-Wesley.Google ScholarGoogle Scholar
  15. B. Li, A. Ramamoorthy, and R. Srikant. 2016. Mean-field-analysis of coding versus replication in cloud storage systems. In Proceedings of the IEEE International Conference on Computer Communications (INFOCOM’16).Google ScholarGoogle Scholar
  16. B. Li, A. Ramamoorthy, and R. Srikant. 2017. Mean-field-analysis of coding versus replication in cloud storage systems. http://www.ele.uri.edu/faculty/binli/papers/StorageCodingReport.pdf.Google ScholarGoogle Scholar
  17. G. Liang and U. C. Kozat. 2014. TOFEC: Achieving optimal throughput-delay trade-off of cloud storage using erasure codes. In Proceedings of the IEEE International Conference on Computer Communications (INFOCOM’14).Google ScholarGoogle Scholar
  18. S. Lin and D. J. Costello. 2004. Error Control Coding (2nd ed.). Prentice Hall. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Y. Lu, Q. Xie, G. Kliot, A. Geller, J. R. Larus, and A. Greenberg. 2011. Join-idle-queue: A novel load balancing algorithm for dynamically scalable web services. Perform. Eval. 68, 11 (2011), 1056--1071. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. M. Lugo. 2011. A Note for Stat 134 Fall 2011: The Expectation of the Maximum of Exponentials. University of California at Berkeley.Google ScholarGoogle Scholar
  21. M. Mitzenmacher. 1996. The Power of Two Choices in Randomized Load Balancing. Ph.D. thesis, University of California at Berkeley. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. K. Ousterhout, P. Wendell, M. Zaharia, and I. Stoica. 2013. Sparrow: Distributed, low latency scheduling. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP’13). Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. S. Ross. 1995. Stochastic Processes. John Wiley 8 Sons.Google ScholarGoogle Scholar
  24. S. Ross. 2014. Introduction to Probability Models. Academic Press. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. N. B. Shah, K. Lee, and K. Ramchandran. 2013. When do redundant requests reduce latency? In Proceedings of the Allerton Conference on Communication, Control, and Computing (Allerton).Google ScholarGoogle Scholar
  26. N. B. Shah, K. Lee, and K. Ramchandran. 2014. The MDS queue: Analysing the latency performance of erasure codes. In Proceedings of the IEEE International Symposium on Information Theory (ISIT’14).Google ScholarGoogle Scholar
  27. R. Srikant and L. Ying. 2013. Communication Networks: An Optimization, Control, and Stochastic Networks Perspective. Cambridge University Press. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. A. L. Stolyar. 2015. Pull-based load distribution in large-scale heterogeneous service systems. Queueing Syst. 80, 11 (2015), 341--361. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Y. Sun, Z. Zheng, C. E. Koksal, K. Kim, and N. B. Shroff. 2015. Provably delay efficient data retrieving in storage clouds. In Proceedings of the IEEE International Conference on Computer Communications (INFOCOM’15).Google ScholarGoogle Scholar
  30. A. Vulimiri, P. B. Godfrey, R. Mittal, J. Sherry, S. Ratnasamy, and S. Shenker. 2013. Low latency via redundancy. In Proceedings of the ACM Conference on Emerging Networking Experiments and Technologies (CoNEXT’13). Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. N. D. Vvedenskaya, R. L. Dobrushin, and F. I. Karpelevich. 1996. Queueing system with selection of the shortest of two queues: An asymptotic approach. Probl. Peredachi Inform. 32, 1 (1996), 20--34.Google ScholarGoogle Scholar
  32. Y. Xiang, T. Lan, V. Aggarwal, and Y. F. R. Chen. 2014. Joint latency and cost optimization for erasure-coded data center storage. ACM SIGMETRICS Perform. Eval. Rev. 42, 2 (2014), 3--14. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. L. Ying, R. Srikant, and X. Kang. 2015. The power of slightly more than one sample in randomized load balancing. In Proceedings of the IEEE International Conference on Computer Communications (INFOCOM’15).Google ScholarGoogle Scholar

Index Terms

  1. Mean-Field Analysis of Coding Versus Replication in Large Data Storage Systems

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in

      Full Access

      • Published in

        cover image ACM Transactions on Modeling and Performance Evaluation of Computing Systems
        ACM Transactions on Modeling and Performance Evaluation of Computing Systems  Volume 3, Issue 1
        March 2018
        124 pages
        ISSN:2376-3639
        EISSN:2376-3647
        DOI:10.1145/3186330
        • Editors:
        • Sem Borst,
        • Carey Williamson
        Issue’s Table of Contents

        Copyright © 2018 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 13 February 2018
        • Accepted: 1 November 2017
        • Revised: 1 July 2017
        • Received: 1 February 2017
        Published in tompecs Volume 3, Issue 1

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article
        • Research
        • Refereed

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader