Abstract
We study cloud storage systems with a very large number of files stored in a very large number of servers. In such systems, files are either replicated or coded to ensure reliability, i.e., to guarantee file recovery from server failures. This redundancy in storage can further be exploited to improve system performance (mean file-access delay) through appropriate load-balancing (routing) schemes. However, it is unclear whether coding or replication is better from a system performance perspective since the corresponding queueing analysis of such systems is, in general, quite difficult except for the trivial case when the system load asymptotically tends to zero. Here, we study the more difficult case where the system load is not asymptotically zero. Using the fact that the system size is large, we obtain a mean-field limit for the steady-state distribution of the number of file access requests waiting at each server. We then use the mean-field limit to show that, for a given storage capacity per file, coding strictly outperforms replication at all traffic loads while improving reliability. Further, the factor by which the performance improves in the heavy traffic is at least as large as in the light-traffic case. Finally, we validate these results through extensive simulations.
- G. Ananthanarayanan, A. Ghodsi, S. Shenker, and I. Stoica. 2012. Why let resources idle? Aggressive cloning of jobs with Dolly. In Proceedings of the USENIX Conference on Hot Topics in Cloud Computing (HotCloud’12). Google ScholarDigital Library
- M. Bramson, Y. Lu, and B. Prabhakar. 2010. Randomized load balancing with general service time distributions. In Proceedings of the ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRIC’10). Google ScholarDigital Library
- S. Chen, Y. Sun, U. C. Kozat, L. Huang, P. Sinha, G. Liang, X. Liu, and N. B. Shroff. 2014. When queueing meets coding: Optimal-latency data retrieving scheme in storage clouds. In Proceedings of the IEEE International Conference on Computer Communications (INFOCOM’14).Google Scholar
- K. Gardner, M. Harchol-Balter, M. Velednitsky A. Scheller-Wolf, and S. Zbarsky. 2017. Redundancy-d: The power of d choices for redundancy. Operations Research 65, 4 (2017), 1078--1094.Google ScholarDigital Library
- K. Gardner, S. Zbarsky, S. Doroudi, M. Harchol-Balter, E. Hyytia, and A. Scheller-Wolf.2015. Reducing latency via redundant requests: Exact analysis. In Proceedings of the ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS’15). Google ScholarDigital Library
- B. Hajek. 2006. Notes for ECE 467: Communication Network Analysis. University of Illinois at Urbana-Champaign.Google Scholar
- L. Huang, S. Pawar, H. Zhang, and K. Ramchandran. 2012. Codes can reduce queueing delay in data centers. In Proceedings of the IEEE International Symposium on Information Theory (ISIT’12).Google Scholar
- S. Jain, M. Demmer, R. Patra, and K. Fall. 2005. Using redundancy to cope with failures in a delay tolerant network. In Proceedings of the ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications (SIGCOMM’05). Google ScholarDigital Library
- G. Joshi, Y. Liu, and E. Soljanin. 2012. Coding for fast content download. In Proceedings of the 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 326--333.Google Scholar
- G. Joshi, Y. Liu, and E. Soljanin. 2014. On the delay-storage trade-off in content download from coded distributed storage systems. IEEE J. Sel. Areas Commun. 32, 5 (2014), 989--997.Google ScholarCross Ref
- G. Joshi, E. Soljanin, and G. Wornell. 2015a. Efficient replication of queued tasks to reduce latency in cloud systems. In Proceedings of the 53rd Annual Allerton Conference on Communication, Control, and Computing.Google Scholar
- G. Joshi, E. Soljanin, and G. Wornell. 2015b. Queues with redundancy: Latency-cost analysis. ACM SIGMETRICS Perform. Eval. Rev. 43, 2 (2015), 54--56. Google ScholarDigital Library
- S. Kadhe, E. Soljanin, and A. Sprintson. 2015. Analyzing the download time of availability codes. In Proceedings of the 2015 IEEE International Symposium on Information Theory (ISIT’15). IEEE, 1467--1471.Google Scholar
- D. Knuth, R. Graham, O. Patashnik, and others. 1989. Concrete Mathematics.Adison-Wesley.Google Scholar
- B. Li, A. Ramamoorthy, and R. Srikant. 2016. Mean-field-analysis of coding versus replication in cloud storage systems. In Proceedings of the IEEE International Conference on Computer Communications (INFOCOM’16).Google Scholar
- B. Li, A. Ramamoorthy, and R. Srikant. 2017. Mean-field-analysis of coding versus replication in cloud storage systems. http://www.ele.uri.edu/faculty/binli/papers/StorageCodingReport.pdf.Google Scholar
- G. Liang and U. C. Kozat. 2014. TOFEC: Achieving optimal throughput-delay trade-off of cloud storage using erasure codes. In Proceedings of the IEEE International Conference on Computer Communications (INFOCOM’14).Google Scholar
- S. Lin and D. J. Costello. 2004. Error Control Coding (2nd ed.). Prentice Hall. Google ScholarDigital Library
- Y. Lu, Q. Xie, G. Kliot, A. Geller, J. R. Larus, and A. Greenberg. 2011. Join-idle-queue: A novel load balancing algorithm for dynamically scalable web services. Perform. Eval. 68, 11 (2011), 1056--1071. Google ScholarDigital Library
- M. Lugo. 2011. A Note for Stat 134 Fall 2011: The Expectation of the Maximum of Exponentials. University of California at Berkeley.Google Scholar
- M. Mitzenmacher. 1996. The Power of Two Choices in Randomized Load Balancing. Ph.D. thesis, University of California at Berkeley. Google ScholarDigital Library
- K. Ousterhout, P. Wendell, M. Zaharia, and I. Stoica. 2013. Sparrow: Distributed, low latency scheduling. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP’13). Google ScholarDigital Library
- S. Ross. 1995. Stochastic Processes. John Wiley 8 Sons.Google Scholar
- S. Ross. 2014. Introduction to Probability Models. Academic Press. Google ScholarDigital Library
- N. B. Shah, K. Lee, and K. Ramchandran. 2013. When do redundant requests reduce latency? In Proceedings of the Allerton Conference on Communication, Control, and Computing (Allerton).Google Scholar
- N. B. Shah, K. Lee, and K. Ramchandran. 2014. The MDS queue: Analysing the latency performance of erasure codes. In Proceedings of the IEEE International Symposium on Information Theory (ISIT’14).Google Scholar
- R. Srikant and L. Ying. 2013. Communication Networks: An Optimization, Control, and Stochastic Networks Perspective. Cambridge University Press. Google ScholarDigital Library
- A. L. Stolyar. 2015. Pull-based load distribution in large-scale heterogeneous service systems. Queueing Syst. 80, 11 (2015), 341--361. Google ScholarDigital Library
- Y. Sun, Z. Zheng, C. E. Koksal, K. Kim, and N. B. Shroff. 2015. Provably delay efficient data retrieving in storage clouds. In Proceedings of the IEEE International Conference on Computer Communications (INFOCOM’15).Google Scholar
- A. Vulimiri, P. B. Godfrey, R. Mittal, J. Sherry, S. Ratnasamy, and S. Shenker. 2013. Low latency via redundancy. In Proceedings of the ACM Conference on Emerging Networking Experiments and Technologies (CoNEXT’13). Google ScholarDigital Library
- N. D. Vvedenskaya, R. L. Dobrushin, and F. I. Karpelevich. 1996. Queueing system with selection of the shortest of two queues: An asymptotic approach. Probl. Peredachi Inform. 32, 1 (1996), 20--34.Google Scholar
- Y. Xiang, T. Lan, V. Aggarwal, and Y. F. R. Chen. 2014. Joint latency and cost optimization for erasure-coded data center storage. ACM SIGMETRICS Perform. Eval. Rev. 42, 2 (2014), 3--14. Google ScholarDigital Library
- L. Ying, R. Srikant, and X. Kang. 2015. The power of slightly more than one sample in randomized load balancing. In Proceedings of the IEEE International Conference on Computer Communications (INFOCOM’15).Google Scholar
Index Terms
- Mean-Field Analysis of Coding Versus Replication in Large Data Storage Systems
Recommendations
Mean-field-analysis of coding versus replication in cloud storage systems
IEEE INFOCOM 2016 - The 35th Annual IEEE International Conference on Computer CommunicationsWe study cloud-storage systems with a very large number of files stored in a very large number of servers. In such systems, files are either replicated or coded to ensure reliability, i.e., file recovery from server failures. This redundancy in storage ...
Deterministic load-balancing schemes for disk-based video-on-demand storage servers
MSS '95: Proceedings of the 14th IEEE Symposium on Mass Storage SystemsA video-on-demand (VOD) storage server is a parallel, storage-centric system used for playing a large number of relatively slow streams of compressed digitized video and audio concurrently. Data is read from disks in relatively large chunks, and is then ...
Comments