Abstract
Distributed storage systems for large clusters typically use replication to provide reliability. Recently, erasure codes have been used to reduce the large storage overhead of three-replicated systems. Reed-Solomon codes are the standard design choice and their high repair cost is often considered an unavoidable price to pay for high storage efficiency and high reliability.
This paper shows how to overcome this limitation. We present a novel family of erasure codes that are efficiently repairable and offer higher reliability compared to Reed-Solomon codes. We show analytically that our codes are optimal on a recently identified tradeoff between locality and minimum distance.
We implement our new codes in Hadoop HDFS and compare to a currently deployed HDFS module that uses Reed-Solomon codes. Our modified HDFS implementation shows a reduction of approximately 2× on the repair disk I/O and repair network traffic. The disadvantage of the new coding scheme is that it requires 14% more storage compared to Reed-Solomon codes, an overhead shown to be information theoretically optimal to obtain locality. Because the new codes repair failures faster, this provides higher reliability, which is orders of magnitude higher compared to replication.
- Amazon EC2. http://aws.amazon.com/ec2/.Google Scholar
- HDFS-RAID wiki. http://wiki.apache.org/hadoop/HDFS-RAID.Google Scholar
- V. Cadambe, S. Jafar, H. Maleki, K. Ramchandran, and C. Suh. Asymptotic interference alignment for optimal repair of mds codes in distributed storage. Submitted to IEEE Transactions on Information Theory, Sep. 2011 (consolidated paper of arXiv:1004.4299 and arXiv:1004.4663).Google Scholar
- B. Calder, J. Wang, A. Ogus, N. Nilakantan, A. Skjolsvold, S. McKelvie, Y. Xu, S. Srivastav, J. Wu, H. Simitci, et al. Windows azure storage: A highly available cloud storage service with strong consistency. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, pages 143-157, 2011. Google Scholar
- M. Chowdhury, M. Zaharia, J. Ma, M. I. Jordan, and I. Stoica. Managing data transfers in computer clusters with orchestra. In SIGCOMM-Computer Communication Review, pages 98-109, 2011. with orchestra. In SIGCOMM-Computer Communication Review, pages 98-109, 2011. Google Scholar
- A. Dimakis, P. Godfrey, Y. Wu, M. Wainwright, and K. Ramchandran. Network coding for distributed storage systems. IEEE Transactions on Information Theory, pages 4539-4551, 2010. Google Scholar
- A. Dimakis, K. Ramchandran, Y. Wu, and C. Suh. A survey on network codes for distributed storage. Proceedings of the IEEE, 99(3):476-489, 2011.Google Scholar
- B. Fan, W. Tantisiriroj, L. Xiao, and G. Gibson. Diskreduce: Raid for data-intensive scalable computing. In Proceedings of the 4th Annual Workshop on Petascale Data Storage, pages 6-10. ACM, 2009. Google Scholar
- D. Ford, F. Labelle, F. Popovici, M. Stokely, V. Truong, L. Barroso, C. Grimes, and S. Quinlan. Availability in globally distributed storage systems. In Proceedings of the 9th USENIX conference on Operating systems design and implementation, pages 1-7, 2010. Google Scholar
- P. Gopalan, C. Huang, H. Simitci, and S. Yekhanin. On the locality of codeword symbols. CoRR, abs/1106.3625, 2011.Google Scholar
- K. Greenan. Reliability and power-efficiency in erasure-coded storage systems. PhD thesis, University of California, Santa Cruz, December 2009.Google Scholar
- K. Greenan, J. Plank, and J. Wylie. Mean time to meaningless: MTTDL, Markov models, and storage system reliability. In HotStorage, 2010. Google Scholar
- A. Greenberg, J. Hamilton, D. A. Maltz, and P. Patel. The cost of a cloud: Research problems in data center networks. Computer Communications Review (CCR), pages 68-73, 2009. Google Scholar
- A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. A. Maltz, P. Patel, and S. Sengupta. VL2: A scalable and flexible data center network. SIGCOMM Comput. Commun. Rev., 39:51-62, Aug. 2009. Google Scholar
- C. Guo, H. Wu, K. Tan, L. Shi, Y. Zhang, and S. Lu. DCell: a scalable and fault-tolerant network structure for data centers. SIGCOMM Comput. Commun. Rev., 38:75-86, August 2008. Google Scholar
- T. Ho, M. Médard, R. Koetter, D. Karger, M. Effros, J. Shi, and B. Leong. A random linear network coding approach to multicast. IEEE Transactions on Information Theory, pages 4413-4430, October 2006. Google Scholar
- C. Huang, M. Chen, and J. Li. Pyramid codes: Flexible schemes to trade space for access efficiency in reliable data storage systems. NCA, 2007.Google Scholar
- S. Jaggi, P. Sanders, P. A. Chou, M. Effros, S. Egner, K. Jain, and L. Tolhuizen. Polynomial time algorithms for multicast network code construction. Information Theory, IEEE Transactions on, 51(6):1973-1982, 2005. Google Scholar
- O. Khan, R. Burns, J. Plank, W. Pierce, and C. Huang. Rethinking erasure codes for cloud file systems: Minimizing I/O for recovery and degraded reads. In FAST 2012. Google Scholar
- O. Khan, R. Burns, J. S. Plank, and C. Huang. In search of I/O-optimal recovery from disk failures. In HotStorage'11: 3rd Workshop on Hot Topics in Storage and File Systems, Portland, June 2011. USENIX. Google Scholar
- D. Narayanan, A. Donnelly, and A. Rowstron. Write off-loading: Practical power management for enterprise storage. ACM Transactions on Storage (TOS), 4(3):10, 2008. Google Scholar
- F. Oggier and A. Datta. Self-repairing homomorphic codes for distributed storage systems. In INFOCOM, 2011 Proceedings IEEE, pages 1215-1223, april 2011.Google Scholar
- D. Papailiopoulos and A. G. Dimakis. Locally repairable codes. In ISIT 2012.Google Scholar
- D. Papailiopoulos, J. Luo, A. Dimakis, C. Huang, and J. Li. Simple regenerating codes: Network coding for cloud storage. Arxiv preprint arXiv:1109.0264, 2011.Google Scholar
- K. Rashmi, N. Shah, and P. Kumar. Optimal exact-regenerating codes for distributed storage at the msr and mbr points via a product-matrix construction. Information Theory, IEEE Transactions on, 57(8):5227-5239, aug. 2011. Google Scholar
- K. Rashmi, N. Shah, and P. Kumar. Optimal exact-regenerating codes for distributed storage at the msr and mbr points via a product-matrix construction. Information Theory, IEEE Transactions on, 57(8):5227-5239, 2011. Google Scholar
- I. Reed and G. Solomon. Polynomial codes over certain finite fields. In Journal of the SIAM, 1960.Google Scholar
- R. Rodrigues and B. Liskov. High availability in dhts: Erasure coding vs. replication. Peer-to-Peer Systems IV, pages 226-239, 2005. Google Scholar
- M. Sathiamoorthy, M. Asteris, D. Papailiopoulous, A. G. Dimakis, R. Vadali, S. Chen, and D. Borthakur. Xoring elephants: Novel erasure codes for big data. USC Technical Report 2012, available online at http://bit.ly/xorbas.Google Scholar
- N. Shah, K. Rashmi, P. Kumar, and K. Ramchandran. Interference alignment in regenerating codes for distributed storage: Necessity and code constructions. Information Theory, IEEE Transactions on, 58(4):2134-2158, 2012.Google Scholar
- I. Tamo, Z. Wang, and J. Bruck. MDS array codes with optimal rebuilding. CoRR, abs/1103.3737, 2011.Google Scholar
- S. B. Wicker and V. K. Bhargava. Reed-solomon codes and their applications. In IEEE Press, 1994. Google Scholar
- Q. Xin, E. Miller, T. Schwarz, D. Long, S. Brandt, and W. Litwin. Reliability mechanisms for very large storage systems. In MSST, pages 146-156. IEEE, 2003. Google Scholar
Index Terms
- XORing elephants: novel erasure codes for big data
Recommendations
Only aggressive elephants are fast elephants
Yellow elephants are slow. A major reason is that they consume their inputs entirely before responding to an elephant rider's orders. Some clever riders have trained their yellow elephants to only consume parts of the inputs before responding. However, ...
Can the elephants handle the NoSQL onslaught?
In this new era of "big data", traditional DBMSs are under attack from two sides. At one end of the spectrum, the use of document store NoSQL systems (e.g. MongoDB) threatens to move modern Web 2.0 applications away from traditional RDBMSs. At the other ...
Coded modulation in the block-fading channel: coding theorems and code construction
We consider coded modulation schemes for the block-fading channel. In the setting where a codeword spans a finite number N of fading degrees of freedom, we show that coded modulations of rate R bit per complex dimension, over a finite signal set ýýC of ...
Comments