A new intra-disk redundancy scheme for high-reliability RAID storage systems in the presence of unrecoverable errors

Published: 28 May 2008
Abstract

Today's data storage systems are increasingly adopting low-cost disk drives that have higher capacity but lower reliability, leading to more frequent rebuilds and to a higher risk of unrecoverable media errors. We propose an efficient intradisk redundancy scheme to enhance the reliability of RAID systems. This scheme introduces an additional level of redundancy inside each disk, on top of the RAID redundancy across multiple disks. The RAID parity provides protection against disk failures, whereas the proposed scheme aims to protect against media-related unrecoverable errors. In particular, we consider an intradisk redundancy architecture that is based on an interleaved parity-check coding scheme, which incurs only negligible I/O performance degradation. A comparison between this coding scheme and schemes based on traditional Reed-Solomon codes and single-parity-check codes is conducted by analytical means. A new model is developed to capture the effect of correlated unrecoverable sector errors. The probability of an unrecoverable failure associated with these schemes is derived for the new correlated model, as well as for the simpler independent error model. We also derive closed-form expressions for the mean time to data loss of RAID-5 and RAID-6 systems in the presence of unrecoverable errors and disk failures. We then combine these results to characterize the reliability of RAID systems that incorporate the intradisk redundancy scheme. Our results show that in the practical case of correlated errors, the interleaved parity-check scheme provides the same reliability as the optimum, albeit more complex, Reed-Solomon coding scheme. Finally, the I/O and throughput performances are evaluated by means of analysis and event-driven simulation.
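
To make the idea of interleaved parity inside a disk concrete, the following is a minimal sketch and not the paper's actual layout, which the abstract does not specify. It assumes a segment of equal-sized data sectors protected by m parity sectors, where parity sector j is the XOR of every data sector whose index is congruent to j modulo m. Under that assumption, any burst of up to m consecutive unreadable sectors hits each interleave at most once and can be rebuilt. The function names (`interleaved_parity`, `recover`) and the use of `None` to mark an unreadable sector are hypothetical choices made for illustration.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length sectors."""
    return bytes(x ^ y for x, y in zip(a, b))


def interleaved_parity(data: list, m: int) -> list:
    """Compute m parity sectors; parity j covers data sectors j, j+m, j+2m, ..."""
    size = len(data[0])
    parity = [bytes(size)] * m  # start from all-zero sectors
    for i, sector in enumerate(data):
        parity[i % m] = xor_bytes(parity[i % m], sector)
    return parity


def recover(data: list, parity: list, m: int) -> list:
    """Rebuild unreadable sectors (marked None) from the interleaved parity.

    Raises ValueError if any interleave has lost more than one sector,
    because a single parity sector can only restore one erasure.
    """
    repaired = list(data)
    for j in range(m):
        members = range(j, len(data), m)
        lost = [i for i in members if data[i] is None]
        if len(lost) > 1:
            raise ValueError(f"interleave {j} lost {len(lost)} sectors")
        if lost:
            acc = parity[j]
            for i in members:
                if data[i] is not None:
                    acc = xor_bytes(acc, data[i])
            repaired[lost[0]] = acc
    return repaired


if __name__ == "__main__":
    # Assumed toy segment: 8 data sectors of 4 bytes each, m = 4 interleaves.
    sectors = [bytes([i] * 4) for i in range(8)]
    parity = interleaved_parity(sectors, 4)

    # A burst of 4 consecutive unreadable sectors touches each interleave once.
    damaged = list(sectors)
    for i in (2, 3, 4, 5):
        damaged[i] = None
    print(recover(damaged, parity, 4) == sectors)
```

In this sketch, two lost sectors that fall in the same interleave (for example, sectors 1 and 5 when m = 4) cannot be rebuilt, which is the kind of case a Reed-Solomon code with the same number of parity sectors would still handle. This matches the abstract's description of Reed-Solomon as the optimum but more complex alternative.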

                • Published in

                  ACM Transactions on Storage  Volume 4, Issue 1
                  May 2008
                  90 pages
                  ISSN:1553-3077
                  EISSN:1553-3093
                  DOI:10.1145/1353452
                  Copyright © 2008 ACM

                  Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

                  Publisher

                  Association for Computing Machinery

                  New York, NY, United States

                  Publication History

                  • Published: 28 May 2008
                  • Revised: 1 January 2008
                  • Accepted: 1 January 2008
                  • Received: 1 October 2007

                  Qualifiers

                  • research-article
                  • Research
                  • Refereed
