A new intra-disk redundancy scheme for high-reliability RAID storage systems in the presence of unrecoverable errors

Authors:
Ajay Dholakia, IBM Systems and Technology Group, Research Triangle Park, NC
Evangelos Eleftheriou, IBM Zurich Research Laboratory, Rüschlikon, Switzerland
Xiao-Yu Hu, IBM Zurich Research Laboratory, Rüschlikon, Switzerland
Ilias Iliadis, IBM Zurich Research Laboratory, Rüschlikon, Switzerland
Jai Menon, IBM Systems and Technology Group, San Jose, CA
K.K. Rao, IBM Almaden Research Center, San Jose, CA

ACM Transactions on Storage, Volume 4, Issue 1, Article No. 1, pp. 1–42. https://doi.org/10.1145/1353452.1353453

Published: 28 May 2008

Abstract

Today's data storage systems are increasingly adopting low-cost disk drives that have higher capacity but lower reliability, leading to more frequent rebuilds and to a higher risk of unrecoverable media errors. We propose an efficient intradisk redundancy scheme to enhance the reliability of RAID systems. This scheme introduces an additional level of redundancy inside each disk, on top of the RAID redundancy across multiple disks. The RAID parity provides protection against disk failures, whereas the proposed scheme aims to protect against media-related unrecoverable errors. In particular, we consider an intradisk redundancy architecture that is based on an interleaved parity-check coding scheme, which incurs only negligible I/O performance degradation. A comparison between this coding scheme and schemes based on traditional Reed-Solomon codes and single-parity-check codes is conducted by analytical means. A new model is developed to capture the effect of correlated unrecoverable sector errors. The probability of an unrecoverable failure associated with these schemes is derived for the new correlated model, as well as for the simpler independent error model. We also derive closed-form expressions for the mean time to data loss of RAID-5 and RAID-6 systems in the presence of unrecoverable errors and disk failures. We then combine these results to characterize the reliability of RAID systems that incorporate the intradisk redundancy scheme. Our results show that in the practical case of correlated errors, the interleaved parity-check scheme provides the same reliability as the optimum, albeit more complex, Reed-Solomon coding scheme. Finally, the I/O and throughput performances are evaluated by means of analysis and event-driven simulation.
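
The interleaved parity-check idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract gives no layout details, so everything below is an assumption made for illustration: a segment is modeled as a list of equal-sized data sectors, parity sector j is the XOR of the data sectors whose index i satisfies i mod m = j, and erased sectors are marked with `None`. The function names `ipc_encode` and `ipc_recover` and the parameter `m` are hypothetical, not taken from the article.

```python
def xor_sectors(sectors, size):
    """XOR together a list of equal-length byte strings."""
    out = bytearray(size)
    for sector in sectors:
        for k, byte in enumerate(sector):
            out[k] ^= byte
    return bytes(out)


def ipc_encode(data, m):
    """Return m parity sectors for one segment.

    Assumed layout: parity j covers data sectors j, j + m, j + 2m, ...
    """
    size = len(data[0])
    return [xor_sectors(data[j::m], size) for j in range(m)]


def ipc_recover(data, parity, m):
    """Rebuild erased data sectors (marked None) from intact parity sectors.

    Each interleave is a single-parity-check code, so it can repair at most
    one erased sector; a second loss in the same interleave raises ValueError.
    """
    size = len(parity[0])
    repaired = list(data)
    for j in range(m):
        members = range(j, len(data), m)
        lost = [i for i in members if data[i] is None]
        if len(lost) > 1:
            raise ValueError(f"interleave {j} lost {len(lost)} sectors")
        if lost:
            survivors = [data[i] for i in members if data[i] is not None]
            repaired[lost[0]] = xor_sectors(survivors + [parity[j]], size)
    return repaired
```

Under this assumed layout, any burst of up to m consecutive erased data sectors touches each interleave at most once and is therefore repairable with XOR alone, which is one plausible reading of why such a scheme suits correlated (bursty) sector errors. The sketch assumes the parity sectors themselves are intact and does not model where they are placed on disk.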


Published in

ACM Transactions on Storage Volume 4, Issue 1
May 2008
90 pages
ISSN: 1553-3077
EISSN: 1553-3093
DOI: 10.1145/1353452

Copyright © 2008 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 28 May 2008
- Revised: 1 January 2008
- Accepted: 1 January 2008
- Received: 1 October 2007
Author Tags
File and I/O systems
RAID
reliability analysis
stochastic modeling
Qualifiers
- research-article
- Research
- Refereed