article

Improving duplicate elimination in storage systems

Authors:
Deepak R. Bobbarjung

Purdue University, West Lafayette, IN

Purdue University, West Lafayette, IN
View Profile

,
Suresh Jagannathan

Purdue University, West Lafayette, IN

Purdue University, West Lafayette, IN
View Profile

,
Cezary Dubnicki

NEC Laboratories America, Princeton, NJ

NEC Laboratories America, Princeton, NJ
View Profile

Authors Info & Claims

ACM Transactions on Storage Volume 2 Issue 4pp 424–448https://doi.org/10.1145/1210596.1210599

Published:01 November 2006Publication History

ACM Transactions on Storage

Abstract

Minimizing the amount of data that must be stored and managed is a key goal for any storage architecture that purports to be scalable. One way to achieve this goal is to avoid maintaining duplicate copies of the same data. Eliminating redundant data at the source by not writing data which has already been stored not only reduces storage overheads, but can also improve bandwidth utilization. For these reasons, in the face of today's exponentially growing data volumes, redundant data elimination techniques have assumed critical significance in the design of modern storage systems.Intelligent object partitioning techniques identify data that is new when objects are updated, and transfer only these chunks to a storage server. In this article, we propose a new object partitioning technique, called fingerdiff, that improves upon existing schemes in several important respects. Most notably, fingerdiff dynamically chooses a partitioning strategy for a data object based on its similarities with previously stored objects in order to improve storage and bandwidth utilization. We present a detailed evaluation of fingerdiff, and other existing object partitioning schemes, using a set of real-world workloads. We show that for these workloads, the duplicate elimination strategies employed by fingerdiff improve storage utilization on average by 25%, and bandwidth utilization on average by 40% over comparable techniques.

References

Ajtai, M., Burns, R., Fagin, R., Long, D., and Stockmeyer, L. 2000. Compactly encoding unstructured input with differential compression. J. ACM. 49, 3, 318--367.]] Google ScholarDigital Library
Berlekamp, E. R. 1968. Algebraic Coding Theory. McGraw-Hill, New York.]]Google Scholar
Blomer, J., Kalfane, M., Karp, R., Karpinski, M., Luby, M., and Zuckerman, D. 1995. An xor-based erasure-resilient coding scheme. Tech. Rep., International Computer Science Institute, Berkeley, California.]]Google Scholar
Broder, A. 1997. On the resemblance and containment of documents. In Proceedings of the Compression and Complexity of Sequences Conference. IEEE Computer Society, 21.]] Google ScholarDigital Library
Broder, A., Glassman, S., Manasse, M., and Zweig, G. 1997. Syntactic clustering of the web. In Proceedings of the 6th International WWW Conference. 391--404.]] Google ScholarDigital Library
Broder, A. Z. 2000. Identifying and filtering near-duplicate documents. In Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching. Springer Verlag, 1--10.]] Google ScholarDigital Library
Cederqvist, P. 1992. Version management with cvs. http://www.cvshome.org/docs/manual/.]]Google Scholar
Cox, L., Murray, C., and Noble, B. 2002. Pastiche: Making backup cheap and easy. In Proceedings of the 5th USENIX Symposium on Operating Systems Design and Implementation. Boston.]] Google ScholarDigital Library
Douglis, F. and Iyengar, A. 2003. Application-Specific deltaencoding via resemblance detection. In Usenix Annual Technical Conference. 59--72.]]Google Scholar
Douglis, P. K. F., LaVoie, J., and Tracey, J. M. 2004. Redundancy elimination within large collections of files. In Usenix Annual Technical Conference. 59--72.]] Google ScholarDigital Library
Goldberg, A. V. and Yianilos, P. N. 1998. Towards an archival intermemory. In Proceedings of the Advances in Digital Libraries Conference.]] Google ScholarDigital Library
Hong, B., Plantenberg, D., Long, D. D. E., and Sivan-Zimet, M. 2004. Duplicate data elimination in a san file system. In Proceedings of the 21st IEEE/12th NASA Goddard Conference on Mass Storage Systems and Technologies (MSST). 301--314.]]Google Scholar
Hunt, J. J., Vo, K.-P., and Tichy, W. F. 1998. Delta algorithms an empirical analysis. ACM Trans. Softw. Eng. Methodol. 7, 2, 192--214.]] Google ScholarDigital Library
Jain, N., Dahlin, M., and Tewari, R. 2005. Taper: Tiered approach for eliminating redundancy in replica sychronization. In Proceedings of the 4th Usenix Conference on File and Storage Technologies (FAST).]] Google ScholarDigital Library
Kubiatowicz, J., Bindel, D., Chen, Y., Czerwinski, S., Eaton, P., Geels, D., Gummadi, R., Rhea, S., Weatherspoon, H., Weimer, W., Wells, C., and Zhao, B. 2000. Oceanstore: An architecture for global store persistent storage. In Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). Cambridge, MA.]] Google ScholarDigital Library
Lelewer, D. A. and Hirschberg, D. S. 1987. Data compression. ACM Comput., Surv. 3, 261--296.]] Google ScholarDigital Library
Lv, Q., Cao, P., Cohen, E., Li, K., and Shenker, S. 2002. Search and replication in unstructured peer-to-peer networks. In Proceedings of the 16th International Conference on Supercomputing. ACM, New York, 84--95.]] Google ScholarDigital Library
Manber, U. 1994. Finding similar files in a large file system. In Usenix Winter Conference. 1--10.]] Google ScholarDigital Library
Muthitacharoen, A., Chen, B., and Mazieres, D. 2001. A low-bandwidth network file system. In Symposium on Operating Systems Principles. 174--187.]] Google ScholarDigital Library
National Institute of Standards and Technology FIPS PUB 180-1. 1995. Secure hash standard.]]Google Scholar
Ouyang, Z., Memon, N., Suel, T., and Trendafilov, D. 2006. Cluster-Based delta compression of a collection of files. In International Conference on Web Information Systems Engineering (WISE).]] Google ScholarDigital Library
Policroniades, C. and Pratt, I. 2004. Alternatives for detecting redundancy in storage systems data. In Usenix Annual Technical Conference. 73--86.]] Google ScholarDigital Library
Quinlan, S. and Dorwards, S. 2002. Venti: A new approach to archival storage. In Usenix Conference on File and Storage Technologies.]] Google ScholarDigital Library
Rabin, M. 1981. Fingerprinting by Random Polynomials. Tech. Rep. TR-15-81, Center for Research in Computing Technology, Harvard University.]]Google Scholar
Rochkind, M. J. 1975. The source code control system. IEEE Trans. Softw. Eng. 1, 4, 364--370.]]Google ScholarDigital Library
Shivakumar, N. and García-Molina, H. 1995. SCAM: A copy detection mechanism for digital documents. In Proceedings of the 2nd Annual Conference on the Theory and Practice of Digital Libraries.]]Google Scholar
Tichy, W. F. 1984. String to string correction problem with block moves. ACM Trans. Softw. Eng. 2, 4 (Dec.), 364--370.]] Google ScholarDigital Library
Tichy, W. F. 1985. RCS---A system for version control. Softw.---Pract. Exper. 15, 7, 637--654.]] Google ScholarDigital Library
W. J. Bolosky, S. Corbin, D. G. and Douceur, J. R. Single instance storage in windows 2000. In Usenix Annual Technical Conference.]] Google ScholarDigital Library
Weatherspoon, H. and Kubiatowicz, J. 2002. Erasure coding vs. replication: A quantitative comparison. In 1st International Workshop on Peer-to-Peer Systems. Cambridge, MA.]] Google ScholarDigital Library
You, L. L. and Karamanolis, C. 2004. Evaluation of efficient archival storage techniques. In Proceedings of the 21st IEEE Symposium on Mass Storage Systems and Technologies (MSST).]]Google Scholar
Ziv, J. and Lempel, A. 1977. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theor. 23, 3, 337--343.]]Google ScholarDigital Library

Index Terms

Improving duplicate elimination in storage systems
1. Information systems
  1. Information retrieval
    1. Document representation
  2. Information storage systems
    1. Record storage systems

Recommendations

Duplicate record elimination in large data files

The issue of duplicate elimination for large data files in which many occurrences of the same record may appear is addressed. A comprehensive cost analysis of the duplicate elimination operation is presented. This analysis is based on a combinatorial ...
Read More
Breaking Down Memory Walls in LSM-based Storage Systems
SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data

The log-structured merge-tree (LSM-tree) [16, 18] is widely used in modern NoSQL systems. Different from traditional update-in-place structures, an LSM-tree first buffers all writes in memory, which are subsequently flushed to disk when memory is full. ...
Read More
Analysis of striping techniques in robotic storage libraries
MSS '95: Proceedings of the 14th IEEE Symposium on Mass Storage Systems

In recent years advances in computational speed have been the main focus of research and development in high performance computing. In comparison, the improvement in I/O performance has been modest. Faster processing speeds have created a need for ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

ACM Transactions on Storage Volume 2, Issue 4
November 2006
133 pages
ISSN:1553-3077
EISSN:1553-3093
DOI:10.1145/1210596
Issue’s Table of Contents

Copyright © 2006 ACM
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 November 2006
Published in tos Volume 2, Issue 4

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Rabin's fingerprints
Storage management
content-based addressing
duplicate elimination
Qualifiers
- article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 116
  Total Citations
  View Citations
- 1,332
  Total Downloads
- Downloads (Last 12 months)33
- Downloads (Last 6 weeks)5
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Improving duplicate elimination in storage systems

ACM Transactions on Storage

Abstract

References

Cited By

Index Terms

Recommendations

Duplicate record elimination in large data files

Breaking Down Memory Walls in LSM-based Storage Systems

Analysis of striping techniques in robotic storage libraries

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Improving duplicate elimination in storage systems

ACM Transactions on Storage

Abstract

References

Cited By

Index Terms

Recommendations

Duplicate record elimination in large data files

Breaking Down Memory Walls in LSM-based Storage Systems

Analysis of striping techniques in robotic storage libraries

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media