Abstract
Minimizing the amount of data that must be stored and managed is a key goal for any storage architecture that purports to be scalable. One way to achieve this goal is to avoid maintaining duplicate copies of the same data. Eliminating redundant data at the source by not writing data which has already been stored not only reduces storage overheads, but can also improve bandwidth utilization. For these reasons, in the face of today's exponentially growing data volumes, redundant data elimination techniques have assumed critical significance in the design of modern storage systems.Intelligent object partitioning techniques identify data that is new when objects are updated, and transfer only these chunks to a storage server. In this article, we propose a new object partitioning technique, called fingerdiff, that improves upon existing schemes in several important respects. Most notably, fingerdiff dynamically chooses a partitioning strategy for a data object based on its similarities with previously stored objects in order to improve storage and bandwidth utilization. We present a detailed evaluation of fingerdiff, and other existing object partitioning schemes, using a set of real-world workloads. We show that for these workloads, the duplicate elimination strategies employed by fingerdiff improve storage utilization on average by 25%, and bandwidth utilization on average by 40% over comparable techniques.
- Ajtai, M., Burns, R., Fagin, R., Long, D., and Stockmeyer, L. 2000. Compactly encoding unstructured input with differential compression. J. ACM. 49, 3, 318--367.]] Google ScholarDigital Library
- Berlekamp, E. R. 1968. Algebraic Coding Theory. McGraw-Hill, New York.]]Google Scholar
- Blomer, J., Kalfane, M., Karp, R., Karpinski, M., Luby, M., and Zuckerman, D. 1995. An xor-based erasure-resilient coding scheme. Tech. Rep., International Computer Science Institute, Berkeley, California.]]Google Scholar
- Broder, A. 1997. On the resemblance and containment of documents. In Proceedings of the Compression and Complexity of Sequences Conference. IEEE Computer Society, 21.]] Google ScholarDigital Library
- Broder, A., Glassman, S., Manasse, M., and Zweig, G. 1997. Syntactic clustering of the web. In Proceedings of the 6th International WWW Conference. 391--404.]] Google ScholarDigital Library
- Broder, A. Z. 2000. Identifying and filtering near-duplicate documents. In Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching. Springer Verlag, 1--10.]] Google ScholarDigital Library
- Cederqvist, P. 1992. Version management with cvs. http://www.cvshome.org/docs/manual/.]]Google Scholar
- Cox, L., Murray, C., and Noble, B. 2002. Pastiche: Making backup cheap and easy. In Proceedings of the 5th USENIX Symposium on Operating Systems Design and Implementation. Boston.]] Google ScholarDigital Library
- Douglis, F. and Iyengar, A. 2003. Application-Specific deltaencoding via resemblance detection. In Usenix Annual Technical Conference. 59--72.]]Google Scholar
- Douglis, P. K. F., LaVoie, J., and Tracey, J. M. 2004. Redundancy elimination within large collections of files. In Usenix Annual Technical Conference. 59--72.]] Google ScholarDigital Library
- Goldberg, A. V. and Yianilos, P. N. 1998. Towards an archival intermemory. In Proceedings of the Advances in Digital Libraries Conference.]] Google ScholarDigital Library
- Hong, B., Plantenberg, D., Long, D. D. E., and Sivan-Zimet, M. 2004. Duplicate data elimination in a san file system. In Proceedings of the 21st IEEE/12th NASA Goddard Conference on Mass Storage Systems and Technologies (MSST). 301--314.]]Google Scholar
- Hunt, J. J., Vo, K.-P., and Tichy, W. F. 1998. Delta algorithms an empirical analysis. ACM Trans. Softw. Eng. Methodol. 7, 2, 192--214.]] Google ScholarDigital Library
- Jain, N., Dahlin, M., and Tewari, R. 2005. Taper: Tiered approach for eliminating redundancy in replica sychronization. In Proceedings of the 4th Usenix Conference on File and Storage Technologies (FAST).]] Google ScholarDigital Library
- Kubiatowicz, J., Bindel, D., Chen, Y., Czerwinski, S., Eaton, P., Geels, D., Gummadi, R., Rhea, S., Weatherspoon, H., Weimer, W., Wells, C., and Zhao, B. 2000. Oceanstore: An architecture for global store persistent storage. In Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). Cambridge, MA.]] Google ScholarDigital Library
- Lelewer, D. A. and Hirschberg, D. S. 1987. Data compression. ACM Comput., Surv. 3, 261--296.]] Google ScholarDigital Library
- Lv, Q., Cao, P., Cohen, E., Li, K., and Shenker, S. 2002. Search and replication in unstructured peer-to-peer networks. In Proceedings of the 16th International Conference on Supercomputing. ACM, New York, 84--95.]] Google ScholarDigital Library
- Manber, U. 1994. Finding similar files in a large file system. In Usenix Winter Conference. 1--10.]] Google ScholarDigital Library
- Muthitacharoen, A., Chen, B., and Mazieres, D. 2001. A low-bandwidth network file system. In Symposium on Operating Systems Principles. 174--187.]] Google ScholarDigital Library
- National Institute of Standards and Technology FIPS PUB 180-1. 1995. Secure hash standard.]]Google Scholar
- Ouyang, Z., Memon, N., Suel, T., and Trendafilov, D. 2006. Cluster-Based delta compression of a collection of files. In International Conference on Web Information Systems Engineering (WISE).]] Google ScholarDigital Library
- Policroniades, C. and Pratt, I. 2004. Alternatives for detecting redundancy in storage systems data. In Usenix Annual Technical Conference. 73--86.]] Google ScholarDigital Library
- Quinlan, S. and Dorwards, S. 2002. Venti: A new approach to archival storage. In Usenix Conference on File and Storage Technologies.]] Google ScholarDigital Library
- Rabin, M. 1981. Fingerprinting by Random Polynomials. Tech. Rep. TR-15-81, Center for Research in Computing Technology, Harvard University.]]Google Scholar
- Rochkind, M. J. 1975. The source code control system. IEEE Trans. Softw. Eng. 1, 4, 364--370.]]Google ScholarDigital Library
- Shivakumar, N. and García-Molina, H. 1995. SCAM: A copy detection mechanism for digital documents. In Proceedings of the 2nd Annual Conference on the Theory and Practice of Digital Libraries.]]Google Scholar
- Tichy, W. F. 1984. String to string correction problem with block moves. ACM Trans. Softw. Eng. 2, 4 (Dec.), 364--370.]] Google ScholarDigital Library
- Tichy, W. F. 1985. RCS---A system for version control. Softw.---Pract. Exper. 15, 7, 637--654.]] Google ScholarDigital Library
- W. J. Bolosky, S. Corbin, D. G. and Douceur, J. R. Single instance storage in windows 2000. In Usenix Annual Technical Conference.]] Google ScholarDigital Library
- Weatherspoon, H. and Kubiatowicz, J. 2002. Erasure coding vs. replication: A quantitative comparison. In 1st International Workshop on Peer-to-Peer Systems. Cambridge, MA.]] Google ScholarDigital Library
- You, L. L. and Karamanolis, C. 2004. Evaluation of efficient archival storage techniques. In Proceedings of the 21st IEEE Symposium on Mass Storage Systems and Technologies (MSST).]]Google Scholar
- Ziv, J. and Lempel, A. 1977. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theor. 23, 3, 337--343.]]Google ScholarDigital Library
Index Terms
- Improving duplicate elimination in storage systems
Recommendations
Duplicate record elimination in large data files
The issue of duplicate elimination for large data files in which many occurrences of the same record may appear is addressed. A comprehensive cost analysis of the duplicate elimination operation is presented. This analysis is based on a combinatorial ...
Breaking Down Memory Walls in LSM-based Storage Systems
SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of DataThe log-structured merge-tree (LSM-tree) [16, 18] is widely used in modern NoSQL systems. Different from traditional update-in-place structures, an LSM-tree first buffers all writes in memory, which are subsequently flushed to disk when memory is full. ...
Analysis of striping techniques in robotic storage libraries
MSS '95: Proceedings of the 14th IEEE Symposium on Mass Storage SystemsIn recent years advances in computational speed have been the main focus of research and development in high performance computing. In comparison, the improvement in I/O performance has been modest. Faster processing speeds have created a need for ...
Comments