skip to main content
article

Improving duplicate elimination in storage systems

Published:01 November 2006Publication History
Skip Abstract Section

Abstract

Minimizing the amount of data that must be stored and managed is a key goal for any storage architecture that purports to be scalable. One way to achieve this goal is to avoid maintaining duplicate copies of the same data. Eliminating redundant data at the source by not writing data which has already been stored not only reduces storage overheads, but can also improve bandwidth utilization. For these reasons, in the face of today's exponentially growing data volumes, redundant data elimination techniques have assumed critical significance in the design of modern storage systems.Intelligent object partitioning techniques identify data that is new when objects are updated, and transfer only these chunks to a storage server. In this article, we propose a new object partitioning technique, called fingerdiff, that improves upon existing schemes in several important respects. Most notably, fingerdiff dynamically chooses a partitioning strategy for a data object based on its similarities with previously stored objects in order to improve storage and bandwidth utilization. We present a detailed evaluation of fingerdiff, and other existing object partitioning schemes, using a set of real-world workloads. We show that for these workloads, the duplicate elimination strategies employed by fingerdiff improve storage utilization on average by 25%, and bandwidth utilization on average by 40% over comparable techniques.

References

  1. Ajtai, M., Burns, R., Fagin, R., Long, D., and Stockmeyer, L. 2000. Compactly encoding unstructured input with differential compression. J. ACM. 49, 3, 318--367.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Berlekamp, E. R. 1968. Algebraic Coding Theory. McGraw-Hill, New York.]]Google ScholarGoogle Scholar
  3. Blomer, J., Kalfane, M., Karp, R., Karpinski, M., Luby, M., and Zuckerman, D. 1995. An xor-based erasure-resilient coding scheme. Tech. Rep., International Computer Science Institute, Berkeley, California.]]Google ScholarGoogle Scholar
  4. Broder, A. 1997. On the resemblance and containment of documents. In Proceedings of the Compression and Complexity of Sequences Conference. IEEE Computer Society, 21.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Broder, A., Glassman, S., Manasse, M., and Zweig, G. 1997. Syntactic clustering of the web. In Proceedings of the 6th International WWW Conference. 391--404.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Broder, A. Z. 2000. Identifying and filtering near-duplicate documents. In Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching. Springer Verlag, 1--10.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Cederqvist, P. 1992. Version management with cvs. http://www.cvshome.org/docs/manual/.]]Google ScholarGoogle Scholar
  8. Cox, L., Murray, C., and Noble, B. 2002. Pastiche: Making backup cheap and easy. In Proceedings of the 5th USENIX Symposium on Operating Systems Design and Implementation. Boston.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Douglis, F. and Iyengar, A. 2003. Application-Specific deltaencoding via resemblance detection. In Usenix Annual Technical Conference. 59--72.]]Google ScholarGoogle Scholar
  10. Douglis, P. K. F., LaVoie, J., and Tracey, J. M. 2004. Redundancy elimination within large collections of files. In Usenix Annual Technical Conference. 59--72.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Goldberg, A. V. and Yianilos, P. N. 1998. Towards an archival intermemory. In Proceedings of the Advances in Digital Libraries Conference.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Hong, B., Plantenberg, D., Long, D. D. E., and Sivan-Zimet, M. 2004. Duplicate data elimination in a san file system. In Proceedings of the 21st IEEE/12th NASA Goddard Conference on Mass Storage Systems and Technologies (MSST). 301--314.]]Google ScholarGoogle Scholar
  13. Hunt, J. J., Vo, K.-P., and Tichy, W. F. 1998. Delta algorithms an empirical analysis. ACM Trans. Softw. Eng. Methodol. 7, 2, 192--214.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Jain, N., Dahlin, M., and Tewari, R. 2005. Taper: Tiered approach for eliminating redundancy in replica sychronization. In Proceedings of the 4th Usenix Conference on File and Storage Technologies (FAST).]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Kubiatowicz, J., Bindel, D., Chen, Y., Czerwinski, S., Eaton, P., Geels, D., Gummadi, R., Rhea, S., Weatherspoon, H., Weimer, W., Wells, C., and Zhao, B. 2000. Oceanstore: An architecture for global store persistent storage. In Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). Cambridge, MA.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Lelewer, D. A. and Hirschberg, D. S. 1987. Data compression. ACM Comput., Surv. 3, 261--296.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Lv, Q., Cao, P., Cohen, E., Li, K., and Shenker, S. 2002. Search and replication in unstructured peer-to-peer networks. In Proceedings of the 16th International Conference on Supercomputing. ACM, New York, 84--95.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Manber, U. 1994. Finding similar files in a large file system. In Usenix Winter Conference. 1--10.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Muthitacharoen, A., Chen, B., and Mazieres, D. 2001. A low-bandwidth network file system. In Symposium on Operating Systems Principles. 174--187.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. National Institute of Standards and Technology FIPS PUB 180-1. 1995. Secure hash standard.]]Google ScholarGoogle Scholar
  21. Ouyang, Z., Memon, N., Suel, T., and Trendafilov, D. 2006. Cluster-Based delta compression of a collection of files. In International Conference on Web Information Systems Engineering (WISE).]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Policroniades, C. and Pratt, I. 2004. Alternatives for detecting redundancy in storage systems data. In Usenix Annual Technical Conference. 73--86.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Quinlan, S. and Dorwards, S. 2002. Venti: A new approach to archival storage. In Usenix Conference on File and Storage Technologies.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Rabin, M. 1981. Fingerprinting by Random Polynomials. Tech. Rep. TR-15-81, Center for Research in Computing Technology, Harvard University.]]Google ScholarGoogle Scholar
  25. Rochkind, M. J. 1975. The source code control system. IEEE Trans. Softw. Eng. 1, 4, 364--370.]]Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Shivakumar, N. and García-Molina, H. 1995. SCAM: A copy detection mechanism for digital documents. In Proceedings of the 2nd Annual Conference on the Theory and Practice of Digital Libraries.]]Google ScholarGoogle Scholar
  27. Tichy, W. F. 1984. String to string correction problem with block moves. ACM Trans. Softw. Eng. 2, 4 (Dec.), 364--370.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Tichy, W. F. 1985. RCS---A system for version control. Softw.---Pract. Exper. 15, 7, 637--654.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. W. J. Bolosky, S. Corbin, D. G. and Douceur, J. R. Single instance storage in windows 2000. In Usenix Annual Technical Conference.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Weatherspoon, H. and Kubiatowicz, J. 2002. Erasure coding vs. replication: A quantitative comparison. In 1st International Workshop on Peer-to-Peer Systems. Cambridge, MA.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. You, L. L. and Karamanolis, C. 2004. Evaluation of efficient archival storage techniques. In Proceedings of the 21st IEEE Symposium on Mass Storage Systems and Technologies (MSST).]]Google ScholarGoogle Scholar
  32. Ziv, J. and Lempel, A. 1977. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theor. 23, 3, 337--343.]]Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Improving duplicate elimination in storage systems

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in

      Full Access

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader