ABSTRACT
The effectiveness and tradeoffs of deduplication technologies are not well understood: vendors tout deduplication as a "silver bullet" that can help any enterprise optimize its deployed storage capacity. This paper provides a comprehensive taxonomy and an experimental evaluation of deduplication techniques using real-world data. While the day-to-day rate of change of data has the greatest influence on the duplication found in backup data, we investigate the duplication inherent in the data itself, independent of the rate of change, the backup schedule, or the backup algorithm used. Our experimental results show that across deduplication techniques, space savings vary by about 30%, CPU usage differs by almost a factor of six, and the time to reconstruct a deduplicated file can vary by more than a factor of 15.
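To make the tradeoff between techniques concrete, the sketch below contrasts whole-file deduplication with fixed-size block-level deduplication on toy data. It is an illustrative simplification, not the paper's evaluation methodology: real systems typically use content-defined chunking (e.g. Rabin fingerprints) rather than fixed offsets, and the `dedup_savings` helper and its sample inputs are hypothetical.

```python
import hashlib

def dedup_savings(files, chunk_size=None):
    """Fraction of bytes saved by hash-based deduplication.

    chunk_size=None deduplicates whole files; otherwise data is split
    into fixed-size chunks (a simplification of real deduplication,
    which often uses content-defined chunk boundaries).
    """
    seen = set()       # hashes of units already stored
    total = 0          # logical bytes before deduplication
    stored = 0         # physical bytes after deduplication
    for data in files:
        units = [data] if chunk_size is None else [
            data[i:i + chunk_size] for i in range(0, len(data), chunk_size)
        ]
        for u in units:
            total += len(u)
            h = hashlib.sha256(u).digest()
            if h not in seen:
                seen.add(h)
                stored += len(u)
    return 1 - stored / total

files = [b"A" * 4096 + b"B" * 4096,           # two distinct 4 KiB blocks
         b"A" * 4096 + b"C" * 4096]           # shares only the first block
print(dedup_savings(files))                   # whole-file: files differ -> 0.0
print(dedup_savings(files, chunk_size=4096))  # block-level: one shared block -> 0.25
```

The example shows why finer-grained techniques can find duplication that whole-file hashing misses, at the cost of more hashing work and metadata — the kind of space/CPU tradeoff the evaluation quantifies.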