ABSTRACT
The effectiveness and tradeoffs of deduplication technologies are not well understood: vendors tout deduplication as a "silver bullet" that can help any enterprise optimize its deployed storage capacity. This paper provides a comprehensive taxonomy and an experimental evaluation of deduplication techniques using real-world data. While the day-to-day rate of change of data has the greatest influence on the duplication found in backup data, we investigate the duplication inherent in the data itself, independent of the rate of change, the backup schedule, or the backup algorithm used. Our experimental results show that across deduplication techniques, space savings vary by about 30%, CPU usage differs by almost a factor of six, and the time to reconstruct a deduplicated file can vary by more than a factor of 15.
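To make the tradeoff between techniques concrete, the sketch below contrasts whole-file deduplication with fixed-size block-level deduplication on toy data. It is an illustrative simplification, not the paper's evaluation methodology: real systems typically use content-defined chunking (e.g. Rabin fingerprints) rather than fixed offsets, and the `dedup_savings` helper and its sample inputs are hypothetical.

```python
import hashlib

def dedup_savings(files, chunk_size=None):
    """Fraction of bytes saved by hash-based deduplication.

    chunk_size=None deduplicates whole files; otherwise data is split
    into fixed-size chunks (a simplification of real deduplication,
    which often uses content-defined chunk boundaries).
    """
    seen = set()       # hashes of units already stored
    total = 0          # logical bytes before deduplication
    stored = 0         # physical bytes after deduplication
    for data in files:
        units = [data] if chunk_size is None else [
            data[i:i + chunk_size] for i in range(0, len(data), chunk_size)
        ]
        for u in units:
            total += len(u)
            h = hashlib.sha256(u).digest()
            if h not in seen:
                seen.add(h)
                stored += len(u)
    return 1 - stored / total

files = [b"A" * 4096 + b"B" * 4096,           # two distinct 4 KiB blocks
         b"A" * 4096 + b"C" * 4096]           # shares only the first block
print(dedup_savings(files))                   # whole-file: files differ -> 0.0
print(dedup_savings(files, chunk_size=4096))  # block-level: one shared block -> 0.25
```

The example shows why finer-grained techniques can find duplication that whole-file hashing misses, at the cost of more hashing work and metadata — the kind of space/CPU tradeoff the evaluation quantifies.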