research-article

Reducing impact of data fragmentation caused by in-line deduplication

Authors:
Michal Kaczmarczyk

9LivesData, LLC

9LivesData, LLC
View Profile

,
Marcin Barczynski

9LivesData, LLC

9LivesData, LLC
View Profile

,
Wojciech Kilian

9LivesData, LLC

9LivesData, LLC
View Profile

,
Cezary Dubnicki

9LivesData, LLC

9LivesData, LLC
View Profile

SYSTOR '12: Proceedings of the 5th Annual International Systems and Storage ConferenceJune 2012Article No.: 15Pages 1–12https://doi.org/10.1145/2367589.2367600

Published:04 June 2012Publication History

SYSTOR '12: Proceedings of the 5th Annual International Systems and Storage Conference

Pages 1–12

ABSTRACT

Deduplication results inevitably in data fragmentation, because logically continuous data is scattered across many disk locations. In this work we focus on fragmentation caused by duplicates from previous backups of the same backup set, since such duplicates are very common due to repeated full backups containing a lot of unchanged data. For systems with in-line dedup which detects duplicates during writing and avoids storing them, such fragmentation causes data from the latest backup being scattered across older backups. As a result, the time of restore from the latest backup can be significantly increased, sometimes more than doubled.

We propose an algorithm called context-based rewriting (CBR in short) minimizing this drop in restore performance for latest backups by shifting fragmentation to older backups, which are rarely used for restore. By selectively rewriting a small percentage of duplicates during backup, we can reduce the drop in restore bandwidth from 12--55% to only 4--7%, as shown by experiments driven by a set of backup traces. All of this is achieved with only small increase in writing time, between 1% and 5%. Since we rewrite only few duplicates and old copies of rewritten data are removed in the background, the whole process introduces small and temporary space overhead.

References

L. Aronovich, R. Asher, E. Bachmat, H. Bitner, M. Hirsch, and S. T. Klein. The design of a similarity based dedupli-cation system. In Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference, SYSTOR '09, pages 6:1--6:14, New York, NY, USA, 2009. ACM. Google ScholarDigital Library
T. Asaro and H. Biggar. Data De-duplication and Disk-to-Disk Backup Systems: Technical and Business Considerations, 2007. The Enterprise Strategy Group.Google Scholar
D. Bhagwat, K. Eshghi, D. D. E. Long, and M. Lillibridge. Extreme binning: Scalable, parallel deduplication for chunk-based file backup. In Proceedings of the 17th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS 2009), Sep 2009.Google ScholarCross Ref
H. Biggar. Experiencing Data De-Duplication: Improving Efficiency and Reducing Capacity Requirements, 2007. The Enterprise Strategy Group.Google Scholar
A. Z. Broder. Some aplications of rabin's fingerprinting method. In Sequences II: Methods in Communications, Security, and Computer Science, pages 143--152. Springer-Verlag, 1993.Google ScholarCross Ref
L. P. Cox, C. D. Murray, and B. D. Noble. Pastiche: making backup cheap and easy. In OSDI '02: Proceedings of the 5th symposium on Operating systems design and implementation, pages 285--298, New York, NY, USA, 2002. ACM. Google ScholarDigital Library
B. Debnath, S. Sengupta, and J. Li. Chunkstash: Speeding up inline storage deduplication using flash memory. In 2010 USENIX Annual Technical Conference, June 2010. Google ScholarDigital Library
W. Dong, F. Douglis, K. Li, H. Patterson, S. Reddy, and P. Shilane. Tradeoffs in scalable data routing for deduplication clusters. In Proceedings of the 9th USENIX conference on File and Storage Technologies, FAST'11, pages 15--29, Berkeley, CA, USA, 2011. USENIX Association. Google ScholarDigital Library
C. Dubnicki, L. Gryz, L. Heldt, M. Kaczmarczyk, W. Kilian, P. Strzelczak, J. Szczepkowski, C. Ungureanu, and M. Welnicki. HYDRAstor: a Scalable Secondary Storage. In FAST'09: Proceedings of the 7th USENIX Conference on File and Storage Technologies, pages 197--210, Berkeley, CA, USA, 2009. USENIX Association. Google ScholarDigital Library
C. Dubnicki, C. Ungureanu, and W. Kilian. FPN: A distributed hash table for commercial applications. In Proceedings of the 13th IEEE International Symposium on High Performance Distributed Computing, pages 120--128, Washington, DC, USA, 2004. IEEE Computer Society. Google ScholarDigital Library
D. E. Eastlake and P. E. Jones. US Secure Hash Algorithm 1 (SHA1). RFC 3174 (Informational), September 2001. Google ScholarDigital Library
EMC Avamar: Backup and recovery with global deduplication, 2008. http://www.emc.com/avamar.Google Scholar
EMC Centera: Content addressed storage system, January 2008. http://www.emc.com/centera.Google Scholar
EMC Corporation: Data Domain Global Deduplication Array, 2011. http://www.datadomain.com/products/global-deduplication-array.html.Google Scholar
EMC Corporation: DataDomain - Deduplication Storage for Backup, Archiving and Disaster Recovery, 2011. http://www.datadomain.com.Google Scholar
Exagrid. http://www.exagrid.com.Google Scholar
D. Floyer. Wikibon Data De-duplication Performance Tables. Wikibon.org, May 2011. http://wikibon.org/wiki/v/Wikibon_Data_De-duplication_Performance_Tables.Google Scholar
R. Koller and R. Rangaswami. I/O Deduplication: Utilizing content similarity to improve I/O performance. volume 6, pages 13:1--13:26, New York, NY, USA, September 2010. ACM. Google ScholarDigital Library
E. Kruus, C. Ungureanu, and C. Dubnicki. Bimodal content defined chunking for backup streams. In Proceedings of the 8th USENIX conference on File and Storage Technologies, FAST'10, pages 239--252, Berkeley, CA, USA, 2010. USENIX Association. Google ScholarDigital Library
M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar, G. Trezis, and P. Camble. Sparse indexing: Large scale, inline deduplication using sampling and locality. In FAST'09: Proceedings of the 7th USENIX Conference on File and Storage Technologies, pages 111--123, 2009. Google ScholarDigital Library
J. Livens. Deduplication and restore performance. Wikibon.org, January 2009. http://wikibon.org/wiki/v/Deduplication_and_restore_performance.Google Scholar
J. Livens. Defragmentation, rehydration and deduplication. AboutRestore.com, June 2009. http://www.aboutrestore.com/2009/06/24/defragmentation-rehydration-and-deduplication/.Google Scholar
D. Meister and A. Brinkmann. dedupv1: Improving Deduplication Throughput using Solid State Drives (SSD). In Proceedings of the 26th IEEE Symposium on Massive Storage Systems and Technologies (MSST), May 2010. Google ScholarDigital Library
A. Muthitacharoen, B. Chen, and D. Mazires. A low-bandwidth network file system. In In Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP '01, pages 174--187, New York, NY, USA, 2001. ACM. Google ScholarDigital Library
P. Nath, B. Urgaonkar, and A. Sivasubramaniam. Evaluating the usefulness of content addressable storage for high-performance data intensive applications. In HPDC '08: Proceedings of the 17th international symposium on High performance distributed computing, pages 35--44, New York, NY, USA, 2008. ACM. Google ScholarDigital Library
NEC Corporation. HYDRAstor Grid Storage System, 2008. http://www.hydrastor.com.Google Scholar
W. C. Preston. The Rehydration Myth. BackupCentral.com, June 2009. http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/247-rehydration-myth.html/.Google Scholar
W. C. Preston. Restoring deduped data in deduplication systems. SearchDataBackup.com, April 2010. http://searchdatabackup.techtarget.com/feature/Restoring-deduped-data-in-deduplication-systems.Google Scholar
W. C. Preston. Solving common data deduplication system problems. SearchDataBackup.com, November 2010. http://searchdatabackup.techtarget.com/feature/Solving-common-data-deduplication-system-problems.Google Scholar
W. C. Preston. Target deduplication appliance performance comparison. BackupCentral.com, October 2010. http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/348-target-deduplication-appliance-performance-comparison.html.Google Scholar
Quantum Corporation: DXi Deduplication Solution, 2011. http://www.quantum.com.Google Scholar
S. Quinlan and S. Dorward. Venti: A new approach to archival storage. In FAST'02: Proceedings of the Conference on File and Storage Technologies, pages 89--101, Berkeley, CA, USA, 2002. USENIX Association. Google ScholarDigital Library
M. Rabin. Fingerprinting by random polynomials. Technical report, Center for Research in Computing Technology, Harvard University, New York, NY, USA, 1981.Google Scholar
B. Romanski, L. Heldt, W. Kilian, K. Lichota, and C. Dubnicki. Anchor-driven subchunk deduplication. In Proceedings of the 4th Annual International Conference on Systems and Storage, SYSTOR '11, pages 16:1--16:13, New York, NY, USA, 2011. ACM. Google ScholarDigital Library
SEPATON Scalable Data Deduplication Solutions. http://sepaton.com/solutions/data-deduplication.Google Scholar
L. Whitehouse. Restoring deduped data. searchdatabackup.techtarget.com, August 2008. http://searchdatabackup.techtarget.com/tip/Restoring-deduped-data.Google Scholar
B. Zhu, K. Li, and H. Patterson. Avoiding the disk bottleneck in the Data Domain deduplication file system. In FAST'08: Proceedings of the 6th USENIX Conference on File and Storage Technologies, pages 1--14, Berkeley, CA, USA, 2008. USENIX Association. Google ScholarDigital Library

Index Terms

Reducing impact of data fragmentation caused by in-line deduplication
1. Information systems
  1. Information retrieval
    1. Document representation
    2. Search engine architectures and scalability
      1. Search engine indexing
  2. Information storage systems
    1. Storage replication
      1. Storage recovery strategies
2. Software and its engineering
  1. Software creation and management
    1. Software post-development issues
      1. Backup procedures

Recommendations

Reducing fragmentation impact with forward knowledge in backup systems with deduplication
SYSTOR '15: Proceedings of the 8th ACM International Systems and Storage Conference

Deduplication of backups is very effective in saving storage, but may also cause significant restore slowdown. This problem is caused by data fragmentation, where logically continuous but duplicate data is not placed sequentially on the disk. Two types ...
Read More
A study of practical deduplication

We collected file system content data from 857 desktop computers at Microsoft over a span of 4 weeks. We analyzed the data to determine the relative efficacy of data deduplication, particularly considering whole-file versus block-level elimination of ...
Read More
Anchor-driven subchunk deduplication
SYSTOR '11: Proceedings of the 4th Annual International Conference on Systems and Storage

Data deduplication, implemented usually with content defined chunking (CDC), is today one of key features of advanced storage systems providing space for backup applications. Although simple and effective, CDC generates chunks with sizes clustered ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SYSTOR '12: Proceedings of the 5th Annual International Systems and Storage Conference
June 2012
183 pages
ISBN:9781450314480
DOI:10.1145/2367589
General Chair:
Michael Vinov
IBM Haifa
,
Program Chairs:
Dan Tsafrir
Technion
,
Erez Zadok
Stony Brook University
Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 4 June 2012
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
CAS
backup
chunking
deduplication
fragmentation
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate94of285submissions,33%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 72
  Total Citations
  View Citations
- 625
  Total Downloads
- Downloads (Last 12 months)44
- Downloads (Last 6 weeks)8
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Reducing impact of data fragmentation caused by in-line deduplication

SYSTOR '12: Proceedings of the 5th Annual International Systems and Storage Conference

ABSTRACT

References

Cited By

Index Terms

Recommendations

Reducing fragmentation impact with forward knowledge in backup systems with deduplication

A study of practical deduplication

Anchor-driven subchunk deduplication