Reducing fragmentation impact with forward knowledge in backup systems with deduplication

Authors:
Michal Kaczmarczyk (9LivesData, LLC), Cezary Dubnicki (9LivesData, LLC)

SYSTOR '15: Proceedings of the 8th ACM International Systems and Storage Conference, May 2015, Article No. 17, Pages 1–12
https://doi.org/10.1145/2757667.2757678

Published: 26 May 2015

ABSTRACT

Deduplication of backups is very effective in saving storage, but may also cause significant restore slowdown. This problem is caused by data fragmentation, where logically continuous but duplicate data is not placed sequentially on the disk. Two types of fragmentation introduce a high restore penalty: inter-version fragmentation, caused by duplicates present in multiple versions of the same backup, and internal fragmentation, caused by duplicates present in a single backup stream.

This paper introduces the Limited Forward Knowledge (LFK) cache, which reduces the internal fragmentation problem. The cache evicts blocks based on the limited forward knowledge available. Because keeping full knowledge requires memory proportional to the size of a backup, we limit the forward knowledge to an 8GB window and show that this limitation does not significantly impact performance. To further increase LFK effectiveness in the presence of inter-version fragmentation, we combined this algorithm with an already known solution called Context-Based Rewriting, or CBR (Kaczmarczyk et al. 2012).
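The abstract does not spell out how the cache is built, so the following is only a minimal sketch of the general idea: eviction driven by a bounded look-ahead, in the spirit of Belady's replacement policy from the reference list. The function name `simulate_restore`, the one-block-per-read cost model, and the window measured in stream positions (rather than the 8GB the paper uses) are all assumptions made for illustration, not the authors' implementation.

```python
from collections import defaultdict, deque


def simulate_restore(stream, cache_size, window):
    """Count simulated disk reads while restoring `stream` (a list of block ids).

    Hypothetical model: every cache miss costs one disk read, the cache holds
    at most `cache_size` blocks, and only the next `window` stream positions
    are known in advance (the "limited forward knowledge").
    """
    if window < 1:
        raise ValueError("window must be at least 1")

    # upcoming[b] lists, in increasing order, the positions of block b
    # that currently lie inside the look-ahead window.
    upcoming = defaultdict(deque)
    for pos in range(min(window, len(stream))):
        upcoming[stream[pos]].append(pos)

    def next_use(block):
        # Position of the next known use, or infinity if none is in the window.
        known = upcoming[block]
        return known[0] if known else float("inf")

    cache = {}  # dict used as an insertion-ordered set of cached blocks
    disk_reads = 0
    for pos, block in enumerate(stream):
        # Slide the window: the current position leaves it ...
        upcoming[block].popleft()
        # ... and one new position enters at the far end.
        tail = pos + window
        if tail < len(stream):
            upcoming[stream[tail]].append(tail)

        if block in cache:
            continue  # cache hit, no disk access needed
        disk_reads += 1
        cache[block] = None
        if len(cache) > cache_size:
            # Evict the block whose next known use is farthest away; blocks
            # with no use inside the window count as infinitely far.
            victim = max(cache, key=next_use)
            del cache[victim]
    return disk_reads


# Three blocks repeated twice, room for two blocks, whole stream visible.
print(simulate_restore(list("ABCABC"), cache_size=2, window=6))  # prints 4
```

In this toy stream a least-recently-used cache of the same size would miss on all six accesses, while the look-ahead policy keeps A and B and misses only four times. A real system would additionally have to account for reads that fetch several blocks at once.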

Our evaluation with real-world traces shows that data fragmentation results in an average 42% slowdown for backups stored on a single disk. LFK alone reduces this drop to 21%. CBR+LFK eliminates it completely, so the restore speed equals that of reading non-duplicated data. In a multi-disk setup, the standard approach suffers from an 83% drop in restore performance. The combined algorithms reduce this drop to 35%, yielding a 4 times better restore bandwidth.
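The "4 times" figure can be checked from the reported drops. The worked arithmetic below assumes that every percentage is measured against the same baseline, namely the bandwidth of reading non-duplicated data; that reading is an interpretation of the abstract, not something it states explicitly.

```latex
\begin{align*}
\text{single disk:}\quad & 1 - 0.42 = 0.58 \ \text{(standard)}, \qquad 1 - 0.21 = 0.79 \ \text{(LFK)}, \qquad 1.00 \ \text{(CBR+LFK)} \\
\text{multi-disk:}\quad & \frac{1 - 0.35}{1 - 0.83} = \frac{0.65}{0.17} \approx 3.8
\end{align*}
```

Under that assumption the combined algorithms deliver roughly 3.8 times the restore bandwidth of the standard approach in the multi-disk setup, which rounds to the stated factor of 4.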

References

R. Amatruda. Worldwide Purpose-Built Backup Appliance 2012–2016 Forecast and 2011 Vendor Shares. International Data Corporation, April 2012. URL http://www.emc.com/collateral/analyst-reports/idc-worldwide-purpose-built-backup-appliance.pdf.
L. Aronovich, R. Asher, E. Bachmat, H. Bitner, M. Hirsch, and S. T. Klein. The design of a similarity based deduplication system. In Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference, SYSTOR '09, pages 6:1–6:14, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-623-6. URL http://doi.acm.org/10.1145/1534530.1534539.
T. Asaro and H. Biggar. Data De-duplication and Disk-to-Disk Backup Systems: Technical and Business Considerations. Enterprise Strategy Group, July 2007.
B. Babineau and D. A. Chapa. Deduplication's Business Imperatives. Enterprise Strategy Group, December 2010. Sponsored by EMC Corporation.
L. A. Belady. A study of replacement algorithms for a virtual-storage computer. IBM Syst. J., 5(2): 78–101, June 1966. ISSN 0018-8670. URL http://dx.doi.org/10.1147/sj.52.0078.
C. Dubnicki, L. Gryz, L. Heldt, M. Kaczmarczyk, W. Kilian, P. Strzelczak, J. Szczepkowski, C. Ungureanu, and M. Welnicki. Hydrastor: A scalable secondary storage. In Proceedings of the 7th Conference on File and Storage Technologies, FAST '09, pages 197–210, Berkeley, CA, USA, 2009. USENIX Association. URL http://dl.acm.org/citation.cfm?id=1525908.1525923.
EMC. DataDomain - Deduplication Storage for Backup, Archiving and Disaster Recovery. URL http://www.datadomain.com.
ExaGrid. Exagrid. URL http://www.exagrid.com.
M. Fu, D. Feng, Y. Hua, X. He, Z. Chen, W. Xia, F. Huang, and Q. Liu. Accelerating restore and garbage collection in deduplication-based backup systems via exploiting historical information. In Proceedings of the 2014 USENIX Conference on USENIX Annual Technical Conference, USENIX ATC'14, pages 181–192, Berkeley, CA, USA, 2014. USENIX Association. ISBN 978-1-931971-10-2. URL http://dl.acm.org/citation.cfm?id=2643634.2643653.
M. Fu, D. Feng, Y. Hua, X. He, Z. Chen, W. Xia, Y. Zhang, and Y. Tan. Design tradeoffs for data deduplication performance in backup workloads. In 13th USENIX Conference on File and Storage Technologies (FAST 15), pages 331–344, Santa Clara, CA, Feb. 2015. USENIX Association. ISBN 978-1-931971-201. URL https://www.usenix.org/conference/fast15/technical-sessions/presentation/fu.
G. L. Heileman and W. Luo. How caching affects hashing. In C. Demetrescu, R. Sedgewick, and R. Tamassia, editors, ALENEX/ANALCO, pages 141–154. SIAM, 2005. ISBN 0-89871-596-2. URL http://www.siam.org/meetings/alenex05/papers/13gheileman.pdf.
HP. HP StoreOnce Backup. URL http://www8.hp.com/us/en/products/data-storage/storage-backup-archive.html.
IBM. IBM ProtecTIER Deduplication Solution. URL http://www-03.ibm.com/systems/storage/tape/ts7650g/.
M. Kaczmarczyk. Fragmentation in storage systems with duplicate elimination. PhD thesis, University of Warsaw, Poland, 2015. To be published in June 2015.
M. Kaczmarczyk, M. Barczynski, W. Kilian, and C. Dubnicki. Reducing impact of data fragmentation caused by in-line deduplication. In Proceedings of the 5th Annual International Systems and Storage Conference, SYSTOR '12, pages 15:1–15:12, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1448-0. URL http://doi.acm.org/10.1145/2367589.2367600.
M. Lillibridge, K. Eshghi, and D. Bhagwat. Improving restore speed for backup systems that use inline chunk-based deduplication. In Proceedings of the 11th USENIX Conference on File and Storage Technologies, FAST'13, pages 183–198, Berkeley, CA, USA, 2013. USENIX Association. URL http://dl.acm.org/citation.cfm?id=2591272.2591292.
J. Livens. Deduplication and restore performance. Wikibon.org, January 2009a. URL http://wikibon.org/wiki/v/Deduplication_and_restore_performance.
J. Livens. Defragmentation, rehydration and deduplication. AboutRestore.com, June 2009b. URL http://www.aboutrestore.com/2009/06/24/defragmentation-rehydration-and-deduplication/.
A. Muthitacharoen, B. Chen, and D. Mazières. A low-bandwidth network file system. In Proceedings of the 18th ACM Symposium on Operating Systems Principles, SOSP '01, pages 174–187, New York, NY, USA, 2001. ACM. ISBN 1-58113-389-8. URL http://pdos.csail.mit.edu/papers/lbfs:sosp01/lbfs.pdf.
Y. Nam, G. Lu, N. Park, W. Xiao, and D. H. C. Du. Chunk fragmentation level: An effective indicator for read performance degradation in deduplication storage. In Proceedings of the 2011 IEEE International Conference on High Performance Computing and Communications, HPCC '11, pages 581–586, Washington, DC, USA, 2011. IEEE Computer Society. ISBN 978-0-7695-4538-7. URL http://dx.doi.org/10.1109/HPCC.2011.82.
Y. J. Nam, D. Park, and D. H. C. Du. Assuring demanded read performance of data deduplication storage with backup datasets. In Proceedings of the 2012 IEEE 20th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, MASCOTS '12, pages 201–208, Washington, DC, USA, 2012. IEEE Computer Society. ISBN 978-0-7695-4793-0. URL http://dx.doi.org/10.1109/MASCOTS.2012.32.
NEC. HYDRAstor Grid Storage System. URL http://www.hydrastor.com.
NEC HS8-4000. NEC HYDRAstor HS8-4000 Specification, 2013. URL http://www.necam.com/HYDRAstor/doc.cfm?t=HS8-4000.
W. C. Preston. Target deduplication appliance performance comparison. BackupCentral.com, October 2010a. URL http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/348-target-deduplication-appliance-performance-comparison.html.
W. C. Preston. Restoring deduped data in deduplication systems. SearchDataBackup.com, April 2010b. URL http://searchdatabackup.techtarget.com/feature/Restoring-deduped-data-in-deduplication-systems.
Quantum. DXi Deduplication Solution. URL http://www.quantum.com/products/disk-basedbackup/index.aspxm.
S. Quinlan and S. Dorward. Venti: A new approach to archival storage. In Proceedings of the 1st USENIX Conference on File and Storage Technologies, FAST'02, pages 7–7, Berkeley, CA, USA, 2002. USENIX Association. URL http://dl.acm.org/citation.cfm?id=1973333.1973340.
M. Rabin. Fingerprinting by random polynomials. Technical report, Center for Research in Computing Technology, Harvard University, New York, NY, USA, 1981. URL http://www.xmailserver.org/rabin.pdf.
B. Romanski, L. Heldt, W. Kilian, K. Lichota, and C. Dubnicki. Anchor-driven subchunk deduplication. In Proceedings of the 4th Annual International Conference on Systems and Storage, SYSTOR '11, pages 16:1–16:13, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0773-4. URL http://doi.acm.org/10.1145/1987816.1987837.
Seagate. Common enterprise disk specification (based on Seagate Constellation ES.3 4TB, model 2012). URL http://www.seagate.com/www-content/product-content/constellation-fam/constellation-es/constellation-es-3/en-us/docs/constellation-es-3-data-sheet-ds1769-1-1210us.pdf.
Symantec. NetBackup Appliances. URL http://www.symantec.com/backup-appliance.
G. Wallace, F. Douglis, H. Qian, P. Shilane, S. Smaldone, M. Chamness, and W. Hsu. Characteristics of backup workloads in production systems. In Proceedings of the 10th USENIX conference on File and Storage Technologies, FAST'12, pages 4–4, Berkeley, CA, USA, 2012. USENIX Association. URL http://dl.acm.org/citation.cfm?id=2208461.2208465.
H. Weatherspoon and J. Kubiatowicz. Erasure coding vs. replication: A quantitative comparison. In IPTPS '01: Revised Papers from the First International Workshop on Peer-to-Peer Systems, pages 328–338, London, UK, 2002. ISBN 3-540-44179-4. URL http://www.cs.rice.edu/Conferences/IPTPS02/170.pdf.
L. Whitehouse. Restoring deduped data. searchdatabackup.techtarget.com, August 2008. URL http://searchdatabackup.techtarget.com/tip/Restoring-deduped-data.
L. Whitehouse, B. Lundell, J. McKnight, and J. Gahm. 2010 Data Protection Trends. Enterprise Strategy Group, April 2010.
B. Zhu, K. Li, and H. Patterson. Avoiding the disk bottleneck in the data domain deduplication file system. In Proceedings of the 6th USENIX Conference on File and Storage Technologies, FAST'08, pages 18:1–18:14, Berkeley, CA, USA, 2008. USENIX Association. URL http://dl.acm.org/citation.cfm?id=1364813.1364831.

Index Terms

Reducing fragmentation impact with forward knowledge in backup systems with deduplication
1. Information systems
  1. Information storage systems
    1. Storage replication
      1. Storage recovery strategies
2. Software and its engineering
  1. Software creation and management
    1. Software post-development issues
      1. Backup procedures

Published in

SYSTOR '15: Proceedings of the 8th ACM International Systems and Storage Conference
May 2015
183 pages
ISBN: 9781450336079
DOI: 10.1145/2757667

Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
backup
deduplication
fragmentation
restore
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 94 of 285 submissions, 33%