ABSTRACT
Deduplication of backups is very effective at saving storage, but it may also cause a significant restore slowdown. This problem is caused by data fragmentation, where logically continuous but duplicate data is not placed sequentially on the disk. Two types of fragmentation introduce a high restore penalty: inter-version fragmentation, caused by duplicates present in multiple versions of the same backup, and internal fragmentation, caused by duplicates present in a single backup stream.
This paper introduces the Limited Forward Knowledge (LFK) cache, which reduces the internal fragmentation problem. The cache evicts blocks based on the limited forward knowledge available. Because keeping full knowledge requires memory proportional to the size of a backup, we limit the forward knowledge to an 8GB window and show that this limitation does not significantly affect performance. To further increase the effectiveness of LFK in the presence of inter-version fragmentation, we combine this algorithm with a previously known solution called Context-Based Rewriting, or CBR (Kaczmarczyk et al. 2012).
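The eviction idea can be illustrated with a minimal sketch, assuming a block-granular cache and a lookahead window measured in blocks rather than bytes. The function name `simulate_lfk` and its parameters are hypothetical and are not taken from the paper; the sketch is a Belady-style policy (Belady 1966) restricted to a bounded window, not the authors' implementation.

```python
from collections import defaultdict, deque


def simulate_lfk(restore_order, cache_blocks, window_blocks):
    """Count cache misses for a Belady-style cache with bounded lookahead.

    restore_order: block identifiers in the order the restore needs them.
    cache_blocks:  cache capacity in blocks (assumed unit).
    window_blocks: how many upcoming positions are known (assumed unit, >= 1).
    """
    if window_blocks < 1:
        raise ValueError("window_blocks must be at least 1")

    # upcoming[b] lists the positions inside the current window that need block b.
    upcoming = defaultdict(deque)
    for pos in range(min(window_blocks, len(restore_order))):
        upcoming[restore_order[pos]].append(pos)

    def next_use(b):
        # Blocks with no known future use sort last, so they are evicted first.
        return upcoming[b][0] if upcoming[b] else float("inf")

    cache = set()
    misses = 0
    for pos, block in enumerate(restore_order):
        # The current position is served now, so it leaves the window.
        upcoming[block].popleft()
        # Slide the window forward by one position.
        new_pos = pos + window_blocks
        if new_pos < len(restore_order):
            upcoming[restore_order[new_pos]].append(new_pos)

        if block in cache:
            continue
        misses += 1  # the block has to be read from disk
        cache.add(block)
        if len(cache) > cache_blocks:
            # Evict the block whose next known use is farthest away.
            cache.remove(max(cache, key=next_use))
    return misses
```

For example, `simulate_lfk([1, 2, 3, 1, 2, 3], cache_blocks=2, window_blocks=6)` returns 4, whereas a least-recently-used cache of the same size would miss on all six reads of this sequence. Shrinking `window_blocks` trades memory for eviction quality, which is the trade-off the 8GB limit addresses.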
Our evaluation with real-world traces shows that data fragmentation results in an average 42% slowdown for backups stored on a single disk. LFK alone reduces this drop to 21%. CBR+LFK eliminates it completely, so the restore speed equals that of reading non-duplicated data. In a multi-disk setup, the standard approach suffers an 83% drop in restore performance. The combined algorithms reduce this drop to 35%, providing 4 times better restore bandwidth.
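The "4 times" figure can be checked with a short calculation, assuming each percentage drop is measured against the same non-fragmented restore bandwidth. The helper names below are hypothetical and only illustrate the arithmetic.

```python
def relative_bandwidth(drop_percent):
    """Fraction of the non-fragmented bandwidth left after a given drop."""
    return 1.0 - drop_percent / 100.0


def speedup(baseline_drop_percent, improved_drop_percent):
    """Ratio of improved restore bandwidth to baseline restore bandwidth."""
    return relative_bandwidth(improved_drop_percent) / relative_bandwidth(
        baseline_drop_percent
    )


# Multi-disk setup: 83% drop for the standard approach, 35% for CBR+LFK.
multi_disk_gain = speedup(83, 35)  # 0.65 / 0.17
```

Under that assumption the multi-disk ratio is 0.65 / 0.17, or about 3.8, which matches the reported 4 times improvement after rounding.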
REFERENCES
- R. Amatruda. Worldwide Purpose-Built Backup Appliance 2012-2016 Forecast and 2011 Vendor Shares. International Data Corporation, April 2012. URL http://www.emc.com/collateral/analyst-reports/idc-worldwide-purpose-built-backup-appliance.pdf.
- L. Aronovich, R. Asher, E. Bachmat, H. Bitner, M. Hirsch, and S. T. Klein. The design of a similarity based deduplication system. In Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference, SYSTOR '09, pages 6:1-6:14, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-623-6. URL http://doi.acm.org/10.1145/1534530.1534539.
- T. Asaro and H. Biggar. Data De-duplication and Disk-to-Disk Backup Systems: Technical and Business Considerations. Enterprise Strategy Group, July 2007.
- B. Babineau and D. A. Chapa. Deduplication's Business Imperatives. Enterprise Strategy Group, December 2010. Sponsored by EMC Corporation.
- L. A. Belady. A study of replacement algorithms for a virtual-storage computer. IBM Syst. J., 5(2): 78-101, June 1966. ISSN 0018-8670. URL http://dx.doi.org/10.1147/sj.52.0078.
- C. Dubnicki, L. Gryz, L. Heldt, M. Kaczmarczyk, W. Kilian, P. Strzelczak, J. Szczepkowski, C. Ungureanu, and M. Welnicki. Hydrastor: A scalable secondary storage. In Proceedings of the 7th Conference on File and Storage Technologies, FAST '09, pages 197-210, Berkeley, CA, USA, 2009. USENIX Association. URL http://dl.acm.org/citation.cfm?id=1525908.1525923.
- EMC. DataDomain - Deduplication Storage for Backup, Archiving and Disaster Recovery. URL http://www.datadomain.com.
- ExaGrid. Exagrid. URL http://www.exagrid.com.
- M. Fu, D. Feng, Y. Hua, X. He, Z. Chen, W. Xia, F. Huang, and Q. Liu. Accelerating restore and garbage collection in deduplication-based backup systems via exploiting historical information. In Proceedings of the 2014 USENIX Conference on USENIX Annual Technical Conference, USENIX ATC'14, pages 181-192, Berkeley, CA, USA, 2014. USENIX Association. ISBN 978-1-931971-10-2. URL http://dl.acm.org/citation.cfm?id=2643634.2643653.
- M. Fu, D. Feng, Y. Hua, X. He, Z. Chen, W. Xia, Y. Zhang, and Y. Tan. Design tradeoffs for data deduplication performance in backup workloads. In 13th USENIX Conference on File and Storage Technologies (FAST 15), pages 331-344, Santa Clara, CA, Feb. 2015. USENIX Association. ISBN 978-1-931971-201. URL https://www.usenix.org/conference/fast15/technical-sessions/presentation/fu.
- G. L. Heileman and W. Luo. How caching affects hashing. In C. Demetrescu, R. Sedgewick, and R. Tamassia, editors, ALENEX/ANALCO, pages 141-154. SIAM, 2005. ISBN 0-89871-596-2. URL http://www.siam.org/meetings/alenex05/papers/13gheileman.pdf.
- HP. HP StoreOnce Backup. URL http://www8.hp.com/us/en/products/data-storage/storage-backup-archive.html.
- IBM. IBM ProtecTIER Deduplication Solution. URL http://www-03.ibm.com/systems/storage/tape/ts7650g/.
- M. Kaczmarczyk. Fragmentation in storage systems with duplicate elimination. PhD thesis, University of Warsaw, Poland, 2015. To be published in June 2015.
- M. Kaczmarczyk, M. Barczynski, W. Kilian, and C. Dubnicki. Reducing impact of data fragmentation caused by in-line deduplication. In Proceedings of the 5th Annual International Systems and Storage Conference, SYSTOR '12, pages 15:1-15:12, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1448-0. URL http://doi.acm.org/10.1145/2367589.2367600.
- M. Lillibridge, K. Eshghi, and D. Bhagwat. Improving restore speed for backup systems that use inline chunk-based deduplication. In Proceedings of the 11th USENIX Conference on File and Storage Technologies, FAST'13, pages 183-198, Berkeley, CA, USA, 2013. USENIX Association. URL http://dl.acm.org/citation.cfm?id=2591272.2591292.
- J. Livens. Deduplication and restore performance. Wikibon.org, January 2009a. URL http://wikibon.org/wiki/v/Deduplication_and_restore_performance.
- J. Livens. Defragmentation, rehydration and deduplication. AboutRestore.com, June 2009b. URL http://www.aboutrestore.com/2009/06/24/defragmentation-rehydration-and-deduplication/.
- A. Muthitacharoen, B. Chen, and D. Mazières. A low-bandwidth network file system. In Proceedings of the 18th ACM Symposium on Operating Systems Principles, SOSP '01, pages 174-187, New York, NY, USA, 2001. ACM. ISBN 1-58113-389-8. URL http://pdos.csail.mit.edu/papers/lbfs:sosp01/lbfs.pdf.
- Y. Nam, G. Lu, N. Park, W. Xiao, and D. H. C. Du. Chunk fragmentation level: An effective indicator for read performance degradation in deduplication storage. In Proceedings of the 2011 IEEE International Conference on High Performance Computing and Communications, HPCC '11, pages 581-586, Washington, DC, USA, 2011. IEEE Computer Society. ISBN 978-0-7695-4538-7. URL http://dx.doi.org/10.1109/HPCC.2011.82.
- Y. J. Nam, D. Park, and D. H. C. Du. Assuring demanded read performance of data deduplication storage with backup datasets. In Proceedings of the 2012 IEEE 20th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, MASCOTS '12, pages 201-208, Washington, DC, USA, 2012. IEEE Computer Society. ISBN 978-0-7695-4793-0. URL http://dx.doi.org/10.1109/MASCOTS.2012.32.
- NEC. HYDRAstor Grid Storage System. URL http://www.hydrastor.com.
- NEC HS8-4000. NEC HYDRAstor HS8-4000 Specification, 2013. URL http://www.necam.com/HYDRAstor/doc.cfm?t=HS8-4000.
- W. C. Preston. Target deduplication appliance performance comparison. BackupCentral.com, October 2010a. URL http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/348-target-deduplication-appliance-performance-comparison.html.
- W. C. Preston. Restoring deduped data in deduplication systems. SearchDataBackup.com, April 2010b. URL http://searchdatabackup.techtarget.com/feature/Restoring-deduped-data-in-deduplication-systems.
- Quantum. DXi Deduplication Solution. URL http://www.quantum.com/products/disk-basedbackup/index.aspxm.
- S. Quinlan and S. Dorward. Venti: A new approach to archival storage. In Proceedings of the 1st USENIX Conference on File and Storage Technologies, FAST'02, pages 7-7, Berkeley, CA, USA, 2002. USENIX Association. URL http://dl.acm.org/citation.cfm?id=1973333.1973340.
- M. Rabin. Fingerprinting by random polynomials. Technical report, Center for Research in Computing Technology, Harvard University, New York, NY, USA, 1981. URL http://www.xmailserver.org/rabin.pdf.
- B. Romanski, L. Heldt, W. Kilian, K. Lichota, and C. Dubnicki. Anchor-driven subchunk deduplication. In Proceedings of the 4th Annual International Conference on Systems and Storage, SYSTOR '11, pages 16:1-16:13, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0773-4. URL http://doi.acm.org/10.1145/1987816.1987837.
- Seagate. Common enterprise disk specification (based on Seagate Constellation ES.3 4TB, model 2012). URL http://www.seagate.com/www-content/product-content/constellation-fam/constellation-es/constellation-es-3/en-us/docs/constellation-es-3-data-sheet-ds1769-1-1210us.pdf.
- Symantec. NetBackup Appliances. URL http://www.symantec.com/backup-appliance.
- G. Wallace, F. Douglis, H. Qian, P. Shilane, S. Smaldone, M. Chamness, and W. Hsu. Characteristics of backup workloads in production systems. In Proceedings of the 10th USENIX conference on File and Storage Technologies, FAST'12, pages 4-4, Berkeley, CA, USA, 2012. USENIX Association. URL http://dl.acm.org/citation.cfm?id=2208461.2208465.
- H. Weatherspoon and J. Kubiatowicz. Erasure coding vs. replication: A quantitative comparison. In IPTPS '01: Revised Papers from the First International Workshop on Peer-to-Peer Systems, pages 328-338, London, UK, 2002. ISBN 3-540-44179-4. URL http://www.cs.rice.edu/Conferences/IPTPS02/170.pdf.
- L. Whitehouse. Restoring deduped data. searchdatabackup.techtarget.com, August 2008. URL http://searchdatabackup.techtarget.com/tip/Restoring-deduped-data.
- L. Whitehouse, B. Lundell, J. McKnight, and J. Gahm. 2010 Data Protection Trends. Enterprise Strategy Group, April 2010.
- B. Zhu, K. Li, and H. Patterson. Avoiding the disk bottleneck in the data domain deduplication file system. In Proceedings of the 6th USENIX Conference on File and Storage Technologies, FAST'08, pages 18:1-18:14, Berkeley, CA, USA, 2008. USENIX Association. URL http://dl.acm.org/citation.cfm?id=1364813.1364831.
Index Terms
- Reducing fragmentation impact with forward knowledge in backup systems with deduplication