DOI: 10.1145/2757667.2757678
research-article

Reducing fragmentation impact with forward knowledge in backup systems with deduplication

Published: 26 May 2015

ABSTRACT

Deduplication of backups is very effective in saving storage, but it may also cause a significant restore slowdown. This problem is caused by data fragmentation, where logically continuous but duplicate data is not placed sequentially on disk. Two types of fragmentation introduce a high restore penalty: inter-version fragmentation, caused by duplicates present in multiple versions of the same backup, and internal fragmentation, caused by duplicates present within a single backup stream.

This paper introduces the Limited Forward Knowledge cache (LFK), which reduces the internal fragmentation problem. The cache evicts blocks based on the limited forward knowledge available. Because keeping full knowledge requires memory proportional to the size of a backup, we limit the forward knowledge to an 8GB window and show that this limitation does not impact performance significantly. To further increase LFK effectiveness in the presence of inter-version fragmentation, we combined this algorithm with an existing solution called Context-Based Rewriting, or CBR (Kaczmarczyk et al. 2012).
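The idea of evicting with bounded look-ahead can be illustrated with a small simulation. This is only a sketch in the spirit of Belady's replacement policy (reference 5), not the paper's implementation: the function name `simulate_lfk`, the representation of a restore as a list of block identifiers, and a window measured in blocks rather than bytes are all assumptions made for illustration.

```python
def simulate_lfk(trace, capacity, window):
    """Count cache misses for a restore trace when eviction is guided by
    a bounded look-ahead window (hypothetical sketch, not the paper's code).

    trace    -- sequence of block identifiers in restore order
    capacity -- number of blocks the cache can hold (assumed >= 1)
    window   -- how many upcoming accesses the cache is allowed to see
    """
    cache = {}  # dict used as an insertion-ordered set of cached blocks
    misses = 0
    for i, block in enumerate(trace):
        if block in cache:
            continue  # hit: the block is served from memory
        misses += 1   # miss: the block must be read from disk
        if len(cache) >= capacity:
            # Record the first position at which each block reappears
            # inside the look-ahead window.
            next_use = {}
            for offset, upcoming in enumerate(trace[i + 1 : i + 1 + window]):
                next_use.setdefault(upcoming, offset)
            # Evict the block needed furthest in the future; blocks that do
            # not appear in the window are treated as the furthest of all.
            victim = max(cache, key=lambda b: next_use.get(b, window))
            del cache[victim]
        cache[block] = None
    return misses


# With look-ahead the cache keeps block 1 for its reuse; without it, it does not.
print(simulate_lfk([1, 2, 3, 1], capacity=2, window=3))  # 3 misses
print(simulate_lfk([1, 2, 3, 1], capacity=2, window=0))  # 4 misses
```

In this toy trace, a window of three accesses is enough to see that block 1 will be needed again, so block 2 is evicted instead. With no look-ahead the choice falls back to insertion order and the reuse is lost.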

Our evaluation with real-world traces shows that data fragmentation results in an average 42% slowdown for backups stored on a single disk. LFK alone reduces this drop to 21%. CBR+LFK eliminates it completely, so the restore speed equals that of reading non-duplicated data. In a multi-disk setup, the standard approach suffers from an 83% drop in restore performance. The combined algorithms reduce this drop to 35%, giving a 4 times better restore bandwidth.
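The multi-disk figures can be checked with a short calculation, assuming both drops are measured against the same non-fragmented baseline bandwidth (an assumption; the abstract does not state the baseline explicitly). An 83% drop leaves 17% of the baseline and a 35% drop leaves 65%, so the ratio of the two restore bandwidths is roughly 4:

```latex
\[
\frac{1 - 0.35}{1 - 0.83} = \frac{0.65}{0.17} \approx 3.8
\]
```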

References

  1. R. Amatruda. Worldwide Purpose-Built Backup Appliance 2012--2016 Forecast and 2011 Vendor Shares. International Data Corporation, April 2012. URL http://www.emc.com/collateral/analyst-reports/idc-worldwide-purpose-built-backup-appliance.pdf.
  2. L. Aronovich, R. Asher, E. Bachmat, H. Bitner, M. Hirsch, and S. T. Klein. The design of a similarity based deduplication system. In Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference, SYSTOR '09, pages 6:1--6:14, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-623-6. URL http://doi.acm.org/10.1145/1534530.1534539.
  3. T. Asaro and H. Biggar. Data De-duplication and Disk-to-Disk Backup Systems: Technical and Business Considerations. Enterprise Strategy Group, July 2007.
  4. B. Babineau and D. A. Chapa. Deduplication's Business Imperatives. Enterprise Strategy Group, December 2010. Sponsored by EMC Corporation.
  5. L. A. Belady. A study of replacement algorithms for a virtual-storage computer. IBM Syst. J., 5(2): 78--101, June 1966. ISSN 0018-8670. URL http://dx.doi.org/10.1147/sj.52.0078.
  6. C. Dubnicki, L. Gryz, L. Heldt, M. Kaczmarczyk, W. Kilian, P. Strzelczak, J. Szczepkowski, C. Ungureanu, and M. Welnicki. Hydrastor: A scalable secondary storage. In Proceedings of the 7th Conference on File and Storage Technologies, FAST '09, pages 197--210, Berkeley, CA, USA, 2009. USENIX Association. URL http://dl.acm.org/citation.cfm?id=1525908.1525923.
  7. EMC. DataDomain - Deduplication Storage for Backup, Archiving and Disaster Recovery. URL http://www.datadomain.com.
  8. ExaGrid. Exagrid. URL http://www.exagrid.com.
  9. M. Fu, D. Feng, Y. Hua, X. He, Z. Chen, W. Xia, F. Huang, and Q. Liu. Accelerating restore and garbage collection in deduplication-based backup systems via exploiting historical information. In Proceedings of the 2014 USENIX Conference on USENIX Annual Technical Conference, USENIX ATC'14, pages 181--192, Berkeley, CA, USA, 2014. USENIX Association. ISBN 978-1-931971-10-2. URL http://dl.acm.org/citation.cfm?id=2643634.2643653.
  10. M. Fu, D. Feng, Y. Hua, X. He, Z. Chen, W. Xia, Y. Zhang, and Y. Tan. Design tradeoffs for data deduplication performance in backup workloads. In 13th USENIX Conference on File and Storage Technologies (FAST 15), pages 331--344, Santa Clara, CA, Feb. 2015. USENIX Association. ISBN 978-1-931971-201. URL https://www.usenix.org/conference/fast15/technical-sessions/presentation/fu.
  11. G. L. Heileman and W. Luo. How caching affects hashing. In C. Demetrescu, R. Sedgewick, and R. Tamassia, editors, ALENEX/ANALCO, pages 141--154. SIAM, 2005. ISBN 0-89871-596-2. URL http://www.siam.org/meetings/alenex05/papers/13gheileman.pdf.
  12. HP. HP StoreOnce Backup. URL http://www8.hp.com/us/en/products/data-storage/storage-backup-archive.html.
  13. IBM. IBM ProtecTIER Deduplication Solution. URL http://www-03.ibm.com/systems/storage/tape/ts7650g/.
  14. M. Kaczmarczyk. Fragmentation in storage systems with duplicate elimination. PhD thesis, University of Warsaw, Poland, 2015. To be published in June 2015.
  15. M. Kaczmarczyk, M. Barczynski, W. Kilian, and C. Dubnicki. Reducing impact of data fragmentation caused by in-line deduplication. In Proceedings of the 5th Annual International Systems and Storage Conference, SYSTOR '12, pages 15:1--15:12, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1448-0. URL http://doi.acm.org/10.1145/2367589.2367600.
  16. M. Lillibridge, K. Eshghi, and D. Bhagwat. Improving restore speed for backup systems that use inline chunk-based deduplication. In Proceedings of the 11th USENIX Conference on File and Storage Technologies, FAST'13, pages 183--198, Berkeley, CA, USA, 2013. USENIX Association. URL http://dl.acm.org/citation.cfm?id=2591272.2591292.
  17. J. Livens. Deduplication and restore performance. Wikibon.org, January 2009a. URL http://wikibon.org/wiki/v/Deduplication_and_restore_performance.
  18. J. Livens. Defragmentation, rehydration and deduplication. AboutRestore.com, June 2009b. URL http://www.aboutrestore.com/2009/06/24/defragmentation-rehydration-and-deduplication/.
  19. A. Muthitacharoen, B. Chen, and D. Mazières. A low-bandwidth network file system. In Proceedings of the 18th ACM Symposium on Operating Systems Principles, SOSP '01, pages 174--187, New York, NY, USA, 2001. ACM. ISBN 1-58113-389-8. URL http://pdos.csail.mit.edu/papers/lbfs:sosp01/lbfs.pdf.
  20. Y. Nam, G. Lu, N. Park, W. Xiao, and D. H. C. Du. Chunk fragmentation level: An effective indicator for read performance degradation in deduplication storage. In Proceedings of the 2011 IEEE International Conference on High Performance Computing and Communications, HPCC '11, pages 581--586, Washington, DC, USA, 2011. IEEE Computer Society. ISBN 978-0-7695-4538-7. URL http://dx.doi.org/10.1109/HPCC.2011.82.
  21. Y. J. Nam, D. Park, and D. H. C. Du. Assuring demanded read performance of data deduplication storage with backup datasets. In Proceedings of the 2012 IEEE 20th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, MASCOTS '12, pages 201--208, Washington, DC, USA, 2012. IEEE Computer Society. ISBN 978-0-7695-4793-0. URL http://dx.doi.org/10.1109/MASCOTS.2012.32.
  22. NEC. HYDRAstor Grid Storage System. URL http://www.hydrastor.com.
  23. NEC HS8-4000. NEC HYDRAstor HS8-4000 Specification, 2013. URL http://www.necam.com/HYDRAstor/doc.cfm?t=HS8-4000.
  24. W. C. Preston. Target deduplication appliance performance comparison. BackupCentral.com, October 2010a. URL http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/348-target-deduplication-appliance-performance-comparison.html.
  25. W. C. Preston. Restoring deduped data in deduplication systems. SearchDataBackup.com, April 2010b. URL http://searchdatabackup.techtarget.com/feature/Restoring-deduped-data-in-deduplication-systems.
  26. Quantum. DXi Deduplication Solution. URL http://www.quantum.com/products/disk-basedbackup/index.aspxm.
  27. S. Quinlan and S. Dorward. Venti: A new approach to archival storage. In Proceedings of the 1st USENIX Conference on File and Storage Technologies, FAST'02, pages 7--7, Berkeley, CA, USA, 2002. USENIX Association. URL http://dl.acm.org/citation.cfm?id=1973333.1973340.
  28. M. Rabin. Fingerprinting by random polynomials. Technical report, Center for Research in Computing Technology, Harvard University, New York, NY, USA, 1981. URL http://www.xmailserver.org/rabin.pdf.
  29. B. Romanski, L. Heldt, W. Kilian, K. Lichota, and C. Dubnicki. Anchor-driven subchunk deduplication. In Proceedings of the 4th Annual International Conference on Systems and Storage, SYSTOR '11, pages 16:1--16:13, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0773-4. URL http://doi.acm.org/10.1145/1987816.1987837.
  30. Seagate. Common enterprise disk specification (based on Seagate Constellation ES.3 4TB, model 2012). URL http://www.seagate.com/www-content/product-content/constellation-fam/constellation-es/constellation-es-3/en-us/docs/constellation-es-3-data-sheet-ds1769-1-1210us.pdf.
  31. Symantec. NetBackup Appliances. URL http://www.symantec.com/backup-appliance.
  32. G. Wallace, F. Douglis, H. Qian, P. Shilane, S. Smaldone, M. Chamness, and W. Hsu. Characteristics of backup workloads in production systems. In Proceedings of the 10th USENIX conference on File and Storage Technologies, FAST'12, pages 4--4, Berkeley, CA, USA, 2012. USENIX Association. URL http://dl.acm.org/citation.cfm?id=2208461.2208465.
  33. H. Weatherspoon and J. Kubiatowicz. Erasure coding vs. replication: A quantitative comparison. In IPTPS '01: Revised Papers from the First International Workshop on Peer-to-Peer Systems, pages 328--338, London, UK, 2002. ISBN 3-540-44179-4. URL http://www.cs.rice.edu/Conferences/IPTPS02/170.pdf.
  34. L. Whitehouse. Restoring deduped data. searchdatabackup.techtarget.com, August 2008. URL http://searchdatabackup.techtarget.com/tip/Restoring-deduped-data.
  35. L. Whitehouse, B. Lundell, J. McKnight, and J. Gahm. 2010 Data Protection Trends. Enterprise Strategy Group, April 2010.
  36. B. Zhu, K. Li, and H. Patterson. Avoiding the disk bottleneck in the data domain deduplication file system. In Proceedings of the 6th USENIX Conference on File and Storage Technologies, FAST'08, pages 18:1--18:14, Berkeley, CA, USA, 2008. USENIX Association. URL http://dl.acm.org/citation.cfm?id=1364813.1364831.

Published in

SYSTOR '15: Proceedings of the 8th ACM International Systems and Storage Conference
May 2015
183 pages
ISBN: 9781450336079
DOI: 10.1145/2757667

Copyright © 2015 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Overall Acceptance Rate: 94 of 285 submissions, 33%
