skip to main content
10.1145/2367589.2367600acmotherconferencesArticle/Chapter ViewAbstractPublication PagessystorConference Proceedingsconference-collections
research-article

Reducing impact of data fragmentation caused by in-line deduplication

Published:04 June 2012Publication History

ABSTRACT

Deduplication results inevitably in data fragmentation, because logically continuous data is scattered across many disk locations. In this work we focus on fragmentation caused by duplicates from previous backups of the same backup set, since such duplicates are very common due to repeated full backups containing a lot of unchanged data. For systems with in-line dedup which detects duplicates during writing and avoids storing them, such fragmentation causes data from the latest backup being scattered across older backups. As a result, the time of restore from the latest backup can be significantly increased, sometimes more than doubled.

We propose an algorithm called context-based rewriting (CBR in short) minimizing this drop in restore performance for latest backups by shifting fragmentation to older backups, which are rarely used for restore. By selectively rewriting a small percentage of duplicates during backup, we can reduce the drop in restore bandwidth from 12--55% to only 4--7%, as shown by experiments driven by a set of backup traces. All of this is achieved with only small increase in writing time, between 1% and 5%. Since we rewrite only few duplicates and old copies of rewritten data are removed in the background, the whole process introduces small and temporary space overhead.

References

  1. L. Aronovich, R. Asher, E. Bachmat, H. Bitner, M. Hirsch, and S. T. Klein. The design of a similarity based dedupli-cation system. In Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference, SYSTOR '09, pages 6:1--6:14, New York, NY, USA, 2009. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. T. Asaro and H. Biggar. Data De-duplication and Disk-to-Disk Backup Systems: Technical and Business Considerations, 2007. The Enterprise Strategy Group.Google ScholarGoogle Scholar
  3. D. Bhagwat, K. Eshghi, D. D. E. Long, and M. Lillibridge. Extreme binning: Scalable, parallel deduplication for chunk-based file backup. In Proceedings of the 17th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS 2009), Sep 2009.Google ScholarGoogle ScholarCross RefCross Ref
  4. H. Biggar. Experiencing Data De-Duplication: Improving Efficiency and Reducing Capacity Requirements, 2007. The Enterprise Strategy Group.Google ScholarGoogle Scholar
  5. A. Z. Broder. Some aplications of rabin's fingerprinting method. In Sequences II: Methods in Communications, Security, and Computer Science, pages 143--152. Springer-Verlag, 1993.Google ScholarGoogle ScholarCross RefCross Ref
  6. L. P. Cox, C. D. Murray, and B. D. Noble. Pastiche: making backup cheap and easy. In OSDI '02: Proceedings of the 5th symposium on Operating systems design and implementation, pages 285--298, New York, NY, USA, 2002. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. B. Debnath, S. Sengupta, and J. Li. Chunkstash: Speeding up inline storage deduplication using flash memory. In 2010 USENIX Annual Technical Conference, June 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. W. Dong, F. Douglis, K. Li, H. Patterson, S. Reddy, and P. Shilane. Tradeoffs in scalable data routing for deduplication clusters. In Proceedings of the 9th USENIX conference on File and Storage Technologies, FAST'11, pages 15--29, Berkeley, CA, USA, 2011. USENIX Association. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. C. Dubnicki, L. Gryz, L. Heldt, M. Kaczmarczyk, W. Kilian, P. Strzelczak, J. Szczepkowski, C. Ungureanu, and M. Welnicki. HYDRAstor: a Scalable Secondary Storage. In FAST'09: Proceedings of the 7th USENIX Conference on File and Storage Technologies, pages 197--210, Berkeley, CA, USA, 2009. USENIX Association. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. C. Dubnicki, C. Ungureanu, and W. Kilian. FPN: A distributed hash table for commercial applications. In Proceedings of the 13th IEEE International Symposium on High Performance Distributed Computing, pages 120--128, Washington, DC, USA, 2004. IEEE Computer Society. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. D. E. Eastlake and P. E. Jones. US Secure Hash Algorithm 1 (SHA1). RFC 3174 (Informational), September 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. EMC Avamar: Backup and recovery with global deduplication, 2008. http://www.emc.com/avamar.Google ScholarGoogle Scholar
  13. EMC Centera: Content addressed storage system, January 2008. http://www.emc.com/centera.Google ScholarGoogle Scholar
  14. EMC Corporation: Data Domain Global Deduplication Array, 2011. http://www.datadomain.com/products/global-deduplication-array.html.Google ScholarGoogle Scholar
  15. EMC Corporation: DataDomain - Deduplication Storage for Backup, Archiving and Disaster Recovery, 2011. http://www.datadomain.com.Google ScholarGoogle Scholar
  16. Exagrid. http://www.exagrid.com.Google ScholarGoogle Scholar
  17. D. Floyer. Wikibon Data De-duplication Performance Tables. Wikibon.org, May 2011. http://wikibon.org/wiki/v/Wikibon_Data_De-duplication_Performance_Tables.Google ScholarGoogle Scholar
  18. R. Koller and R. Rangaswami. I/O Deduplication: Utilizing content similarity to improve I/O performance. volume 6, pages 13:1--13:26, New York, NY, USA, September 2010. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. E. Kruus, C. Ungureanu, and C. Dubnicki. Bimodal content defined chunking for backup streams. In Proceedings of the 8th USENIX conference on File and Storage Technologies, FAST'10, pages 239--252, Berkeley, CA, USA, 2010. USENIX Association. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar, G. Trezis, and P. Camble. Sparse indexing: Large scale, inline deduplication using sampling and locality. In FAST'09: Proceedings of the 7th USENIX Conference on File and Storage Technologies, pages 111--123, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. J. Livens. Deduplication and restore performance. Wikibon.org, January 2009. http://wikibon.org/wiki/v/Deduplication_and_restore_performance.Google ScholarGoogle Scholar
  22. J. Livens. Defragmentation, rehydration and deduplication. AboutRestore.com, June 2009. http://www.aboutrestore.com/2009/06/24/defragmentation-rehydration-and-deduplication/.Google ScholarGoogle Scholar
  23. D. Meister and A. Brinkmann. dedupv1: Improving Deduplication Throughput using Solid State Drives (SSD). In Proceedings of the 26th IEEE Symposium on Massive Storage Systems and Technologies (MSST), May 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. A. Muthitacharoen, B. Chen, and D. Mazires. A low-bandwidth network file system. In In Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP '01, pages 174--187, New York, NY, USA, 2001. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. P. Nath, B. Urgaonkar, and A. Sivasubramaniam. Evaluating the usefulness of content addressable storage for high-performance data intensive applications. In HPDC '08: Proceedings of the 17th international symposium on High performance distributed computing, pages 35--44, New York, NY, USA, 2008. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. NEC Corporation. HYDRAstor Grid Storage System, 2008. http://www.hydrastor.com.Google ScholarGoogle Scholar
  27. W. C. Preston. The Rehydration Myth. BackupCentral.com, June 2009. http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/247-rehydration-myth.html/.Google ScholarGoogle Scholar
  28. W. C. Preston. Restoring deduped data in deduplication systems. SearchDataBackup.com, April 2010. http://searchdatabackup.techtarget.com/feature/Restoring-deduped-data-in-deduplication-systems.Google ScholarGoogle Scholar
  29. W. C. Preston. Solving common data deduplication system problems. SearchDataBackup.com, November 2010. http://searchdatabackup.techtarget.com/feature/Solving-common-data-deduplication-system-problems.Google ScholarGoogle Scholar
  30. W. C. Preston. Target deduplication appliance performance comparison. BackupCentral.com, October 2010. http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/348-target-deduplication-appliance-performance-comparison.html.Google ScholarGoogle Scholar
  31. Quantum Corporation: DXi Deduplication Solution, 2011. http://www.quantum.com.Google ScholarGoogle Scholar
  32. S. Quinlan and S. Dorward. Venti: A new approach to archival storage. In FAST'02: Proceedings of the Conference on File and Storage Technologies, pages 89--101, Berkeley, CA, USA, 2002. USENIX Association. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. M. Rabin. Fingerprinting by random polynomials. Technical report, Center for Research in Computing Technology, Harvard University, New York, NY, USA, 1981.Google ScholarGoogle Scholar
  34. B. Romanski, L. Heldt, W. Kilian, K. Lichota, and C. Dubnicki. Anchor-driven subchunk deduplication. In Proceedings of the 4th Annual International Conference on Systems and Storage, SYSTOR '11, pages 16:1--16:13, New York, NY, USA, 2011. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. SEPATON Scalable Data Deduplication Solutions. http://sepaton.com/solutions/data-deduplication.Google ScholarGoogle Scholar
  36. L. Whitehouse. Restoring deduped data. searchdatabackup.techtarget.com, August 2008. http://searchdatabackup.techtarget.com/tip/Restoring-deduped-data.Google ScholarGoogle Scholar
  37. B. Zhu, K. Li, and H. Patterson. Avoiding the disk bottleneck in the Data Domain deduplication file system. In FAST'08: Proceedings of the 6th USENIX Conference on File and Storage Technologies, pages 1--14, Berkeley, CA, USA, 2008. USENIX Association. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Reducing impact of data fragmentation caused by in-line deduplication

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in
          • Published in

            cover image ACM Other conferences
            SYSTOR '12: Proceedings of the 5th Annual International Systems and Storage Conference
            June 2012
            183 pages
            ISBN:9781450314480
            DOI:10.1145/2367589

            Copyright © 2012 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 4 June 2012

            Permissions

            Request permissions about this article.

            Request Permissions

            Check for updates

            Qualifiers

            • research-article

            Acceptance Rates

            Overall Acceptance Rate94of285submissions,33%

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader