ABSTRACT
Deduplication results inevitably in data fragmentation, because logically continuous data is scattered across many disk locations. In this work we focus on fragmentation caused by duplicates from previous backups of the same backup set, since such duplicates are very common due to repeated full backups containing a lot of unchanged data. For systems with in-line dedup which detects duplicates during writing and avoids storing them, such fragmentation causes data from the latest backup being scattered across older backups. As a result, the time of restore from the latest backup can be significantly increased, sometimes more than doubled.
We propose an algorithm called context-based rewriting (CBR in short) minimizing this drop in restore performance for latest backups by shifting fragmentation to older backups, which are rarely used for restore. By selectively rewriting a small percentage of duplicates during backup, we can reduce the drop in restore bandwidth from 12--55% to only 4--7%, as shown by experiments driven by a set of backup traces. All of this is achieved with only small increase in writing time, between 1% and 5%. Since we rewrite only few duplicates and old copies of rewritten data are removed in the background, the whole process introduces small and temporary space overhead.
- L. Aronovich, R. Asher, E. Bachmat, H. Bitner, M. Hirsch, and S. T. Klein. The design of a similarity based dedupli-cation system. In Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference, SYSTOR '09, pages 6:1--6:14, New York, NY, USA, 2009. ACM. Google ScholarDigital Library
- T. Asaro and H. Biggar. Data De-duplication and Disk-to-Disk Backup Systems: Technical and Business Considerations, 2007. The Enterprise Strategy Group.Google Scholar
- D. Bhagwat, K. Eshghi, D. D. E. Long, and M. Lillibridge. Extreme binning: Scalable, parallel deduplication for chunk-based file backup. In Proceedings of the 17th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS 2009), Sep 2009.Google ScholarCross Ref
- H. Biggar. Experiencing Data De-Duplication: Improving Efficiency and Reducing Capacity Requirements, 2007. The Enterprise Strategy Group.Google Scholar
- A. Z. Broder. Some aplications of rabin's fingerprinting method. In Sequences II: Methods in Communications, Security, and Computer Science, pages 143--152. Springer-Verlag, 1993.Google ScholarCross Ref
- L. P. Cox, C. D. Murray, and B. D. Noble. Pastiche: making backup cheap and easy. In OSDI '02: Proceedings of the 5th symposium on Operating systems design and implementation, pages 285--298, New York, NY, USA, 2002. ACM. Google ScholarDigital Library
- B. Debnath, S. Sengupta, and J. Li. Chunkstash: Speeding up inline storage deduplication using flash memory. In 2010 USENIX Annual Technical Conference, June 2010. Google ScholarDigital Library
- W. Dong, F. Douglis, K. Li, H. Patterson, S. Reddy, and P. Shilane. Tradeoffs in scalable data routing for deduplication clusters. In Proceedings of the 9th USENIX conference on File and Storage Technologies, FAST'11, pages 15--29, Berkeley, CA, USA, 2011. USENIX Association. Google ScholarDigital Library
- C. Dubnicki, L. Gryz, L. Heldt, M. Kaczmarczyk, W. Kilian, P. Strzelczak, J. Szczepkowski, C. Ungureanu, and M. Welnicki. HYDRAstor: a Scalable Secondary Storage. In FAST'09: Proceedings of the 7th USENIX Conference on File and Storage Technologies, pages 197--210, Berkeley, CA, USA, 2009. USENIX Association. Google ScholarDigital Library
- C. Dubnicki, C. Ungureanu, and W. Kilian. FPN: A distributed hash table for commercial applications. In Proceedings of the 13th IEEE International Symposium on High Performance Distributed Computing, pages 120--128, Washington, DC, USA, 2004. IEEE Computer Society. Google ScholarDigital Library
- D. E. Eastlake and P. E. Jones. US Secure Hash Algorithm 1 (SHA1). RFC 3174 (Informational), September 2001. Google ScholarDigital Library
- EMC Avamar: Backup and recovery with global deduplication, 2008. http://www.emc.com/avamar.Google Scholar
- EMC Centera: Content addressed storage system, January 2008. http://www.emc.com/centera.Google Scholar
- EMC Corporation: Data Domain Global Deduplication Array, 2011. http://www.datadomain.com/products/global-deduplication-array.html.Google Scholar
- EMC Corporation: DataDomain - Deduplication Storage for Backup, Archiving and Disaster Recovery, 2011. http://www.datadomain.com.Google Scholar
- Exagrid. http://www.exagrid.com.Google Scholar
- D. Floyer. Wikibon Data De-duplication Performance Tables. Wikibon.org, May 2011. http://wikibon.org/wiki/v/Wikibon_Data_De-duplication_Performance_Tables.Google Scholar
- R. Koller and R. Rangaswami. I/O Deduplication: Utilizing content similarity to improve I/O performance. volume 6, pages 13:1--13:26, New York, NY, USA, September 2010. ACM. Google ScholarDigital Library
- E. Kruus, C. Ungureanu, and C. Dubnicki. Bimodal content defined chunking for backup streams. In Proceedings of the 8th USENIX conference on File and Storage Technologies, FAST'10, pages 239--252, Berkeley, CA, USA, 2010. USENIX Association. Google ScholarDigital Library
- M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar, G. Trezis, and P. Camble. Sparse indexing: Large scale, inline deduplication using sampling and locality. In FAST'09: Proceedings of the 7th USENIX Conference on File and Storage Technologies, pages 111--123, 2009. Google ScholarDigital Library
- J. Livens. Deduplication and restore performance. Wikibon.org, January 2009. http://wikibon.org/wiki/v/Deduplication_and_restore_performance.Google Scholar
- J. Livens. Defragmentation, rehydration and deduplication. AboutRestore.com, June 2009. http://www.aboutrestore.com/2009/06/24/defragmentation-rehydration-and-deduplication/.Google Scholar
- D. Meister and A. Brinkmann. dedupv1: Improving Deduplication Throughput using Solid State Drives (SSD). In Proceedings of the 26th IEEE Symposium on Massive Storage Systems and Technologies (MSST), May 2010. Google ScholarDigital Library
- A. Muthitacharoen, B. Chen, and D. Mazires. A low-bandwidth network file system. In In Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP '01, pages 174--187, New York, NY, USA, 2001. ACM. Google ScholarDigital Library
- P. Nath, B. Urgaonkar, and A. Sivasubramaniam. Evaluating the usefulness of content addressable storage for high-performance data intensive applications. In HPDC '08: Proceedings of the 17th international symposium on High performance distributed computing, pages 35--44, New York, NY, USA, 2008. ACM. Google ScholarDigital Library
- NEC Corporation. HYDRAstor Grid Storage System, 2008. http://www.hydrastor.com.Google Scholar
- W. C. Preston. The Rehydration Myth. BackupCentral.com, June 2009. http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/247-rehydration-myth.html/.Google Scholar
- W. C. Preston. Restoring deduped data in deduplication systems. SearchDataBackup.com, April 2010. http://searchdatabackup.techtarget.com/feature/Restoring-deduped-data-in-deduplication-systems.Google Scholar
- W. C. Preston. Solving common data deduplication system problems. SearchDataBackup.com, November 2010. http://searchdatabackup.techtarget.com/feature/Solving-common-data-deduplication-system-problems.Google Scholar
- W. C. Preston. Target deduplication appliance performance comparison. BackupCentral.com, October 2010. http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/348-target-deduplication-appliance-performance-comparison.html.Google Scholar
- Quantum Corporation: DXi Deduplication Solution, 2011. http://www.quantum.com.Google Scholar
- S. Quinlan and S. Dorward. Venti: A new approach to archival storage. In FAST'02: Proceedings of the Conference on File and Storage Technologies, pages 89--101, Berkeley, CA, USA, 2002. USENIX Association. Google ScholarDigital Library
- M. Rabin. Fingerprinting by random polynomials. Technical report, Center for Research in Computing Technology, Harvard University, New York, NY, USA, 1981.Google Scholar
- B. Romanski, L. Heldt, W. Kilian, K. Lichota, and C. Dubnicki. Anchor-driven subchunk deduplication. In Proceedings of the 4th Annual International Conference on Systems and Storage, SYSTOR '11, pages 16:1--16:13, New York, NY, USA, 2011. ACM. Google ScholarDigital Library
- SEPATON Scalable Data Deduplication Solutions. http://sepaton.com/solutions/data-deduplication.Google Scholar
- L. Whitehouse. Restoring deduped data. searchdatabackup.techtarget.com, August 2008. http://searchdatabackup.techtarget.com/tip/Restoring-deduped-data.Google Scholar
- B. Zhu, K. Li, and H. Patterson. Avoiding the disk bottleneck in the Data Domain deduplication file system. In FAST'08: Proceedings of the 6th USENIX Conference on File and Storage Technologies, pages 1--14, Berkeley, CA, USA, 2008. USENIX Association. Google ScholarDigital Library
Index Terms
- Reducing impact of data fragmentation caused by in-line deduplication
Recommendations
Reducing fragmentation impact with forward knowledge in backup systems with deduplication
SYSTOR '15: Proceedings of the 8th ACM International Systems and Storage ConferenceDeduplication of backups is very effective in saving storage, but may also cause significant restore slowdown. This problem is caused by data fragmentation, where logically continuous but duplicate data is not placed sequentially on the disk. Two types ...
A study of practical deduplication
We collected file system content data from 857 desktop computers at Microsoft over a span of 4 weeks. We analyzed the data to determine the relative efficacy of data deduplication, particularly considering whole-file versus block-level elimination of ...
Anchor-driven subchunk deduplication
SYSTOR '11: Proceedings of the 4th Annual International Conference on Systems and StorageData deduplication, implemented usually with content defined chunking (CDC), is today one of key features of advanced storage systems providing space for backup applications. Although simple and effective, CDC generates chunks with sizes clustered ...
Comments