Abstract
Backup storage systems often remove redundancy across backups via inline deduplication, which works by referring duplicate chunks of the latest backup to those of existing backups. However, inline deduplication degrades restore performance of the latest backup due to fragmentation, and complicates deletion of expired backups due to the sharing of data chunks. While out-of-line deduplication addresses the problems by forward-pointing existing duplicate chunks to those of the latest backup, it introduces additional I/Os of writing and removing duplicate chunks.
We design and implement RevDedup, an efficient hybrid inline and out-of-line deduplication system for backup storage. It applies coarse-grained inline deduplication to remove duplicates of the latest backup, and then fine-grained out-of-line reverse deduplication to remove duplicates from older backups. Our reverse deduplication design limits the I/O overhead and prepares for efficient deletion of expired backups. Through extensive testbed experiments using synthetic and real-world datasets, we show that RevDedup can bring high performance to the backup, restore, and deletion operations, while maintaining high storage efficiency comparable to conventional inline deduplication.
- Bill Andrews. 2013. Straight Talk About Disk Backup with Deduplication. ExaGrid, Westborough, MA.Google Scholar
- D. Bhagwat, K. Eshghi, D. D. E. Long, and M. Lillibridge. 2009. Extreme binning: Scalable, parallel deduplication for chunk-based file backup. In Proceedings of the 17th IEEE/ACM International Symposium on Modelling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS’09). 1--9.Google Scholar
- J. Black. 2006. Compare-by-hash: A reasoned analysis. In Proceedings of the USENIX Annual Technical Conference (ATC’06). 85--90. Google ScholarDigital Library
- B. H. Bloom. 1970. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 7 (July 1970), 422--426. Google ScholarDigital Library
- F. C. Botelho, P. Shilane, N. Garg, and W. Hsu. 2013. Memory efficient sanitization of a deduplicated storage system. In Proceedings of the 11th USENIX Conference on File and Storage Technologies (FAST’13). 81--94. Google ScholarDigital Library
- B. Debnath, S. Sengupta, and J. Li. 2010. Chunkstash: Speeding up inline storage deduplication using flash memory. In Proceedings of the USENIX Annual Technical Conference (ATC’10). 215--230. Google ScholarDigital Library
- C. Dubnicki, L. Gryz, L. Heldt, M. Kaczmarczyk, W. Kilian, P. Strzelczak, J. Szczepkowski, C. Ungureanu, and M. Welnicki. 2009. Hydrastor: A scalable secondary storage. In Proccedings of the 7th Conference on File and Storage Technologies (FAST’09). 197--210. Google ScholarDigital Library
- FAL Labs. 2012. Kyoto Cabinet. Retrieved from http://fallabs.com/kyotocabinet/. Last Accessed 2013/11/17.Google Scholar
- F. Guo and P. Efstathopoulos. 2011. Building a high-performance deduplication system. In Proceedings of the USENIX Annual Technical Conference (ATC’11). 271--284. Google ScholarDigital Library
- K. Jin and E. L. Miller. 2009. The effectiveness of deduplication on virtual machine disk images. In Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference. 7:1--7:12. Google ScholarDigital Library
- M. Kaczmarczyk, M. Barczynski, W. Kilian, and C. Dubnicki. 2012. Reducing impact of data fragmentation caused by in-line deduplication. In Proceedings of the 5th Annual International Systems and Storage Conference (SYSTOR’12). 15:1--15:12. Google ScholarDigital Library
- E. Kruus, C. Ungureanu, and C. Dubnicki. 2010. Bimodal content defined chunking for backup streams. In Proceedings of the 8th USENIX Conference on File and Storage Technologies (FAST’10). 239--252. Google ScholarDigital Library
- M. Lillibridge, K. Eshghi, and D. Bhagwat. 2013. Improving restore speed for backup systems that use inline chunk-based deduplication. In Proceedings of the 11th USENIX Conference on File and Storage Technologies (FAST’13). 183--198. Google ScholarDigital Library
- M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar, G. Trezise, and P. Camble. 2009. Sparse indexing: Large scale, inline deduplication using sampling and locality. In Proccedings of the 7th Conference on File and Storage Technologies (FAST’09). 111--123. Google ScholarDigital Library
- B. Mao, H. Jiang, S. Wu, Y. Fu, and L. Tian. 2014. Read-performance optimization for deduplication-based storage systems in the cloud. ACM Trans. Storage 10, 2, 6:1--6:22. Google ScholarDigital Library
- D. Meister and A. Brinkmann. 2010. dedupv1: Improving deduplication throughput using solid state drives (ssd). In Proceedings of the 26th Symposium on Mass Storage Systems and Technologies (MSST’10). 1--6. Google ScholarDigital Library
- D. Meister, J. Kaiser, and A. Brinkmann. 2013. Block locality caching for data deduplication. In Proceedings of the 6th International Systems and Storage Conference (SYSTOR’13). 15:1--15:12. Google ScholarDigital Library
- Y. Nam, D. Park, and D. Du. 2012. Assuring demanded read performance of data deduplication storage with backup datasets. In Proceedings of the 20th IEEE International Symposium on Modeling, Analysis and Simulations of Computer and Telecommunication Systems (MASCOTS’12). 248--255. Google ScholarDigital Library
- C. Ng and P. Lee. 2013. RevDedup: A reverse deduplication storage system optimized for reads to latest backups. In Proceedings of the 4th Asia-Pacific Workshop on Systems (APSYS’13). 15:1--15:7. Google ScholarDigital Library
- OpenSSL. 2014. Homepage. Retrieved from http://www.openssl.org.Google Scholar
- S. Quinlan and S. Dorward. 2002. Venti: A new approach to archival data storage. In Proceedings of the 1st USENIX Conference on File and Storage Technologies (FAST’02). 89--101. Google ScholarDigital Library
- M. Rabin. 1981. Fingerprint by Random Polynomials. Technical Report TR-15-81. Center for Research in Computing Technology, Harvard University.Google Scholar
- S. Rhea, R. Cox, and A. Pesterev. 2008. Fast, inexpensive content-addressed storage in foundation. In Proceedings of the USENIX Annual Technical Conference (ATC’08). 143--156. Google ScholarDigital Library
- D. Simha, M. Lu, and T. Chiueh. 2013. A scalable deduplication and garbage collection engine for incremental backup. In Proceedings of the 6th International Systems and Storage Conference (SYSTOR’13). 16:1--16:12. Google ScholarDigital Library
- K. Srinivasan, T. Bisson, G. Goodson, and K. Voruganti. 2012. iDedup: Latency-aware, inline data deduplication for primary storage. In Proceedings of the 10th USENIX Conference on File and Storage Technologies (FAST’12). 299--312. Google ScholarDigital Library
- P. Strzelczak, E. Adamczyk, U. Herman-Izycka, J. Sakowicz, L. Slusarczyk, J. Wrona, and C. Dubnicki. 2013. Concurrent deletion in a distributed content-addressable storage system with global deduplication. In Proceedings of the 11th USENIX Conference on File and Storage Technologies (FAST’13). 161--174. Google ScholarDigital Library
- G. Wallace, F. Douglis, H. Qian, P. Shilane, S. Smaldone, M. Chamness, and W. Hsu. 2012. Characteristics of backup workloads in production systems. In Proceedings of the 10th USENIX Conference on File and Storage Technologies (FAST’12). 33--48. Google ScholarDigital Library
- W. Xia, H. Jiang, D. Feng, and Y. Hua. 2011. Silo: A similarity-locality based near-exact deduplication scheme with low ram overhead and high throughput. In Proceedings of the USENIX Annual Technical Conference (ATC’11). 285--298. Google ScholarDigital Library
- B. Zhu, K. Li, and H. Patterson. 2008. Avoiding the disk bottleneck in the data domain deduplication file system. In Proceedings of the 6th USENIX Conference on File and Storage Technologies (FAST’08). 269--282. Google ScholarDigital Library
- J. Ziv and A. Lempel. 1977. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 23, 3 (May 1977), 337--343. Google ScholarDigital Library
Index Terms
- Efficient Hybrid Inline and Out-of-Line Deduplication for Backup Storage
Recommendations
A study of practical deduplication
We collected file system content data from 857 desktop computers at Microsoft over a span of 4 weeks. We analyzed the data to determine the relative efficacy of data deduplication, particularly considering whole-file versus block-level elimination of ...
Storage Deduplication by Virtual Large-Scale Disks
NBIS '12: Proceedings of the 2012 15th International Conference on Network-Based Information SystemsRecently, the demand of low cost large scale storages increases. We developed VLSD (Virtual Large Scale Disks) toolkit for constructing virtual disk based distributed storages, which aggregate free spaces of individual disks. VLSD realizes low-cost ...
A scalable deduplication and garbage collection engine for incremental backup
SYSTOR '13: Proceedings of the 6th International Systems and Storage ConferenceVery large block-level data backup systems need scalable data deduplication and garbage collection techniques to make efficient use of the storage space and to minimize the performance overhead of doing so. Although the deduplication and garbage ...
Comments