skip to main content
research-article

Efficient Hybrid Inline and Out-of-Line Deduplication for Backup Storage

Published:29 December 2014Publication History
Skip Abstract Section

Abstract

Backup storage systems often remove redundancy across backups via inline deduplication, which works by referring duplicate chunks of the latest backup to those of existing backups. However, inline deduplication degrades restore performance of the latest backup due to fragmentation, and complicates deletion of expired backups due to the sharing of data chunks. While out-of-line deduplication addresses the problems by forward-pointing existing duplicate chunks to those of the latest backup, it introduces additional I/Os of writing and removing duplicate chunks.

We design and implement RevDedup, an efficient hybrid inline and out-of-line deduplication system for backup storage. It applies coarse-grained inline deduplication to remove duplicates of the latest backup, and then fine-grained out-of-line reverse deduplication to remove duplicates from older backups. Our reverse deduplication design limits the I/O overhead and prepares for efficient deletion of expired backups. Through extensive testbed experiments using synthetic and real-world datasets, we show that RevDedup can bring high performance to the backup, restore, and deletion operations, while maintaining high storage efficiency comparable to conventional inline deduplication.

References

  1. Bill Andrews. 2013. Straight Talk About Disk Backup with Deduplication. ExaGrid, Westborough, MA.Google ScholarGoogle Scholar
  2. D. Bhagwat, K. Eshghi, D. D. E. Long, and M. Lillibridge. 2009. Extreme binning: Scalable, parallel deduplication for chunk-based file backup. In Proceedings of the 17th IEEE/ACM International Symposium on Modelling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS’09). 1--9.Google ScholarGoogle Scholar
  3. J. Black. 2006. Compare-by-hash: A reasoned analysis. In Proceedings of the USENIX Annual Technical Conference (ATC’06). 85--90. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. B. H. Bloom. 1970. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 7 (July 1970), 422--426. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. F. C. Botelho, P. Shilane, N. Garg, and W. Hsu. 2013. Memory efficient sanitization of a deduplicated storage system. In Proceedings of the 11th USENIX Conference on File and Storage Technologies (FAST’13). 81--94. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. B. Debnath, S. Sengupta, and J. Li. 2010. Chunkstash: Speeding up inline storage deduplication using flash memory. In Proceedings of the USENIX Annual Technical Conference (ATC’10). 215--230. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. C. Dubnicki, L. Gryz, L. Heldt, M. Kaczmarczyk, W. Kilian, P. Strzelczak, J. Szczepkowski, C. Ungureanu, and M. Welnicki. 2009. Hydrastor: A scalable secondary storage. In Proccedings of the 7th Conference on File and Storage Technologies (FAST’09). 197--210. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. FAL Labs. 2012. Kyoto Cabinet. Retrieved from http://fallabs.com/kyotocabinet/. Last Accessed 2013/11/17.Google ScholarGoogle Scholar
  9. F. Guo and P. Efstathopoulos. 2011. Building a high-performance deduplication system. In Proceedings of the USENIX Annual Technical Conference (ATC’11). 271--284. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. K. Jin and E. L. Miller. 2009. The effectiveness of deduplication on virtual machine disk images. In Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference. 7:1--7:12. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. M. Kaczmarczyk, M. Barczynski, W. Kilian, and C. Dubnicki. 2012. Reducing impact of data fragmentation caused by in-line deduplication. In Proceedings of the 5th Annual International Systems and Storage Conference (SYSTOR’12). 15:1--15:12. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. E. Kruus, C. Ungureanu, and C. Dubnicki. 2010. Bimodal content defined chunking for backup streams. In Proceedings of the 8th USENIX Conference on File and Storage Technologies (FAST’10). 239--252. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. M. Lillibridge, K. Eshghi, and D. Bhagwat. 2013. Improving restore speed for backup systems that use inline chunk-based deduplication. In Proceedings of the 11th USENIX Conference on File and Storage Technologies (FAST’13). 183--198. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar, G. Trezise, and P. Camble. 2009. Sparse indexing: Large scale, inline deduplication using sampling and locality. In Proccedings of the 7th Conference on File and Storage Technologies (FAST’09). 111--123. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. B. Mao, H. Jiang, S. Wu, Y. Fu, and L. Tian. 2014. Read-performance optimization for deduplication-based storage systems in the cloud. ACM Trans. Storage 10, 2, 6:1--6:22. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. D. Meister and A. Brinkmann. 2010. dedupv1: Improving deduplication throughput using solid state drives (ssd). In Proceedings of the 26th Symposium on Mass Storage Systems and Technologies (MSST’10). 1--6. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. D. Meister, J. Kaiser, and A. Brinkmann. 2013. Block locality caching for data deduplication. In Proceedings of the 6th International Systems and Storage Conference (SYSTOR’13). 15:1--15:12. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Y. Nam, D. Park, and D. Du. 2012. Assuring demanded read performance of data deduplication storage with backup datasets. In Proceedings of the 20th IEEE International Symposium on Modeling, Analysis and Simulations of Computer and Telecommunication Systems (MASCOTS’12). 248--255. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. C. Ng and P. Lee. 2013. RevDedup: A reverse deduplication storage system optimized for reads to latest backups. In Proceedings of the 4th Asia-Pacific Workshop on Systems (APSYS’13). 15:1--15:7. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. OpenSSL. 2014. Homepage. Retrieved from http://www.openssl.org.Google ScholarGoogle Scholar
  21. S. Quinlan and S. Dorward. 2002. Venti: A new approach to archival data storage. In Proceedings of the 1st USENIX Conference on File and Storage Technologies (FAST’02). 89--101. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. M. Rabin. 1981. Fingerprint by Random Polynomials. Technical Report TR-15-81. Center for Research in Computing Technology, Harvard University.Google ScholarGoogle Scholar
  23. S. Rhea, R. Cox, and A. Pesterev. 2008. Fast, inexpensive content-addressed storage in foundation. In Proceedings of the USENIX Annual Technical Conference (ATC’08). 143--156. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. D. Simha, M. Lu, and T. Chiueh. 2013. A scalable deduplication and garbage collection engine for incremental backup. In Proceedings of the 6th International Systems and Storage Conference (SYSTOR’13). 16:1--16:12. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. K. Srinivasan, T. Bisson, G. Goodson, and K. Voruganti. 2012. iDedup: Latency-aware, inline data deduplication for primary storage. In Proceedings of the 10th USENIX Conference on File and Storage Technologies (FAST’12). 299--312. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. P. Strzelczak, E. Adamczyk, U. Herman-Izycka, J. Sakowicz, L. Slusarczyk, J. Wrona, and C. Dubnicki. 2013. Concurrent deletion in a distributed content-addressable storage system with global deduplication. In Proceedings of the 11th USENIX Conference on File and Storage Technologies (FAST’13). 161--174. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. G. Wallace, F. Douglis, H. Qian, P. Shilane, S. Smaldone, M. Chamness, and W. Hsu. 2012. Characteristics of backup workloads in production systems. In Proceedings of the 10th USENIX Conference on File and Storage Technologies (FAST’12). 33--48. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. W. Xia, H. Jiang, D. Feng, and Y. Hua. 2011. Silo: A similarity-locality based near-exact deduplication scheme with low ram overhead and high throughput. In Proceedings of the USENIX Annual Technical Conference (ATC’11). 285--298. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. B. Zhu, K. Li, and H. Patterson. 2008. Avoiding the disk bottleneck in the data domain deduplication file system. In Proceedings of the 6th USENIX Conference on File and Storage Technologies (FAST’08). 269--282. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. J. Ziv and A. Lempel. 1977. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 23, 3 (May 1977), 337--343. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Efficient Hybrid Inline and Out-of-Line Deduplication for Backup Storage

Recommendations

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Sign in

Full Access

  • Published in

    cover image ACM Transactions on Storage
    ACM Transactions on Storage  Volume 11, Issue 1
    February 2015
    100 pages
    ISSN:1553-3077
    EISSN:1553-3093
    DOI:10.1145/2705611
    • Editor:
    • Darrell Long
    Issue’s Table of Contents

    Copyright © 2014 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    • Published: 29 December 2014
    • Accepted: 1 June 2014
    • Received: 1 April 2014
    Published in tos Volume 11, Issue 1

    Permissions

    Request permissions about this article.

    Request Permissions

    Check for updates

    Qualifiers

    • research-article
    • Research
    • Refereed

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader