research-article

Efficient Hybrid Inline and Out-of-Line Deduplication for Backup Storage

Authors:
Yan-Kit Li

The Chinese University of Hong Kong

The Chinese University of Hong Kong
View Profile

,
Min Xu

The Chinese University of Hong Kong

The Chinese University of Hong Kong
View Profile

,
Chun-Ho Ng

The Chinese University of Hong Kong

The Chinese University of Hong Kong
View Profile

,
Patrick P. C. Lee

The Chinese University of Hong Kong

The Chinese University of Hong Kong
View Profile

Authors Info & Claims

ACM Transactions on Storage Volume 11 Issue 1Article No.: 2pp 1–21https://doi.org/10.1145/2641572

Published:29 December 2014Publication History

ACM Transactions on Storage

Abstract

Backup storage systems often remove redundancy across backups via inline deduplication, which works by referring duplicate chunks of the latest backup to those of existing backups. However, inline deduplication degrades restore performance of the latest backup due to fragmentation, and complicates deletion of expired backups due to the sharing of data chunks. While out-of-line deduplication addresses the problems by forward-pointing existing duplicate chunks to those of the latest backup, it introduces additional I/Os of writing and removing duplicate chunks.

We design and implement RevDedup, an efficient hybrid inline and out-of-line deduplication system for backup storage. It applies coarse-grained inline deduplication to remove duplicates of the latest backup, and then fine-grained out-of-line reverse deduplication to remove duplicates from older backups. Our reverse deduplication design limits the I/O overhead and prepares for efficient deletion of expired backups. Through extensive testbed experiments using synthetic and real-world datasets, we show that RevDedup can bring high performance to the backup, restore, and deletion operations, while maintaining high storage efficiency comparable to conventional inline deduplication.

References

Bill Andrews. 2013. Straight Talk About Disk Backup with Deduplication. ExaGrid, Westborough, MA.Google Scholar
D. Bhagwat, K. Eshghi, D. D. E. Long, and M. Lillibridge. 2009. Extreme binning: Scalable, parallel deduplication for chunk-based file backup. In Proceedings of the 17th IEEE/ACM International Symposium on Modelling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS’09). 1--9.Google Scholar
J. Black. 2006. Compare-by-hash: A reasoned analysis. In Proceedings of the USENIX Annual Technical Conference (ATC’06). 85--90. Google ScholarDigital Library
B. H. Bloom. 1970. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 7 (July 1970), 422--426. Google ScholarDigital Library
F. C. Botelho, P. Shilane, N. Garg, and W. Hsu. 2013. Memory efficient sanitization of a deduplicated storage system. In Proceedings of the 11th USENIX Conference on File and Storage Technologies (FAST’13). 81--94. Google ScholarDigital Library
B. Debnath, S. Sengupta, and J. Li. 2010. Chunkstash: Speeding up inline storage deduplication using flash memory. In Proceedings of the USENIX Annual Technical Conference (ATC’10). 215--230. Google ScholarDigital Library
C. Dubnicki, L. Gryz, L. Heldt, M. Kaczmarczyk, W. Kilian, P. Strzelczak, J. Szczepkowski, C. Ungureanu, and M. Welnicki. 2009. Hydrastor: A scalable secondary storage. In Proccedings of the 7th Conference on File and Storage Technologies (FAST’09). 197--210. Google ScholarDigital Library
FAL Labs. 2012. Kyoto Cabinet. Retrieved from http://fallabs.com/kyotocabinet/. Last Accessed 2013/11/17.Google Scholar
F. Guo and P. Efstathopoulos. 2011. Building a high-performance deduplication system. In Proceedings of the USENIX Annual Technical Conference (ATC’11). 271--284. Google ScholarDigital Library
K. Jin and E. L. Miller. 2009. The effectiveness of deduplication on virtual machine disk images. In Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference. 7:1--7:12. Google ScholarDigital Library
M. Kaczmarczyk, M. Barczynski, W. Kilian, and C. Dubnicki. 2012. Reducing impact of data fragmentation caused by in-line deduplication. In Proceedings of the 5th Annual International Systems and Storage Conference (SYSTOR’12). 15:1--15:12. Google ScholarDigital Library
E. Kruus, C. Ungureanu, and C. Dubnicki. 2010. Bimodal content defined chunking for backup streams. In Proceedings of the 8th USENIX Conference on File and Storage Technologies (FAST’10). 239--252. Google ScholarDigital Library
M. Lillibridge, K. Eshghi, and D. Bhagwat. 2013. Improving restore speed for backup systems that use inline chunk-based deduplication. In Proceedings of the 11th USENIX Conference on File and Storage Technologies (FAST’13). 183--198. Google ScholarDigital Library
M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar, G. Trezise, and P. Camble. 2009. Sparse indexing: Large scale, inline deduplication using sampling and locality. In Proccedings of the 7th Conference on File and Storage Technologies (FAST’09). 111--123. Google ScholarDigital Library
B. Mao, H. Jiang, S. Wu, Y. Fu, and L. Tian. 2014. Read-performance optimization for deduplication-based storage systems in the cloud. ACM Trans. Storage 10, 2, 6:1--6:22. Google ScholarDigital Library
D. Meister and A. Brinkmann. 2010. dedupv1: Improving deduplication throughput using solid state drives (ssd). In Proceedings of the 26th Symposium on Mass Storage Systems and Technologies (MSST’10). 1--6. Google ScholarDigital Library
D. Meister, J. Kaiser, and A. Brinkmann. 2013. Block locality caching for data deduplication. In Proceedings of the 6th International Systems and Storage Conference (SYSTOR’13). 15:1--15:12. Google ScholarDigital Library
Y. Nam, D. Park, and D. Du. 2012. Assuring demanded read performance of data deduplication storage with backup datasets. In Proceedings of the 20th IEEE International Symposium on Modeling, Analysis and Simulations of Computer and Telecommunication Systems (MASCOTS’12). 248--255. Google ScholarDigital Library
C. Ng and P. Lee. 2013. RevDedup: A reverse deduplication storage system optimized for reads to latest backups. In Proceedings of the 4th Asia-Pacific Workshop on Systems (APSYS’13). 15:1--15:7. Google ScholarDigital Library
OpenSSL. 2014. Homepage. Retrieved from http://www.openssl.org.Google Scholar
S. Quinlan and S. Dorward. 2002. Venti: A new approach to archival data storage. In Proceedings of the 1st USENIX Conference on File and Storage Technologies (FAST’02). 89--101. Google ScholarDigital Library
M. Rabin. 1981. Fingerprint by Random Polynomials. Technical Report TR-15-81. Center for Research in Computing Technology, Harvard University.Google Scholar
S. Rhea, R. Cox, and A. Pesterev. 2008. Fast, inexpensive content-addressed storage in foundation. In Proceedings of the USENIX Annual Technical Conference (ATC’08). 143--156. Google ScholarDigital Library
D. Simha, M. Lu, and T. Chiueh. 2013. A scalable deduplication and garbage collection engine for incremental backup. In Proceedings of the 6th International Systems and Storage Conference (SYSTOR’13). 16:1--16:12. Google ScholarDigital Library
K. Srinivasan, T. Bisson, G. Goodson, and K. Voruganti. 2012. iDedup: Latency-aware, inline data deduplication for primary storage. In Proceedings of the 10th USENIX Conference on File and Storage Technologies (FAST’12). 299--312. Google ScholarDigital Library
P. Strzelczak, E. Adamczyk, U. Herman-Izycka, J. Sakowicz, L. Slusarczyk, J. Wrona, and C. Dubnicki. 2013. Concurrent deletion in a distributed content-addressable storage system with global deduplication. In Proceedings of the 11th USENIX Conference on File and Storage Technologies (FAST’13). 161--174. Google ScholarDigital Library
G. Wallace, F. Douglis, H. Qian, P. Shilane, S. Smaldone, M. Chamness, and W. Hsu. 2012. Characteristics of backup workloads in production systems. In Proceedings of the 10th USENIX Conference on File and Storage Technologies (FAST’12). 33--48. Google ScholarDigital Library
W. Xia, H. Jiang, D. Feng, and Y. Hua. 2011. Silo: A similarity-locality based near-exact deduplication scheme with low ram overhead and high throughput. In Proceedings of the USENIX Annual Technical Conference (ATC’11). 285--298. Google ScholarDigital Library
B. Zhu, K. Li, and H. Patterson. 2008. Avoiding the disk bottleneck in the data domain deduplication file system. In Proceedings of the 6th USENIX Conference on File and Storage Technologies (FAST’08). 269--282. Google ScholarDigital Library
J. Ziv and A. Lempel. 1977. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 23, 3 (May 1977), 337--343. Google ScholarDigital Library

Index Terms

Efficient Hybrid Inline and Out-of-Line Deduplication for Backup Storage
1. Software and its engineering
  1. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        Memory management
        Secondary storage

Recommendations

A study of practical deduplication

We collected file system content data from 857 desktop computers at Microsoft over a span of 4 weeks. We analyzed the data to determine the relative efficacy of data deduplication, particularly considering whole-file versus block-level elimination of ...
Read More
Storage Deduplication by Virtual Large-Scale Disks
NBIS '12: Proceedings of the 2012 15th International Conference on Network-Based Information Systems

Recently, the demand of low cost large scale storages increases. We developed VLSD (Virtual Large Scale Disks) toolkit for constructing virtual disk based distributed storages, which aggregate free spaces of individual disks. VLSD realizes low-cost ...
Read More
A scalable deduplication and garbage collection engine for incremental backup
SYSTOR '13: Proceedings of the 6th International Systems and Storage Conference

Very large block-level data backup systems need scalable data deduplication and garbage collection techniques to make efficient use of the storage space and to minimize the performance overhead of doing so. Although the deduplication and garbage ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
ACM Transactions on Storage Volume 11, Issue 1
February 2015
100 pages
ISSN:1553-3077
EISSN:1553-3093
DOI:10.1145/2705611
Editor:
Darrell Long
University of California Santa Cruz, USA
Issue’s Table of Contents
Copyright © 2014 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 29 December 2014
- Accepted: 1 June 2014
- Received: 1 April 2014
Published in tos Volume 11, Issue 1

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Deduplication
backup storage
Qualifiers
- research-article
- Research
- Refereed
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 31
  Total Citations
  View Citations
- 597
  Total Downloads
- Downloads (Last 12 months)31
- Downloads (Last 6 weeks)6
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Efficient Hybrid Inline and Out-of-Line Deduplication for Backup Storage

ACM Transactions on Storage

Abstract

References

Cited By

Index Terms

Recommendations

A study of practical deduplication

Storage Deduplication by Virtual Large-Scale Disks

A scalable deduplication and garbage collection engine for incremental backup