research-article

Lazy Exact Deduplication

Authors:
Jingwei Ma

College of Computer and Control Engineering, Nankai University, Jinnan District, Tianjin, CN

College of Computer and Control Engineering, Nankai University, Jinnan District, Tianjin, CN
View Profile

,
Rebecca J. Stones

College of Computer and Control Engineering, Nankai University, Jinnan District, Tianjin, CN

College of Computer and Control Engineering, Nankai University, Jinnan District, Tianjin, CN
View Profile

,
Yuxiang Ma

College of Computer and Control Engineering, Nankai University, Jinnan District, Tianjin, CN

College of Computer and Control Engineering, Nankai University, Jinnan District, Tianjin, CN
View Profile

,
Jingui Wang

College of Computer and Control Engineering, Nankai University, Jinnan District, Tianjin, CN

College of Computer and Control Engineering, Nankai University, Jinnan District, Tianjin, CN
View Profile

,
Junjie Ren

College of Computer and Control Engineering, Nankai University, Jinnan District, Tianjin, CN

College of Computer and Control Engineering, Nankai University, Jinnan District, Tianjin, CN
View Profile

,
Gang Wang

College of Computer and Control Engineering, Nankai University, Jinnan District, Tianjin, CN

College of Computer and Control Engineering, Nankai University, Jinnan District, Tianjin, CN
View Profile

,
Xiaoguang Liu

College of Computer and Control Engineering, Nankai University, Jinnan District, Tianjin, CN

College of Computer and Control Engineering, Nankai University, Jinnan District, Tianjin, CN
View Profile

Authors Info & Claims

ACM Transactions on Storage Volume 13 Issue 2Article No.: 11pp 1–26https://doi.org/10.1145/3078837

Published:10 June 2017Publication History

ACM Transactions on Storage

Abstract

Deduplication aims to reduce duplicate data in storage systems by removing redundant copies of data blocks, which are compared to one another using fingerprints. However, repeated on-disk fingerprint lookups lead to high disk traffic, which results in a bottleneck.

In this article, we propose a “lazy” data deduplication method, which buffers incoming fingerprints that are used to perform on-disk lookups in batches, with the aim of improving subsequent prefetching. In deduplication in general, prefetching is used to improve the cache hit rate by exploiting locality within the incoming fingerprint stream. For lazy deduplication, we design a buffering strategy that preserves locality in order to facilitate prefetching. Furthermore, as the proportion of deduplication time spent on I/O decreases, the proportion spent on fingerprint calculation and chunking increases. Thus, we also utilize parallel approaches (utilizing multiple CPU cores and a graphics processing unit) to further improve the overall performance.

Experimental results indicate that the lazy method improves fingerprint identification performance by over 50% compared with an “eager” method with the same data layout. The GPU improves the hash calculation by a factor of 4.6 and multithreaded chunking by a factor of 4.16. Deduplication performance can be improved by over 45% on SSD and 80% on HDD in the last round on the real datasets.

References

D. Bhagwat, K. Eshghi, D. D. E. Long, and M. Lillibridge. 2009. Extreme binning: Scalable, parallel deduplication for chunk-based file backup. In Proceedings of IEEE International Symposium on Modeling, Analysis 8 Simulation of Computer and Telecommunication Systems (MASCOTS’09). 1--9.Google Scholar
P. Bhatotia, R. Rodrigues, and A. Verma. 2012. Shredder: GPU-accelerated incremental storage and computation. In Proceedings of USENIX Conference on File and Storage Technologies (FAST’12). 1--15.Google Scholar
B. H. Bloom. 1970. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 7 (1970), 422--426.Google ScholarDigital Library
P. Bose, H. Guo, E. Kranakis, A. Maheshwari, P. Morin, J. Morrison, M. Smid, and Y. Tang. 2008. On the false-positive rate of bloom filters. Inform. Process. Lett. 108, 4 (2008), 210--213.Google ScholarDigital Library
F. C. Botelho, P. Shilane, N. Garg, and W. Hsu. 2013. Memory efficient sanitization of a deduplicated storage system. In Proceedings of USENIX Conference on File and Storage Technologies (FAST’13). 81--94.Google Scholar
A. T. Clements, I. Ahmad, M. Vilayannur, and J. Li. 2009. Decentralized deduplication in SAN cluster file systems. In Proceedings of USENIX Annual Technical Conference (ATC’09). 101--114.Google Scholar
B. Debnath, S. Sengupta, and J. Li. 2010. ChunkStash: Speeding up inline storage deduplication using flash memory. In Proceedings of USENIX Annual Technical Conference (ATC’10).Google Scholar
B. Debnath, S. Sengupta, and J. Li. 2011. SkimpyStash: RAM space skimpy key-value store on flash-based storage. In Proceedings of ACM International Conference on Management of Data (SIGMOD’11). 25--36.Google Scholar
M. Fu, D. Feng, Y. Hua, X. He, Z. Chen, W. Xia, F. Huang, and Q. Liu. 2014. Accelerating restore and garbage collection in deduplication-based backup systems via exploiting historical information. In Proceedings of USENIX Annual Technical Conference (ATC’14). 181--192.Google Scholar
F. Guo and P. Efstathopoulos. 2011. Building a high-performance deduplication system. In Proceedings of USENIX Annual Technical Conference (ATC’11). 25--25.Google Scholar
K. Jin and E. L. Miller. 2009. The effectiveness of deduplication on virtual machine disk images. In Proceedings of ACM Israeli Experimental Systems Conference (SYSTOR’09), Vol. 7.Google Scholar
C. Kim, K.-W. Park, K. Park, and K. H. Park. 2011. Rethinking deduplication in cloud: From data profiling to blueprint. In Proceedings of IEEE International Conference on Networked Computing and Advanced Information Management (NCM’11). 101--104.Google Scholar
J. Kim, C. Lee, S. Lee, I. Son, J. Choi, S. Yoon, H. u. Lee, S. Kang, Y. Won, and J. Cha. 2012a. Deduplication in SSDs: Model and quantitative analysis. In 2012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST’12). 1--12. DOI:http://dx.doi.org/10.1109/MSST.2012.6232379Google ScholarCross Ref
J. Kim, C. Lee, S. Lee, I. Son, J. Choi, S. Yoon, H. u. Lee, S. Kang, Y. Won, and J. Cha. 2012b. Deduplication in SSDs: Model and quantitative analysis. In Proceedings of IEEE Symposium on Mass Storage Systems and Technologies (MSST’12). 1--12.Google Scholar
R. Koller and R. Rangaswami. 2010. I/O deduplication: Utilizing content similarity to improve I/O performance. ACM Trans. Storage 6, 3 (2010), 13.Google ScholarDigital Library
H. Li, T. Liang, and J. Chiu. 2013. A compound OpenMP/MPI program development toolkit for hybrid CPU/GPU clusters. J. Supercomput. 66, 1 (2013), 381--405.Google ScholarDigital Library
X. Li and D. J. Lilja. 2009. A highly parallel GPU-based hash accelerator for a data deduplication system. In Proceedings of IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS’09). 268--275.Google Scholar
M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar, G. Trezise, and P. Camble. 2009. Sparse indexing: Large scale, inline deduplication using sampling and locality. In Proceedings of USENIX Conference on File and Storage Technologies (FAST’09). 111--123.Google Scholar
X. Lin, F. Douglis, J. Li, X. Li, R. Ricci, S. Smaldone, and G. Wallace. 2015a. Metadata considered harmful... to deduplication. In Proceedings of USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage’15). 11--16.Google Scholar
X. Lin, M. Hibler, E. Eide, and R. Ricci. 2015b. Using deduplicating storage for efficient disk image deployment. EAI Endorsed Trans. Scalable Information Systems 2, 6 (2015), e1.Google Scholar
G. Lu, Y. J. Nam, and D. H. C. Du. 2012. BloomStore: Bloom-filter based memory-efficient key-value store for indexing of data deduplication on flash. In Proceedings of IEEE Symposium on Mass Storage Systems and Technologies (MSST’12). 1--11.Google ScholarCross Ref
J. Ma, R. J. Stones, Y. Ma, J. Wang, J. Ren, G. Wang, and X. Liu. 2016. Lazy exact deduplication. In Proceedings of IEEE 32th Symposium on Mass Storage Systems and Technologies (MSST’16). 1--10.Google Scholar
L. Ma, C. Zhen, B. Zhao, J. Ma, G. Wang, and X. Liu. 2010. Towards fast de-duplication using low energy coprocessor. In Proceedings of IEEE International Conference on Networking, Architecture and Storage (NAS’10). 395--402.Google Scholar
U. Manber. 1994. Finding similar files in a large file system. In Usenix Winter, Vol. 94. 1--10.Google ScholarDigital Library
B. Mao, H. Jiang, S. Wu, Y. Fu, and L. Tian. 2014. Read-performance optimization for deduplication-based storage systems in the cloud. ACM Trans. Storage (TOS) 10, 2 (2014), 6:1--6:22.Google Scholar
D. Meister and A. Brinkmann. 2010. dedupv1: Improving deduplication throughput using solid state drives (SSD). In Proceedings of IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST’10). 1--6.Google Scholar
D. T. Meyer and W. J. Bolosky. 2011. A study of practical deduplication. In Proceedings of USENIX Conference on File and Storage Technologies (FAST’11) (2011), 1--13.Google Scholar
D. T. Meyer and W. J. Bolosky. 2012. A study of practical deduplication. ACM Trans. Storage (TOS) 7, 4 (2012), 14:1--14:20.Google Scholar
J. Min, D. Yoon, and Y. Won. 2011. Efficient deduplication techniques for modern backup operation. IEEE Trans. Comput. 60, 6 (2011), 824--840.Google ScholarDigital Library
C.-H. Ng and P. P. C. Lee. 2013. RevDedup: A reverse deduplication storage system optimized for reads to latest backups. In Proceedings of ACM Asia-Pacific Workshop on Systems, Vol. 15.Google ScholarDigital Library
NVIDIA. 2013. NVIDIA CUDA. Retrieved from https://developer.nvidia.com/cuda-downloads (July 2013).Google Scholar
R. Pagh and F. F. Rodler. 2001. Cuckoo Hashing. Springer.Google Scholar
J. Paulo and J. Pereira. 2014. A survey and classification of storage deduplication systems. ACM Comput. Surv. (CSUR) 47, 1 (2014), 11:1--11:30.Google Scholar
C. Policroniades and I. Pratt. 2004. Alternatives for detecting redundancy in storage systems data. In Proceedings of USENIX Annual Technical Conference (ATC’04). 73--86.Google Scholar
S. Quinlan and S. Dorward. 2002. Venti: A new approach to archival data storage. In Proceedings of USENIX Conference on File and Storage Technologies (FAST’02), Vol. 4. 89--102.Google Scholar
M. O. Rabin. 1981. Fingerprinting by Random Polynomials. Center for Research in Computing Technology, Aiken Computation Laboratory.Google Scholar
P. Shilane, M. Huang, G. Wallace, and W. Hsu. 2012. WAN-optimized replication of backup datasets using stream-informed delta compression. ACM Trans. Storage (TOS) 8, 4 (2012), 13:1--13:26.Google Scholar
K. Srinivasan, T. Bisson, G. Goodson, and K. Voruganti. 2012. iDedup: Latency-aware, inline data deduplication for primary storage. In Proceedings of USENIX Conference on File and Storage Technologies (FAST’12). 24--24.Google Scholar
Y. Tan, Z. Yan, D. Feng, X. He, Q. Zou, and L. Yang. 2015. De-Frag: An efficient scheme to improve deduplication performance via reducing data placement de-linearization. Cluster Comput. 18, 1 (2015), 79--92.Google ScholarDigital Library
V. Tarasov, A. Mudrankit, W. Buik, P. Shilane, G. Kuenning, and E. Zadok. 2012. Generating realistic datasets for deduplication analysis. In Proceedings of USENIX Annual Technical Conference (ATC’12). 261--272.Google Scholar
VIA Technologies. 2008. VIA nano processor. Retrieved from http://www.viatech.com.cn/cn/downloads/whitepapers/processors/WP080529VIA_Nano.pdf.Google Scholar
M. M. Waldrop. 2016. The chips are down for moores law. Nature 530 (2016), 144--147.Google ScholarCross Ref
W. Xia, H. Jiang, D. Feng, and Y. Hua. 2015. Similarity and locality based indexing for high performance data deduplication. IEEE Trans. Comput. 64, 4 (2015), 1162--1176.Google ScholarCross Ref
W. Xia, H. Jiang, D. Feng, and H. Yu. 2011. SiLo: A similarity-locality based near-exact deduplication scheme with low RAM overhead and high throughput. In Proceedings of USENIX Annual Technical Conference (ATC’11). 26--28.Google Scholar
B. Zhu, K. Li, and H. Patterson. 2008. Avoiding the disk bottleneck in the data domain deduplication file system. In Proceedings of the USENIX Conference on File and Storage Technologies (FAST’08). 269--282.Google Scholar

Index Terms

Lazy Exact Deduplication
1. Information systems
  1. Data management systems
    1. Information integration
      1. Deduplication
  2. Information systems applications
    1. Digital libraries and archives

Recommendations

A study of practical deduplication

We collected file system content data from 857 desktop computers at Microsoft over a span of 4 weeks. We analyzed the data to determine the relative efficacy of data deduplication, particularly considering whole-file versus block-level elimination of ...
Read More
Storage Deduplication by Virtual Large-Scale Disks
NBIS '12: Proceedings of the 2012 15th International Conference on Network-Based Information Systems

Recently, the demand of low cost large scale storages increases. We developed VLSD (Virtual Large Scale Disks) toolkit for constructing virtual disk based distributed storages, which aggregate free spaces of individual disks. VLSD realizes low-cost ...
Read More
Design and Implementation of Deduplication on F2FS
Data deduplication technology has gained popularity in modern file systems due to its ability to eliminate redundant writes and improve storage space efficiency. In recent years, the flash-friendly file system (F2FS) has been widely adopted in flash ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
ACM Transactions on Storage Volume 13, Issue 2
Special Issue on MSST 2016 and Regular Papers
May 2017
199 pages
ISSN:1553-3077
EISSN:1553-3093
DOI:10.1145/3098275
Editor:
Sam H. Noh
Ulsan National Institute of Science and Technology, Ulsan, Korea
Issue’s Table of Contents
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 10 June 2017
- Accepted: 1 March 2017
- Revised: 1 February 2017
- Received: 1 November 2016
Published in tos Volume 13, Issue 2

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Deduplication
GPU
disk I/O
lazy method
Qualifiers
- research-article
- Research
- Refereed
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 12
  Total Citations
  View Citations
- 403
  Total Downloads
- Downloads (Last 12 months)26
- Downloads (Last 6 weeks)7
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Lazy Exact Deduplication

ACM Transactions on Storage

Abstract

References

Cited By

Index Terms

Recommendations

A study of practical deduplication

Storage Deduplication by Virtual Large-Scale Disks

Design and Implementation of Deduplication on F2FS

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Lazy Exact Deduplication

ACM Transactions on Storage

Abstract

References

Cited By

Index Terms

Recommendations

A study of practical deduplication

Storage Deduplication by Virtual Large-Scale Disks

Design and Implementation of Deduplication on F2FS

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media