Abstract
Deduplication aims to reduce duplicate data in storage systems by removing redundant copies of data blocks, which are compared to one another using fingerprints. However, repeated on-disk fingerprint lookups lead to high disk traffic, which results in a bottleneck.
In this article, we propose a “lazy” data deduplication method, which buffers incoming fingerprints that are used to perform on-disk lookups in batches, with the aim of improving subsequent prefetching. In deduplication in general, prefetching is used to improve the cache hit rate by exploiting locality within the incoming fingerprint stream. For lazy deduplication, we design a buffering strategy that preserves locality in order to facilitate prefetching. Furthermore, as the proportion of deduplication time spent on I/O decreases, the proportion spent on fingerprint calculation and chunking increases. Thus, we also utilize parallel approaches (utilizing multiple CPU cores and a graphics processing unit) to further improve the overall performance.
Experimental results indicate that the lazy method improves fingerprint identification performance by over 50% compared with an “eager” method with the same data layout. The GPU improves the hash calculation by a factor of 4.6 and multithreaded chunking by a factor of 4.16. Deduplication performance can be improved by over 45% on SSD and 80% on HDD in the last round on the real datasets.
- D. Bhagwat, K. Eshghi, D. D. E. Long, and M. Lillibridge. 2009. Extreme binning: Scalable, parallel deduplication for chunk-based file backup. In Proceedings of IEEE International Symposium on Modeling, Analysis 8 Simulation of Computer and Telecommunication Systems (MASCOTS’09). 1--9.Google Scholar
- P. Bhatotia, R. Rodrigues, and A. Verma. 2012. Shredder: GPU-accelerated incremental storage and computation. In Proceedings of USENIX Conference on File and Storage Technologies (FAST’12). 1--15.Google Scholar
- B. H. Bloom. 1970. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 7 (1970), 422--426.Google ScholarDigital Library
- P. Bose, H. Guo, E. Kranakis, A. Maheshwari, P. Morin, J. Morrison, M. Smid, and Y. Tang. 2008. On the false-positive rate of bloom filters. Inform. Process. Lett. 108, 4 (2008), 210--213.Google ScholarDigital Library
- F. C. Botelho, P. Shilane, N. Garg, and W. Hsu. 2013. Memory efficient sanitization of a deduplicated storage system. In Proceedings of USENIX Conference on File and Storage Technologies (FAST’13). 81--94.Google Scholar
- A. T. Clements, I. Ahmad, M. Vilayannur, and J. Li. 2009. Decentralized deduplication in SAN cluster file systems. In Proceedings of USENIX Annual Technical Conference (ATC’09). 101--114.Google Scholar
- B. Debnath, S. Sengupta, and J. Li. 2010. ChunkStash: Speeding up inline storage deduplication using flash memory. In Proceedings of USENIX Annual Technical Conference (ATC’10).Google Scholar
- B. Debnath, S. Sengupta, and J. Li. 2011. SkimpyStash: RAM space skimpy key-value store on flash-based storage. In Proceedings of ACM International Conference on Management of Data (SIGMOD’11). 25--36.Google Scholar
- M. Fu, D. Feng, Y. Hua, X. He, Z. Chen, W. Xia, F. Huang, and Q. Liu. 2014. Accelerating restore and garbage collection in deduplication-based backup systems via exploiting historical information. In Proceedings of USENIX Annual Technical Conference (ATC’14). 181--192.Google Scholar
- F. Guo and P. Efstathopoulos. 2011. Building a high-performance deduplication system. In Proceedings of USENIX Annual Technical Conference (ATC’11). 25--25.Google Scholar
- K. Jin and E. L. Miller. 2009. The effectiveness of deduplication on virtual machine disk images. In Proceedings of ACM Israeli Experimental Systems Conference (SYSTOR’09), Vol. 7.Google Scholar
- C. Kim, K.-W. Park, K. Park, and K. H. Park. 2011. Rethinking deduplication in cloud: From data profiling to blueprint. In Proceedings of IEEE International Conference on Networked Computing and Advanced Information Management (NCM’11). 101--104.Google Scholar
- J. Kim, C. Lee, S. Lee, I. Son, J. Choi, S. Yoon, H. u. Lee, S. Kang, Y. Won, and J. Cha. 2012a. Deduplication in SSDs: Model and quantitative analysis. In 2012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST’12). 1--12. DOI:http://dx.doi.org/10.1109/MSST.2012.6232379Google ScholarCross Ref
- J. Kim, C. Lee, S. Lee, I. Son, J. Choi, S. Yoon, H. u. Lee, S. Kang, Y. Won, and J. Cha. 2012b. Deduplication in SSDs: Model and quantitative analysis. In Proceedings of IEEE Symposium on Mass Storage Systems and Technologies (MSST’12). 1--12.Google Scholar
- R. Koller and R. Rangaswami. 2010. I/O deduplication: Utilizing content similarity to improve I/O performance. ACM Trans. Storage 6, 3 (2010), 13.Google ScholarDigital Library
- H. Li, T. Liang, and J. Chiu. 2013. A compound OpenMP/MPI program development toolkit for hybrid CPU/GPU clusters. J. Supercomput. 66, 1 (2013), 381--405.Google ScholarDigital Library
- X. Li and D. J. Lilja. 2009. A highly parallel GPU-based hash accelerator for a data deduplication system. In Proceedings of IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS’09). 268--275.Google Scholar
- M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar, G. Trezise, and P. Camble. 2009. Sparse indexing: Large scale, inline deduplication using sampling and locality. In Proceedings of USENIX Conference on File and Storage Technologies (FAST’09). 111--123.Google Scholar
- X. Lin, F. Douglis, J. Li, X. Li, R. Ricci, S. Smaldone, and G. Wallace. 2015a. Metadata considered harmful... to deduplication. In Proceedings of USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage’15). 11--16.Google Scholar
- X. Lin, M. Hibler, E. Eide, and R. Ricci. 2015b. Using deduplicating storage for efficient disk image deployment. EAI Endorsed Trans. Scalable Information Systems 2, 6 (2015), e1.Google Scholar
- G. Lu, Y. J. Nam, and D. H. C. Du. 2012. BloomStore: Bloom-filter based memory-efficient key-value store for indexing of data deduplication on flash. In Proceedings of IEEE Symposium on Mass Storage Systems and Technologies (MSST’12). 1--11.Google ScholarCross Ref
- J. Ma, R. J. Stones, Y. Ma, J. Wang, J. Ren, G. Wang, and X. Liu. 2016. Lazy exact deduplication. In Proceedings of IEEE 32th Symposium on Mass Storage Systems and Technologies (MSST’16). 1--10.Google Scholar
- L. Ma, C. Zhen, B. Zhao, J. Ma, G. Wang, and X. Liu. 2010. Towards fast de-duplication using low energy coprocessor. In Proceedings of IEEE International Conference on Networking, Architecture and Storage (NAS’10). 395--402.Google Scholar
- U. Manber. 1994. Finding similar files in a large file system. In Usenix Winter, Vol. 94. 1--10.Google ScholarDigital Library
- B. Mao, H. Jiang, S. Wu, Y. Fu, and L. Tian. 2014. Read-performance optimization for deduplication-based storage systems in the cloud. ACM Trans. Storage (TOS) 10, 2 (2014), 6:1--6:22.Google Scholar
- D. Meister and A. Brinkmann. 2010. dedupv1: Improving deduplication throughput using solid state drives (SSD). In Proceedings of IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST’10). 1--6.Google Scholar
- D. T. Meyer and W. J. Bolosky. 2011. A study of practical deduplication. In Proceedings of USENIX Conference on File and Storage Technologies (FAST’11) (2011), 1--13.Google Scholar
- D. T. Meyer and W. J. Bolosky. 2012. A study of practical deduplication. ACM Trans. Storage (TOS) 7, 4 (2012), 14:1--14:20.Google Scholar
- J. Min, D. Yoon, and Y. Won. 2011. Efficient deduplication techniques for modern backup operation. IEEE Trans. Comput. 60, 6 (2011), 824--840.Google ScholarDigital Library
- C.-H. Ng and P. P. C. Lee. 2013. RevDedup: A reverse deduplication storage system optimized for reads to latest backups. In Proceedings of ACM Asia-Pacific Workshop on Systems, Vol. 15.Google ScholarDigital Library
- NVIDIA. 2013. NVIDIA CUDA. Retrieved from https://developer.nvidia.com/cuda-downloads (July 2013).Google Scholar
- R. Pagh and F. F. Rodler. 2001. Cuckoo Hashing. Springer.Google Scholar
- J. Paulo and J. Pereira. 2014. A survey and classification of storage deduplication systems. ACM Comput. Surv. (CSUR) 47, 1 (2014), 11:1--11:30.Google Scholar
- C. Policroniades and I. Pratt. 2004. Alternatives for detecting redundancy in storage systems data. In Proceedings of USENIX Annual Technical Conference (ATC’04). 73--86.Google Scholar
- S. Quinlan and S. Dorward. 2002. Venti: A new approach to archival data storage. In Proceedings of USENIX Conference on File and Storage Technologies (FAST’02), Vol. 4. 89--102.Google Scholar
- M. O. Rabin. 1981. Fingerprinting by Random Polynomials. Center for Research in Computing Technology, Aiken Computation Laboratory.Google Scholar
- P. Shilane, M. Huang, G. Wallace, and W. Hsu. 2012. WAN-optimized replication of backup datasets using stream-informed delta compression. ACM Trans. Storage (TOS) 8, 4 (2012), 13:1--13:26.Google Scholar
- K. Srinivasan, T. Bisson, G. Goodson, and K. Voruganti. 2012. iDedup: Latency-aware, inline data deduplication for primary storage. In Proceedings of USENIX Conference on File and Storage Technologies (FAST’12). 24--24.Google Scholar
- Y. Tan, Z. Yan, D. Feng, X. He, Q. Zou, and L. Yang. 2015. De-Frag: An efficient scheme to improve deduplication performance via reducing data placement de-linearization. Cluster Comput. 18, 1 (2015), 79--92.Google ScholarDigital Library
- V. Tarasov, A. Mudrankit, W. Buik, P. Shilane, G. Kuenning, and E. Zadok. 2012. Generating realistic datasets for deduplication analysis. In Proceedings of USENIX Annual Technical Conference (ATC’12). 261--272.Google Scholar
- VIA Technologies. 2008. VIA nano processor. Retrieved from http://www.viatech.com.cn/cn/downloads/whitepapers/processors/WP080529VIA_Nano.pdf.Google Scholar
- M. M. Waldrop. 2016. The chips are down for moores law. Nature 530 (2016), 144--147.Google ScholarCross Ref
- W. Xia, H. Jiang, D. Feng, and Y. Hua. 2015. Similarity and locality based indexing for high performance data deduplication. IEEE Trans. Comput. 64, 4 (2015), 1162--1176.Google ScholarCross Ref
- W. Xia, H. Jiang, D. Feng, and H. Yu. 2011. SiLo: A similarity-locality based near-exact deduplication scheme with low RAM overhead and high throughput. In Proceedings of USENIX Annual Technical Conference (ATC’11). 26--28.Google Scholar
- B. Zhu, K. Li, and H. Patterson. 2008. Avoiding the disk bottleneck in the data domain deduplication file system. In Proceedings of the USENIX Conference on File and Storage Technologies (FAST’08). 269--282.Google Scholar
Index Terms
- Lazy Exact Deduplication
Recommendations
A study of practical deduplication
We collected file system content data from 857 desktop computers at Microsoft over a span of 4 weeks. We analyzed the data to determine the relative efficacy of data deduplication, particularly considering whole-file versus block-level elimination of ...
Storage Deduplication by Virtual Large-Scale Disks
NBIS '12: Proceedings of the 2012 15th International Conference on Network-Based Information SystemsRecently, the demand of low cost large scale storages increases. We developed VLSD (Virtual Large Scale Disks) toolkit for constructing virtual disk based distributed storages, which aggregate free spaces of individual disks. VLSD realizes low-cost ...
Design and Implementation of Deduplication on F2FS
Data deduplication technology has gained popularity in modern file systems due to its ability to eliminate redundant writes and improve storage space efficiency. In recent years, the flash-friendly file system (F2FS) has been widely adopted in flash ...
Comments