skip to main content
research-article

Lazy Exact Deduplication

Authors Info & Claims
Published:10 June 2017Publication History
Skip Abstract Section

Abstract

Deduplication aims to reduce duplicate data in storage systems by removing redundant copies of data blocks, which are compared to one another using fingerprints. However, repeated on-disk fingerprint lookups lead to high disk traffic, which results in a bottleneck.

In this article, we propose a “lazy” data deduplication method, which buffers incoming fingerprints that are used to perform on-disk lookups in batches, with the aim of improving subsequent prefetching. In deduplication in general, prefetching is used to improve the cache hit rate by exploiting locality within the incoming fingerprint stream. For lazy deduplication, we design a buffering strategy that preserves locality in order to facilitate prefetching. Furthermore, as the proportion of deduplication time spent on I/O decreases, the proportion spent on fingerprint calculation and chunking increases. Thus, we also utilize parallel approaches (utilizing multiple CPU cores and a graphics processing unit) to further improve the overall performance.

Experimental results indicate that the lazy method improves fingerprint identification performance by over 50% compared with an “eager” method with the same data layout. The GPU improves the hash calculation by a factor of 4.6 and multithreaded chunking by a factor of 4.16. Deduplication performance can be improved by over 45% on SSD and 80% on HDD in the last round on the real datasets.

References

  1. D. Bhagwat, K. Eshghi, D. D. E. Long, and M. Lillibridge. 2009. Extreme binning: Scalable, parallel deduplication for chunk-based file backup. In Proceedings of IEEE International Symposium on Modeling, Analysis 8 Simulation of Computer and Telecommunication Systems (MASCOTS’09). 1--9.Google ScholarGoogle Scholar
  2. P. Bhatotia, R. Rodrigues, and A. Verma. 2012. Shredder: GPU-accelerated incremental storage and computation. In Proceedings of USENIX Conference on File and Storage Technologies (FAST’12). 1--15.Google ScholarGoogle Scholar
  3. B. H. Bloom. 1970. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 7 (1970), 422--426.Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. P. Bose, H. Guo, E. Kranakis, A. Maheshwari, P. Morin, J. Morrison, M. Smid, and Y. Tang. 2008. On the false-positive rate of bloom filters. Inform. Process. Lett. 108, 4 (2008), 210--213.Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. F. C. Botelho, P. Shilane, N. Garg, and W. Hsu. 2013. Memory efficient sanitization of a deduplicated storage system. In Proceedings of USENIX Conference on File and Storage Technologies (FAST’13). 81--94.Google ScholarGoogle Scholar
  6. A. T. Clements, I. Ahmad, M. Vilayannur, and J. Li. 2009. Decentralized deduplication in SAN cluster file systems. In Proceedings of USENIX Annual Technical Conference (ATC’09). 101--114.Google ScholarGoogle Scholar
  7. B. Debnath, S. Sengupta, and J. Li. 2010. ChunkStash: Speeding up inline storage deduplication using flash memory. In Proceedings of USENIX Annual Technical Conference (ATC’10).Google ScholarGoogle Scholar
  8. B. Debnath, S. Sengupta, and J. Li. 2011. SkimpyStash: RAM space skimpy key-value store on flash-based storage. In Proceedings of ACM International Conference on Management of Data (SIGMOD’11). 25--36.Google ScholarGoogle Scholar
  9. M. Fu, D. Feng, Y. Hua, X. He, Z. Chen, W. Xia, F. Huang, and Q. Liu. 2014. Accelerating restore and garbage collection in deduplication-based backup systems via exploiting historical information. In Proceedings of USENIX Annual Technical Conference (ATC’14). 181--192.Google ScholarGoogle Scholar
  10. F. Guo and P. Efstathopoulos. 2011. Building a high-performance deduplication system. In Proceedings of USENIX Annual Technical Conference (ATC’11). 25--25.Google ScholarGoogle Scholar
  11. K. Jin and E. L. Miller. 2009. The effectiveness of deduplication on virtual machine disk images. In Proceedings of ACM Israeli Experimental Systems Conference (SYSTOR’09), Vol. 7.Google ScholarGoogle Scholar
  12. C. Kim, K.-W. Park, K. Park, and K. H. Park. 2011. Rethinking deduplication in cloud: From data profiling to blueprint. In Proceedings of IEEE International Conference on Networked Computing and Advanced Information Management (NCM’11). 101--104.Google ScholarGoogle Scholar
  13. J. Kim, C. Lee, S. Lee, I. Son, J. Choi, S. Yoon, H. u. Lee, S. Kang, Y. Won, and J. Cha. 2012a. Deduplication in SSDs: Model and quantitative analysis. In 2012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST’12). 1--12. DOI:http://dx.doi.org/10.1109/MSST.2012.6232379Google ScholarGoogle ScholarCross RefCross Ref
  14. J. Kim, C. Lee, S. Lee, I. Son, J. Choi, S. Yoon, H. u. Lee, S. Kang, Y. Won, and J. Cha. 2012b. Deduplication in SSDs: Model and quantitative analysis. In Proceedings of IEEE Symposium on Mass Storage Systems and Technologies (MSST’12). 1--12.Google ScholarGoogle Scholar
  15. R. Koller and R. Rangaswami. 2010. I/O deduplication: Utilizing content similarity to improve I/O performance. ACM Trans. Storage 6, 3 (2010), 13.Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. H. Li, T. Liang, and J. Chiu. 2013. A compound OpenMP/MPI program development toolkit for hybrid CPU/GPU clusters. J. Supercomput. 66, 1 (2013), 381--405.Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. X. Li and D. J. Lilja. 2009. A highly parallel GPU-based hash accelerator for a data deduplication system. In Proceedings of IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS’09). 268--275.Google ScholarGoogle Scholar
  18. M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar, G. Trezise, and P. Camble. 2009. Sparse indexing: Large scale, inline deduplication using sampling and locality. In Proceedings of USENIX Conference on File and Storage Technologies (FAST’09). 111--123.Google ScholarGoogle Scholar
  19. X. Lin, F. Douglis, J. Li, X. Li, R. Ricci, S. Smaldone, and G. Wallace. 2015a. Metadata considered harmful... to deduplication. In Proceedings of USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage’15). 11--16.Google ScholarGoogle Scholar
  20. X. Lin, M. Hibler, E. Eide, and R. Ricci. 2015b. Using deduplicating storage for efficient disk image deployment. EAI Endorsed Trans. Scalable Information Systems 2, 6 (2015), e1.Google ScholarGoogle Scholar
  21. G. Lu, Y. J. Nam, and D. H. C. Du. 2012. BloomStore: Bloom-filter based memory-efficient key-value store for indexing of data deduplication on flash. In Proceedings of IEEE Symposium on Mass Storage Systems and Technologies (MSST’12). 1--11.Google ScholarGoogle ScholarCross RefCross Ref
  22. J. Ma, R. J. Stones, Y. Ma, J. Wang, J. Ren, G. Wang, and X. Liu. 2016. Lazy exact deduplication. In Proceedings of IEEE 32th Symposium on Mass Storage Systems and Technologies (MSST’16). 1--10.Google ScholarGoogle Scholar
  23. L. Ma, C. Zhen, B. Zhao, J. Ma, G. Wang, and X. Liu. 2010. Towards fast de-duplication using low energy coprocessor. In Proceedings of IEEE International Conference on Networking, Architecture and Storage (NAS’10). 395--402.Google ScholarGoogle Scholar
  24. U. Manber. 1994. Finding similar files in a large file system. In Usenix Winter, Vol. 94. 1--10.Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. B. Mao, H. Jiang, S. Wu, Y. Fu, and L. Tian. 2014. Read-performance optimization for deduplication-based storage systems in the cloud. ACM Trans. Storage (TOS) 10, 2 (2014), 6:1--6:22.Google ScholarGoogle Scholar
  26. D. Meister and A. Brinkmann. 2010. dedupv1: Improving deduplication throughput using solid state drives (SSD). In Proceedings of IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST’10). 1--6.Google ScholarGoogle Scholar
  27. D. T. Meyer and W. J. Bolosky. 2011. A study of practical deduplication. In Proceedings of USENIX Conference on File and Storage Technologies (FAST’11) (2011), 1--13.Google ScholarGoogle Scholar
  28. D. T. Meyer and W. J. Bolosky. 2012. A study of practical deduplication. ACM Trans. Storage (TOS) 7, 4 (2012), 14:1--14:20.Google ScholarGoogle Scholar
  29. J. Min, D. Yoon, and Y. Won. 2011. Efficient deduplication techniques for modern backup operation. IEEE Trans. Comput. 60, 6 (2011), 824--840.Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. C.-H. Ng and P. P. C. Lee. 2013. RevDedup: A reverse deduplication storage system optimized for reads to latest backups. In Proceedings of ACM Asia-Pacific Workshop on Systems, Vol. 15.Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. NVIDIA. 2013. NVIDIA CUDA. Retrieved from https://developer.nvidia.com/cuda-downloads (July 2013).Google ScholarGoogle Scholar
  32. R. Pagh and F. F. Rodler. 2001. Cuckoo Hashing. Springer.Google ScholarGoogle Scholar
  33. J. Paulo and J. Pereira. 2014. A survey and classification of storage deduplication systems. ACM Comput. Surv. (CSUR) 47, 1 (2014), 11:1--11:30.Google ScholarGoogle Scholar
  34. C. Policroniades and I. Pratt. 2004. Alternatives for detecting redundancy in storage systems data. In Proceedings of USENIX Annual Technical Conference (ATC’04). 73--86.Google ScholarGoogle Scholar
  35. S. Quinlan and S. Dorward. 2002. Venti: A new approach to archival data storage. In Proceedings of USENIX Conference on File and Storage Technologies (FAST’02), Vol. 4. 89--102.Google ScholarGoogle Scholar
  36. M. O. Rabin. 1981. Fingerprinting by Random Polynomials. Center for Research in Computing Technology, Aiken Computation Laboratory.Google ScholarGoogle Scholar
  37. P. Shilane, M. Huang, G. Wallace, and W. Hsu. 2012. WAN-optimized replication of backup datasets using stream-informed delta compression. ACM Trans. Storage (TOS) 8, 4 (2012), 13:1--13:26.Google ScholarGoogle Scholar
  38. K. Srinivasan, T. Bisson, G. Goodson, and K. Voruganti. 2012. iDedup: Latency-aware, inline data deduplication for primary storage. In Proceedings of USENIX Conference on File and Storage Technologies (FAST’12). 24--24.Google ScholarGoogle Scholar
  39. Y. Tan, Z. Yan, D. Feng, X. He, Q. Zou, and L. Yang. 2015. De-Frag: An efficient scheme to improve deduplication performance via reducing data placement de-linearization. Cluster Comput. 18, 1 (2015), 79--92.Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. V. Tarasov, A. Mudrankit, W. Buik, P. Shilane, G. Kuenning, and E. Zadok. 2012. Generating realistic datasets for deduplication analysis. In Proceedings of USENIX Annual Technical Conference (ATC’12). 261--272.Google ScholarGoogle Scholar
  41. VIA Technologies. 2008. VIA nano processor. Retrieved from http://www.viatech.com.cn/cn/downloads/whitepapers/processors/WP080529VIA_Nano.pdf.Google ScholarGoogle Scholar
  42. M. M. Waldrop. 2016. The chips are down for moores law. Nature 530 (2016), 144--147.Google ScholarGoogle ScholarCross RefCross Ref
  43. W. Xia, H. Jiang, D. Feng, and Y. Hua. 2015. Similarity and locality based indexing for high performance data deduplication. IEEE Trans. Comput. 64, 4 (2015), 1162--1176.Google ScholarGoogle ScholarCross RefCross Ref
  44. W. Xia, H. Jiang, D. Feng, and H. Yu. 2011. SiLo: A similarity-locality based near-exact deduplication scheme with low RAM overhead and high throughput. In Proceedings of USENIX Annual Technical Conference (ATC’11). 26--28.Google ScholarGoogle Scholar
  45. B. Zhu, K. Li, and H. Patterson. 2008. Avoiding the disk bottleneck in the data domain deduplication file system. In Proceedings of the USENIX Conference on File and Storage Technologies (FAST’08). 269--282.Google ScholarGoogle Scholar

Index Terms

  1. Lazy Exact Deduplication

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in

      Full Access

      • Published in

        cover image ACM Transactions on Storage
        ACM Transactions on Storage  Volume 13, Issue 2
        Special Issue on MSST 2016 and Regular Papers
        May 2017
        199 pages
        ISSN:1553-3077
        EISSN:1553-3093
        DOI:10.1145/3098275
        • Editor:
        • Sam H. Noh
        Issue’s Table of Contents

        Copyright © 2017 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 10 June 2017
        • Accepted: 1 March 2017
        • Revised: 1 February 2017
        • Received: 1 November 2016
        Published in tos Volume 13, Issue 2

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article
        • Research
        • Refereed

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader