research-article

Efficient Deduplication in a Distributed Primary Storage Infrastructure

Published: 20 May 2016

Abstract

A large amount of duplicate data typically exists across volumes of virtual machines in cloud computing infrastructures. Deduplication allows reclaiming these duplicates while improving the cost-effectiveness of large-scale multitenant infrastructures. However, traditional archival and backup deduplication systems impose prohibitive storage I/O overhead for virtual machines hosting latency-sensitive applications. Primary deduplication systems reduce this penalty but rely on special cluster filesystems, centralized components, or restrictive workload assumptions. Also, some of these systems reduce storage I/O overhead by confining deduplication to off-peak periods, which may be scarce in a cloud environment.

We present DEDIS, a dependable and fully decentralized system that performs cluster-wide off-line deduplication of virtual machines’ primary volumes. DEDIS works on top of any unsophisticated storage backend, centralized or distributed, as long as it exports a basic shared block device interface. Also, DEDIS does not rely on data locality assumptions and incorporates novel optimizations for reducing deduplication overhead and increasing its reliability.

The evaluation of an open-source prototype shows that minimal I/O overhead is achievable even when deduplication and intensive storage I/O are executed simultaneously. Also, our design scales out and allows collocating DEDIS components and virtual machines on the same servers, thus sparing the need for additional hardware.


        • Published in

          ACM Transactions on Storage, Volume 12, Issue 4
          August 2016
          213 pages
          ISSN: 1553-3077
          EISSN: 1553-3093
          DOI: 10.1145/2940403

          Copyright © 2016 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 20 May 2016
          • Accepted: 1 January 2016
          • Revised: 1 September 2015
          • Received: 1 October 2014


          Qualifiers

          • Research
          • Refereed
