A Survey and Classification of Storage Deduplication Systems

Authors:
João Paulo, High-Assurance Software Lab (HASLab), INESC TEC & University of Minho, Braga, Portugal
José Pereira, High-Assurance Software Lab (HASLab), INESC TEC & University of Minho, Braga, Portugal

ACM Computing Surveys, Volume 47, Issue 1, Article No. 11, pp. 1–30. https://doi.org/10.1145/2611778

Published: 01 June 2014

Abstract

The automatic elimination of duplicate data in a storage system, commonly known as deduplication, is increasingly accepted as an effective technique to reduce storage costs. It has therefore been applied to different storage types, including archives and backups, primary storage, solid-state drives, and even random access memory. Although the general approach to deduplication is shared by all storage types, each poses specific challenges and leads to different trade-offs and solutions. This diversity is often misunderstood, which causes the relevance of new research and development to be underestimated.

The first contribution of this article is a classification of deduplication systems according to six criteria that correspond to key design decisions: granularity, locality, timing, indexing, technique, and scope. This classification identifies and describes the different approaches used for each of them. As a second contribution, we describe which combinations of these design decisions have been proposed and found more useful for challenges in each storage type. Finally, outstanding research challenges and unexplored design points are identified and discussed.

Index Terms

A Survey and Classification of Storage Deduplication Systems
1. General and reference
  1. Document types
    1. Surveys and overviews

Published in

ACM Computing Surveys Volume 47, Issue 1
July 2014
551 pages
ISSN:0360-0300
EISSN:1557-7341
DOI:10.1145/2620784

Copyright © 2014 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 June 2014
- Accepted: 1 April 2014
- Revised: 1 October 2012
- Received: 1 April 2012
Author Tags
Storage management
deduplication
file systems
Qualifiers
- research-article
- Research
- Refereed