Published in: The Journal of Supercomputing 5/2018

20-12-2017

Data deduplication techniques for efficient cloud storage management: a systematic review

Authors: Ravneet Kaur, Inderveer Chana, Jhilik Bhattacharya

Abstract

The exponential growth of digital data in cloud storage systems is a critical issue, because the large amount of duplicate data held in these systems places an extra load on them. Deduplication is an efficient technique that has gained attention in large-scale storage systems: it eliminates redundant data, improves storage utilization and reduces storage cost. This paper presents a broad, methodical literature review of existing data deduplication techniques, along with existing taxonomies of deduplication techniques for cloud data storage. Furthermore, the paper investigates deduplication techniques for text and multimedia data, along with their corresponding taxonomies, as these data types pose different challenges for duplicate detection. This research work is useful for identifying deduplication techniques based on text, image and video data. It also discusses existing challenges and significant research directions in deduplication for future researchers, and the article concludes with a summary of suggestions for future enhancements in deduplication.

Metadata
Title
Data deduplication techniques for efficient cloud storage management: a systematic review
Authors
Ravneet Kaur
Inderveer Chana
Jhilik Bhattacharya
Publication date
20-12-2017
Publisher
Springer US
Published in
The Journal of Supercomputing / Issue 5/2018
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-017-2210-8
