Published in: The Journal of Supercomputing 5/2018

20 December 2017

Data deduplication techniques for efficient cloud storage management: a systematic review

Authors: Ravneet Kaur, Inderveer Chana, Jhilik Bhattacharya

Abstract

The exponential growth of digital data in cloud storage systems is a critical issue, because a large amount of duplicate data places an extra load on these systems. Deduplication is an efficient technique that has gained attention in large-scale storage systems: it eliminates redundant data, improves storage utilization and reduces storage cost. This paper presents a broad systematic literature review of existing data deduplication techniques, along with existing taxonomies of deduplication techniques for cloud data storage. Furthermore, the paper investigates deduplication techniques for text and multimedia data, along with their corresponding taxonomies, as these data types pose different challenges for duplicate data detection. This work is useful for identifying deduplication techniques for text, image and video data. It also discusses existing challenges and significant research directions in deduplication, and the article concludes with a summary of suggestions for future enhancements in deduplication.
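
As a rough illustration of how deduplication eliminates redundant data, the sketch below stores each distinct chunk only once and keeps an ordered list of fingerprints for rebuilding the input. It is not a method taken from the paper: fixed-size chunking, SHA-256 fingerprints, and the names `deduplicate`, `restore`, `store` and `recipe` are all assumptions chosen for the example.

```python
import hashlib


def deduplicate(data: bytes, chunk_size: int = 4):
    """Split data into fixed-size chunks and keep one copy of each distinct chunk.

    Hypothetical helper for illustration only; real systems often use
    content-defined chunking and on-disk indexes instead.
    """
    store = {}   # fingerprint -> chunk bytes (each unique chunk stored once)
    recipe = []  # ordered fingerprints needed to rebuild the original data
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        fingerprint = hashlib.sha256(chunk).hexdigest()
        store.setdefault(fingerprint, chunk)  # ignore chunks already stored
        recipe.append(fingerprint)
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original byte string from the chunk store and the recipe."""
    return b"".join(store[fp] for fp in recipe)
```

For the input `b"ABCDABCDABCDWXYZ"` with 4-byte chunks, the recipe has four entries but the store holds only two chunks, which is where the storage saving comes from. The scheme only detects byte-identical chunks, so near-duplicate images or video would need the content-aware techniques the review surveys.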

Metadata
Title
Data deduplication techniques for efficient cloud storage management: a systematic review
Authors
Ravneet Kaur
Inderveer Chana
Jhilik Bhattacharya
Publication date
20 December 2017
Publisher
Springer US
Published in
The Journal of Supercomputing / Issue 5/2018
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-017-2210-8
