Published in: Cluster Computing 1/2015

1 March 2015

De-Frag: an efficient scheme to improve deduplication performance via reducing data placement de-linearization

Authors: Yujuan Tan, Zhichao Yan, Dan Feng, Xubin He, Qiang Zou, Lei Yang


Abstract

Data deduplication has become a commodity in large-scale storage systems, especially in data backup and archival systems. However, because it removes redundant data, deduplication de-linearizes data placement and forces the chunks of the same data object to be split across multiple separate units. In our preliminary study, we found that this de-linearization compromises the spatial locality that some deduplication approaches rely on to improve data read performance, deduplication throughput, and deduplication efficiency; it therefore significantly degrades deduplication performance and makes those approaches less effective. In this paper, we first analyze the negative effect of data placement de-linearization on deduplication performance, and then propose an effective approach called De-Frag to reduce it. The key idea of De-Frag is to write some redundant data to disk rather than remove it. De-Frag quantifies the spatial locality of each chunk group by a spatial locality level (SPL for short) and writes the redundant chunks to disk when the SPL value is smaller than a preset value, thereby reducing the de-linearization of data placement and enhancing spatial locality. Our experimental results, driven by real-world datasets, show that De-Frag effectively enhances data spatial locality and improves deduplication throughput, deduplication efficiency, and data read performance, at the cost of slightly lower compression ratios.
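
The threshold rule described in the abstract can be illustrated with a minimal sketch. The abstract does not define how SPL is computed, so the definition used here is only an assumption made for illustration: the SPL of a duplicate chunk is taken to be the fraction of its chunk group that already resides in the same on-disk container. The names `plan_group`, `index`, and `threshold` are hypothetical and do not come from the paper.

```python
from collections import Counter


def plan_group(group, index, threshold):
    """Decide, per chunk, whether to 'dedupe' it or 'write' it to disk.

    group:     list of chunk fingerprints in stream order (one chunk group)
    index:     dict mapping fingerprint -> id of the container already holding it
    threshold: preset SPL value; duplicates whose SPL is below it are rewritten
    """
    # Count how many chunks of this group each existing container holds.
    hits = Counter(index[fp] for fp in group if fp in index)

    decisions = []
    for fp in group:
        if fp not in index:
            # New data is always written.
            decisions.append((fp, "write"))
            continue
        # ASSUMED SPL definition: share of the group stored in this container.
        spl = hits[index[fp]] / len(group)
        # Low locality: keep a redundant copy instead of removing it.
        decisions.append((fp, "dedupe" if spl >= threshold else "write"))
    return decisions


# Three of four chunks sit in container 1; the fourth is alone in container 7.
example = plan_group(["a", "b", "c", "d"], {"a": 1, "b": 1, "c": 1, "d": 7}, 0.5)
print(example)
```

Under this assumed definition, chunks "a", "b", and "c" have an SPL of 0.75 and are deduplicated, while "d" has an SPL of 0.25 and is written again next to its neighbours. That redundant write is the source of the slightly lower compression ratio the abstract mentions, traded for a more linear data placement.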


Metadata
Title
De-Frag: an efficient scheme to improve deduplication performance via reducing data placement de-linearization
Authors
Yujuan Tan
Zhichao Yan
Dan Feng
Xubin He
Qiang Zou
Lei Yang
Publication date
1 March 2015
Publisher
Springer US
Published in
Cluster Computing / Issue 1/2015
Print ISSN: 1386-7857
Electronic ISSN: 1573-7543
DOI
https://doi.org/10.1007/s10586-014-0397-5
