Published in: The Journal of Supercomputing 10/2016

01.10.2016

Optimization strategy of Hadoop small file storage for big data in healthcare

Authors: Hui He, Zhonghui Du, Weizhe Zhang, Allen Chen

Abstract

With the arrival of the "big data" era, data processing platforms such as Hadoop have emerged at the right moment. However, Hadoop's storage layer, the Hadoop Distributed File System (HDFS), has a serious weakness in storing large numbers of small files: they increase the load on the entire cluster and reduce efficiency. Yet datasets such as genomic and clinical data, which enable researchers to perform analytics in healthcare, are all stored as small files. The usual remedy is to merge the small files and store the resulting large file. Previous methods, however, have not taken the size distribution of the files into account and so have not further improved the effect of merging. This article proposes a method for merging small files based on data block balance. It optimizes the size distribution of the merged large files and effectively reduces the number of HDFS data blocks, which lowers the memory overhead on the cluster's master nodes and reduces load, so that data processing runs efficiently.
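
The abstract does not spell out the merging algorithm, so the following is only an illustrative sketch of what size-aware merging can look like, not the authors' method. It assumes a first-fit decreasing packing heuristic, and the function name `merge_plan` and the 128 MB block size are hypothetical choices made for the example.

```python
from typing import Dict, List

# Assumed HDFS block size in bytes; a hypothetical setting for this sketch.
BLOCK_SIZE = 128 * 1024 * 1024


def merge_plan(file_sizes: Dict[str, int], block_size: int = BLOCK_SIZE) -> List[List[str]]:
    """Group small files so that each merged file fits in one block.

    First-fit decreasing: take files from largest to smallest and place
    each one in the first group that still has room, opening a new group
    only when none fits.
    """
    groups: List[List[str]] = []
    free: List[int] = []  # remaining bytes in each group
    ordered = sorted(file_sizes.items(), key=lambda kv: kv[1], reverse=True)
    for name, size in ordered:
        if size > block_size:
            raise ValueError(f"{name} exceeds one block and is not a small file")
        for i, room in enumerate(free):
            if size <= room:
                groups[i].append(name)
                free[i] -= size
                break
        else:
            # No existing group has room, so start a new merged file.
            groups.append([name])
            free.append(block_size - size)
    return groups


if __name__ == "__main__":
    sizes = {"a": 60, "b": 50, "c": 40, "d": 30, "e": 20}
    print(merge_plan(sizes, block_size=100))
```

In this sketch each returned group would be written out as one merged file occupying at most one block. Reading an individual file back would additionally need an index from file name to offset and length, which is not covered here.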

Metadata
Title
Optimization strategy of Hadoop small file storage for big data in healthcare
Authors
Hui He
Zhonghui Du
Weizhe Zhang
Allen Chen
Publication date
01.10.2016
Publisher
Springer US
Published in
The Journal of Supercomputing / Issue 10/2016
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-015-1462-4