Published in: The Journal of Supercomputing 9/2015

01.09.2015

Optimizing the Hadoop MapReduce Framework with high-performance storage devices

Authors: Sangwhan Moon, Jaehwan Lee, Xiling Sun, Yang-suk Kee


Abstract

Solid-state drives (SSDs) are an attractive alternative to hard disk drives (HDDs) for accelerating the Hadoop MapReduce Framework. However, mismatches between SSD characteristics and today’s Hadoop framework impede indiscriminate SSD integration. This paper explores how to optimize a Hadoop MapReduce Framework with SSDs in terms of performance, cost, and energy consumption. It identifies extensible best practices that exploit SSD benefits within Hadoop when combined with high network bandwidth and increased parallel storage access. Our TeraSort benchmark results demonstrate that Hadoop currently does not sufficiently exploit SSD throughput; hence, using faster SSDs in Hadoop does not by itself enhance its performance. We show that SSDs presently deliver significant efficiency gains when they store intermediate Hadoop data, leaving HDDs for the Hadoop Distributed File System (HDFS). The proposed configuration is further optimized with the JVM reuse option and a more frequent heartbeat interval. Moreover, we examined the performance of a state-of-the-art Non-Volatile Memory Express (NVMe) SSD within the Hadoop MapReduce Framework. While HDFS read and write throughput increases with high-performance SSDs, improving overall system performance requires carefully balancing CPU, network, and storage capabilities at the system level.
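The storage split described in the abstract (intermediate MapReduce data on SSDs, HDFS blocks on HDDs, JVM reuse enabled) could be expressed as Hadoop site configuration along the following lines. This is a minimal sketch, not the authors' actual setup: the abstract names neither the Hadoop version nor the properties, so the Hadoop 1.x property names `mapred.local.dir`, `dfs.data.dir`, and `mapred.job.reuse.jvm.num.tasks` are assumptions, and the mount points and the helper `hadoop_site_xml` are hypothetical. The heartbeat setting is left out because the abstract does not identify the property it refers to.

```python
import xml.etree.ElementTree as ET


def hadoop_site_xml(properties):
    """Render a dict of property names and values as a Hadoop *-site.xml string."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical mount points; adjust to the actual cluster layout.
SSD_DIRS = ["/mnt/ssd0/mapred/local"]
HDD_DIRS = ["/mnt/hdd0/dfs/data", "/mnt/hdd1/dfs/data"]

# Assumed Hadoop 1.x property names (not taken from the paper).
mapred_site = {
    # Intermediate (map output / shuffle spill) data goes to the SSD.
    "mapred.local.dir": ",".join(SSD_DIRS),
    # -1 lets a task JVM be reused for an unlimited number of tasks of a job.
    "mapred.job.reuse.jvm.num.tasks": -1,
}
hdfs_site = {
    # HDFS block storage stays on the HDDs.
    "dfs.data.dir": ",".join(HDD_DIRS),
}

if __name__ == "__main__":
    print(hadoop_site_xml(mapred_site))
    print(hadoop_site_xml(hdfs_site))
```

Generating the two fragments from one script keeps the SSD and HDD directory lists in a single place, which makes it easier to vary the placement between benchmark runs.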

Footnotes
1
We repeated the experiments five times, since the variance of the results was small enough.

2
I/O utilization is defined as the percentage of CPU time during which I/O requests were issued [10].

References
1.
Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. In: OSDI’04: 6th symposium on operating system design and implementation, San Francisco
2.
Ghemawat S, Gobioff H, Leung S (2003) The Google file system. In: SOSP’03: 19th ACM symposium on operating systems principles
4.
Shvachko K, Kuang H, Radia S, Chansler R (2010) The Hadoop distributed file system. In: MSST’10: 26th IEEE symposium on massive storage systems and technologies
5.
Dell. Solid state drive vs. hard disk drive price and performance study. [White paper]
6.
Shafer J, Rixner S, Cox A (2010) The Hadoop distributed filesystem: balancing portability and performance. In: ISPASS’10: IEEE international symposium on performance analysis of systems and software
7.
Moon S, Lee J, Kee Y (2014) Introducing SSDs to Hadoop MapReduce Framework. In: IEEE Cloud’14: 7th IEEE international conference on cloud computing
9.
DFSIO program. Available in Hadoop source distribution: src/test/org/apache/hadoop/fs/TestDFSIO. Accessed 21 May 2015
14.
Cloud Computing, Intel Inc., Optimizing Hadoop deployments. [White paper]
15.
Intel Xeon Processor-Based Servers, Big data analytics, Intel Inc., Optimizing Hadoop Deployments. [White paper]
17.
White Tom (2012) Hadoop: the definitive guide. O’Reilly Media Inc, USA
20.
Sur S, Wang H, Huang J, Ouyang X, Panda D (2010) Can high-performance interconnects benefit Hadoop distributed file system. In: MASVDC’10: workshop on micro architectural support for virtualization, data center computing, and clouds in conjunction with MICRO’10
21.
Islam N, Rahman M, Jose J, Rajachandrasekar R, Wang H, Subramoni H, Murthy C, Panda D (2012) High performance RDMA-based design of HDFS over InfiniBand. In: SC ’12: the international conference on high performance computing, networking, storage and analysis
22.
Appuswamy R, Gkantsidis C, Narayanan D, Hodson O, Rowstron A (2013) Scale-up vs scale-out for Hadoop: time to rethink? In: ACM symposium on cloud computing, 2 October 2013
23.
Harter T, Borthakur D, Dong S, Aiyer A, Tang L, Arpaci-Dusseau A, Arpaci-Dusseau R (2014) Analysis of HDFS under HBase: a Facebook messages case study. In: FAST’14: 12th USENIX conference on file and storage technologies
24.
SanDisk, Increasing Hadoop performance with SanDisk solid state drives (SSDs). [White paper]
25.
Dai J, Huang J, Huang S, Huang B, Liu Y (2011) HiTune: dataflow-based performance analysis for big data cloud. In: Usenix ATC’11: USENIX annual technical conference
26.
Joshi S, Liaskovitis V (2012) Java garbage collection characteristics and tuning guidelines for Apache Hadoop TeraSort workload. [White paper]
27.
Chen Y, Ganapathi AS, Katz RH (2010) To compress or not to compress—compute vs. IO tradeoffs for MapReduce energy efficiency. Technical Report No. UCB/EECS-2010-36, University of California at Berkeley
Metadata
Title
Optimizing the Hadoop MapReduce Framework with high-performance storage devices
Authors
Sangwhan Moon
Jaehwan Lee
Xiling Sun
Yang-suk Kee
Publication date
01.09.2015
Publisher
Springer US
Published in
The Journal of Supercomputing / Issue 9/2015
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-015-1447-3
