Improving the performance of Apache Hadoop on pervasive environments through context-aware scheduling

W. Cassales, Guilherme; Schwertner Charão, Andrea; Kirsch-Pinheiro, Manuele; Souveyet, Carine; Steffenel, Luiz-Angelo

doi:10.1007/s12652-016-0361-8

Original Research
Published: 09 March 2016

Volume 7, pages 333–345, (2016)

Journal of Ambient Intelligence and Humanized Computing

Guilherme W. Cassales¹, Andrea Schwertner Charão¹, Manuele Kirsch-Pinheiro², Carine Souveyet² & Luiz-Angelo Steffenel³

Abstract

This article proposes to improve Apache Hadoop scheduling through a context-aware approach. Apache Hadoop is the most popular implementation of the MapReduce paradigm for distributed computing, but its design does not adapt automatically to computing nodes’ context and capabilities. By introducing context-awareness into Hadoop, we intend to dynamically adapt its scheduling to the execution environment. This is a necessary feature in the context of pervasive grids, which are heterogeneous, dynamic and shared environments. The solution has been incorporated into Hadoop and assessed through controlled experiments. The experiments demonstrate that context-awareness provides comparative performance gains, especially when some of the resources disappear during execution.

Notes

http://aws.amazon.com/elasticmapreduce/

Abbreviations

API: Application programming interface
DHT: Distributed hash table
FIFO: First in, first out
HDFS: Hadoop distributed file system
P2P: Peer-to-Peer
PER-MARE: Pervasive map-reduce project
SLA: Service-level agreement
VM: Virtual machine
YARN: Yet another resource negotiator

References

Apache, Apache Hadoop, 2014. http://hadoop.apache.org/docs/r2.6.0/index.html. Last access: November 2014
Assuncao MD, Netto MAS, Koch F, Bianchi S (2012) Context-aware job scheduling for cloud computing environments. In: IEEE Fifth International Conference on Utility and Cloud Computing (UCC). 2012. pp 255–262. doi:10.1109/UCC.2012.33
Baldauf M, Dustdar S, Rosenberg F (2007) A survey on context-aware systems. Int J Ad Hoc Ubiquitous Comput 2(4):263–277
Cassales GW, Charao AS, Pinheiro MK, Souveyet C, Steffenel LA (2014) Bringing Context to Apache Hadoop. In: 8th International Conference on Mobile Ubiquitous Computing, Rome, Italy
Cassales GW, Charao AS, Kirsch Pinheiro M, Souveyet C, Steffenel LA (2015) Context-aware scheduling for apache hadoop over pervasive environments. Procedia Comp Sci 52:202–209. The 6th International Conference on Ambient Systems, Networks and Technologies (ANT-2015), the 5th International Conference on Sustainable Energy Information Technology (SEIT-2015). doi:10.1016/j.procs.2015.05.058. http://www.sciencedirect.com/science/article/pii/S1877050915008583
Cavallo M, Cusma L, Modica GD, Polito C, Tomarchio O (2015) A scheduling strategy to run Hadoop jobs on geodistributed data. In: 3rd Workshop on CLoud for IoT (CLIoT 2015), in conjunction with the European Conference on Service-Oriented and Cloud Computing (ESOCC 2015)
Chen Q, Zhang D, Guo M, Deng Q, Guo S (2010) SAMR: a self-adaptive MapReduce scheduling algorithm in heterogeneous environment, In: Proceedings of the 2010 10th IEEE International Conference on Computer and Information Technology. CIT ’10 (IEEE Computer Society, Washington, DC, USA, 2010), pp 2736–2743 (978-0-7695-4108-2)
Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113
Engel T, Charão A, Kirsch-Pinheiro M, Steffenel LA (2015) Performance improvement of data mining in Weka through multi-core and GPU acceleration: opportunities and pitfalls. J Ambient Intell Human Comput 6(4):377–390. doi:10.1007/s12652-015-0292-9
Grid’5000, Grid 5000, 2013. https://www.grid5000.fr/, Last access: July 2014
Hamilton, J.: Hadoop Wins TeraSort, 2008. http://perspectives.mvdirona.com/2008/07/hadoop-wins-terasort/. Last access: September 2015
Hofmann P, Woods D (2010) Cloud computing: the limits of public clouds for business applications. IEEE Internet Comput 14(6):90–93. doi:10.1109/MIC.2010.136
Huang S, Huang J, Dai J, Xie T, Huang B: The HiBench benchmark suite: Characterization of the MapReduce-based data analysis. In: 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW), 2010, pp 41–51. doi:10.1109/ICDEW.2010.5452747
Hunt P, Konar M, Junqueira FP, Reed B, ZooKeeper: wait-free Coordination for Internet-scale Systems. In: Proceedings of the USENIX Annual Technical Conference (USENIX Association, Boston, MA, USA, 2010), pp 11. http://dl.acm.org/citation.cfm?id=1855840.1855851
Isard M, Prabhakaran V, Currey J, Wieder U, Talwar K, Goldberg A (2009) Quincy: fair scheduling for distributed computing clusters, in Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles. SOSP ’09 (ACM, New York, NY, USA, 2009), pp 261–276 (978-1-60558-752-3)
Kumar KA, Konishetty VK, Voruganti K, Rao GVP (2012) CASH: context aware scheduler for Hadoop. In: Proceedings of the International Conference on Advances in Computing, Communications and Informatics. ICACCI ’12, New York, NY, USA, 2012, pp 52–61 (978-1-4503-1196-0)
Li J, Wang Q, Jayasinghe D, Park J, Zhu T, Pu C (2013) Performance overhead among three hypervisors: an experimental study using Hadoop benchmarks. In: 2013 IEEE International Congress on Big Data (BigData Congress) 2013, pp 9–16. 2013, doi:10.1109/BigData.Congress..11
Maamar Z, Benslimane D, Narendra NC (2006) What can context do for web services? Commun ACM 49(12):98–103
Marozzo F, Talia D, Trunfio P (2012) P2p-mapreduce: parallel data processing in dynamic cloud environments. J Comput Syst Sci 78(5):1382–1402
Maurer M, Brandic I, Sakellariou R (2012) Self-adaptive and resource-efficient SLA enactment for cloud computing infrastructures. In: 2012 IEEE 5th International Conference on cloud computing (CLOUD), 2012, pp 368–375. doi:10.1109/CLOUD.2012.55
Najar S, Kirsch Pinheiro M, Souveyet C (2015) Service discovery and prediction on pervasive information system. J Ambient Intell Human Comput 6(4):407–423. doi:10.1007/s12652-015-0288-5
Nascimento AP, Boeres C, Rebello VEF (2008) Dynamic self-scheduling for parallel applications with task dependencies. In: Proceedings of the 6th International Workshop on MGC. MGC ’08, New York, NY, USA, 2008, pp 1–116 (978-1-60558-365-5)
Oracle, Overview of Java SE Monitoring and Management, 2014. http://docs.oracle.com/javase/7/docs/technotes/guides/management/overview.html, Last access: July 2014
Parashar M, Pierson JM (2010) Pervasive grids: challenges and opportunities. In: Li K, Hsu C, Yang L, Dongarra J, Zima H (eds) Handbook of Research on Scalable Computing Technologies. (IGI Global, 2010), pp 14–30. doi:10.4018/978-1-60566-661-7.ch002 (978-1-60566-661-7)
Ramakrishnan A, Preuveneers D, Berbers Y (2014) Enabling self-learning in dynamic and open IoT environments. In: Shakshuki E, Yasar A (eds) The 5th International Conference on Ambient Systems, Networks and Technologies (ANT-2014), the 4th International Conference on Sustainable Energy Information Technology (SEIT-2014), vol. 32, 2014, pp 207–214. doi:10.1016/j.procs.2014.05.416
Rasooli A, Down DG (2012) Coshh: a classification and optimization based scheduler for heterogeneous hadoop systems. In: Proceedings of the 2012 SC Companion: High Performance Computing, Networking Storage and Analysis. SCC ’12 (IEEE Computer Society, Washington, DC, USA, 2012), pp. 1284–1291 (978-0-7695-4956-9)
Sandholm T, Lai K (2010) Dynamic Proportional Share Scheduling in Hadoop. In: Proceedings of the 15th International Conference on Job Scheduling Strategies for Parallel Processing. JSSPP’10, Berlin, Heidelberg, 2010, pp 110–131. (3-642-16504-4, 978-3-642-16504-7)
Schadt EE, Linderman MD, Sorenson J, Lee L, Nolan GP (2010) Computational solutions to large-scale data management and analysis. Nat Rev Genet 11(9):647–657. doi:10.1038/nrg2857
Steffenel LA, Kirsch Pinheiro M (2015) Leveraging data intensive applications on a pervasive computing platform: The case of mapreduce. Procedia Comp Sci 52:1034–1039 (2015). The 6th International Conference on Ambient Systems, Networks and Technologies (ANT-2015), the 5th International Conference on Sustainable Energy Information Technology (SEIT-2015). doi:10.1016/j.procs.2015.05.102. http://www.sciencedirect.com/science/article/pii/S1877050915009023
Steffenel LA, Flauzac O, Charão AS, Barcelos PP, Stein B, Nesmachnow S, Kirsch Pinheiro M, Diaz D (2013) PER-MARE: adaptive deployment of MapReduce over pervasive grids. In: Proceedings of the 2013 Eighth International Conference on P2P, Parallel, Grid, Cloud and Internet Computing. 3PGCIC ’13 (IEEE Computer Society, Washington, DC, USA, 2013), pp 17–24 (978-0-7695-5094-7)
STIC-AmSud, PER-MARE project, 2014. http://cosy.univ-reims.fr/PER-MARE, Last access: July 2014
Tian C, Zhou H, He Y, Zha L (2009) A dynamic MapReduce scheduler for heterogeneous workloads. In: Proceedings of the 2009 Eighth International Conference on Grid and Cooperative Computing. GCC ’09 (IEEE Computer Society, Washington, DC, USA, 2009), pp 218–224 (978-0-7695-3766-5)
Xie J, Ruan X, Ding Z, Tian Y, Majors J, Manzanares A, Yin S, Qin X (2010) Improving MapReduce performance through data placement in heterogeneous Hadoop clusters, in Parallel and Distributed Processing, Workshops and Phd Forum (IPDPSW)
Zaharia M, Konwinski A, Joseph AD, Katz R, Stoica I (2008) Improving MapReduce performance in heterogeneous environments, in Proceedings of the 8th USENIX conference on Operating systems design and implementation. OSDI’08 (USENIX Association, Berkeley, CA, USA, 2008), pp 29–42

Acknowledgments

The authors would like to thank their partners in the PER-MARE project STIC-AmSud (2014) and acknowledge the financial support given to this research by the CAPES/MAEE/ANII STIC-AmSud collaboration program (project number 13STIC07).

Author information

Authors and Affiliations

Laboratório de Sistemas de Computação, Universidade Federal de Santa Maria, Santa Maria, Brazil
Guilherme W. Cassales & Andrea Schwertner Charão
Centre de Recherche en Informatique, Université Paris 1 Panthéon-Sorbonne, Paris, France
Manuele Kirsch-Pinheiro & Carine Souveyet
Laboratoire CReSTIC—Équipe SysCom, Université de Reims Champagne-Ardenne, Reims, France
Luiz-Angelo Steffenel

Corresponding author

Correspondence to Luiz-Angelo Steffenel.

About this article

Cite this article

W. Cassales, G., Schwertner Charão, A., Kirsch-Pinheiro, M. et al. Improving the performance of Apache Hadoop on pervasive environments through context-aware scheduling. J Ambient Intell Human Comput 7, 333–345 (2016). https://doi.org/10.1007/s12652-016-0361-8

Received: 05 October 2015
Accepted: 10 January 2016
Published: 09 March 2016
Issue Date: June 2016
DOI: https://doi.org/10.1007/s12652-016-0361-8
