A formal framework to analyze cost and performance in Map-Reduce based applications
Introduction
During the last decade, most efforts in scientific applications were focused on supercomputers, grids, and commodity clusters [1], [2]. Scientific applications are generally computation and data intensive, which means that they require long execution times. However, the execution time can be reduced by exploiting parallelism in CPU and I/O, especially in cluster systems. Examples of such applications can be found in many fields, like MRI scan data processing [3], astronomy [4], and earthquake modeling [5]. Due to their high cost-effectiveness, commodity clusters have been the main resource for high-performance computing (HPC). As an example, in the most recent survey of the 500 fastest computers in the world [6], 81.4% are clusters.
Scientific high-performance applications have reached a turning point where computing power is no longer the most important concern [7], [8]. Processing power has been doubling every 18 months for many years, enabling HPC applications to create, analyze and use increasingly large volumes of data. The impact of these trends on system architectures has caused a corresponding increase in data-set sizes, which have to be stored and accessed efficiently. Therefore, programming applications that combine the exploitation of CPU parallelism with massive data operations is hard, and a problem that has not yet been resolved. Today, however, the emphasis is shifting from optimizing only the computing system to optimizing other systems, such as the storage and networking systems.
In late 2007, a new dimension was added to the scientific community: cloud computing. Although the concept of cloud computing emerged in 2007, no precise definition for it yet exists [1]. Cloud computing can be seen as a paradigm that provides access to a flexible and on-demand computing infrastructure by allowing the user to rent a set of virtual machines to solve a computational problem.
In order to make good use of resources, both clusters and cloud systems need a flexible and scalable method for developing and executing high-performance scientific applications. A common choice for this role is the Map-Reduce scheme [9], a programming model for processing large data-sets in distributed systems. Although the Map-Reduce model was originally developed by Google, it is currently used by major companies like Amazon, Facebook, Baidu, Yahoo!, Last.fm, New York Times and AOL, and has become the de facto standard for large-scale data processing [10], [11]. From now on, the term scientific cluster is used to refer to internal data centers not made available to the general public, and public cloud refers to services that can be outsourced by paying for each deployed virtual machine (VM) on a per-unit-of-time basis.
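As a reminder of the scheme the rest of the paper builds on, the following minimal Python sketch (not tied to Hadoop or any other particular Map-Reduce implementation) shows the canonical word-count example: the map phase emits (key, value) pairs from each input split, and the reduce phase aggregates the values emitted for each distinct key.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in an input split."""
    for word in document.split():
        yield (word, 1)

def reduce_phase(pairs):
    """Reduce: sum the counts emitted for each distinct key."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

# Two input splits, processed independently by the map phase.
splits = ["map reduce", "map map"]
pairs = [pair for doc in splits for pair in map_phase(doc)]
result = reduce_phase(pairs)
# result == {"map": 3, "reduce": 1}
```

In a real deployment the map tasks run in parallel on separate nodes and a shuffle step groups the pairs by key before reduction; the sketch collapses those stages into plain Python data structures.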
This paper describes a model that splits the cluster configuration into four independent systems: CPU, storage, memory and network. Basically, this model takes as input a list of components for each system and produces, as output, a performance/cost ratio for each configuration. Accordingly, a set of detailed parameters describing the underlying characteristics (including cost) of each component must be specified. The model then uses the information provided by the user to automatically generate a combination of system models that simulates the underlying hardware features. This model has been implemented in a simulated environment using the SIMCAN simulation platform [12], an open-source project for modeling and simulating distributed systems and applications. This platform also allows users to develop parallel and distributed applications, which makes it a powerful tool for calculating the performance of scientific computations using different hardware configurations [13], [14], [15], [16].
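To illustrate the kind of input/output relation the model defines, the sketch below enumerates configurations drawn from per-system component lists and scores each by its performance/cost ratio. The component names, throughput figures, prices and the bottleneck-based performance function are hypothetical placeholders standing in for a simulation run; they are not taken from SIMCAN or from the paper.

```python
from itertools import product

# Hypothetical catalogues: (name, throughput, price). Memory is omitted
# here for brevity; the paper's model covers four systems.
cpus     = [("cpu_a", 100.0,  500.0), ("cpu_b", 180.0, 1100.0)]
disks    = [("hdd",    60.0,  100.0), ("ssd",   250.0,  400.0)]
networks = [("1gbe",  120.0,  150.0), ("10gbe", 900.0,  600.0)]

def performance(cpu, disk, net):
    """Stand-in for a simulation run: the bottleneck throughput."""
    return min(cpu[1], disk[1], net[1])

def cost(cpu, disk, net):
    """Total hardware price of one configuration."""
    return cpu[2] + disk[2] + net[2]

# Exhaustively score every configuration by performance/cost.
ratios = {
    (c[0], d[0], n[0]): performance(c, d, n) / cost(c, d, n)
    for c, d, n in product(cpus, disks, networks)
}
best = max(ratios, key=ratios.get)
# best == ("cpu_a", "ssd", "1gbe"): the fastest CPU or network does not
# pay off when another system is the bottleneck.
```

The point of the sketch is the shape of the search, not the numbers: in the actual model the `performance` stand-in is replaced by a SIMCAN simulation of the Map-Reduce application on each candidate configuration.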
The results obtained from applying our methodology, supported by the SIMCAN simulation platform to optimize the balance and performance of scientific systems, are reported in this paper. The main contributions of this work can be summarized as follows:
- In contrast with existing work, which mainly focuses on optimizations and performance evaluation of Map-Reduce based applications [17], [18], our proposed model focuses on the relationship between the hardware cost, the configuration of the system, the overall application performance, and the user requirements. Accordingly, our work presents the design and development of an accurate model for optimizing the trade-offs between cost and performance of scientific clusters based on the requirements defined by the user.
- In order to show the accuracy of the proposed model, a large set of validation experiments has been carried out in real environments.
- A complete performance evaluation model makes it possible to compare different hardware configurations. Thus, this model computes the best choice among the possible hardware alternatives for the specified requirements.
This paper represents an extended and enhanced version of previous work on the optimization of the trade-offs between cost and performance in cluster systems [19]. We have included a detailed description of the simulation platform used to perform the experiments described in this paper. We have extended the evaluation process of our previous work with new experiments. Additionally, several charts that show the performance of the simulated scenarios have been included; these charts give the reader relevant information about the cost of obtaining the expected results.
The rest of the paper is structured as follows. Section 2 describes the motivation for this work. Section 3 presents an overview of SIMCAN. Section 4 describes a model to optimize the trade-offs between cost and performance in cluster systems, which is divided into three parts: modeling of the system architecture, modeling of the Map-Reduce application, and the architecture of the complete model. Section 5 shows both validation and performance experiments using the proposed model. Section 6 presents related work. Finally, Section 7 presents our conclusions and some directions for future work.
Section snippets
Motivation
Currently, users can choose either to execute their applications in a scientific cluster or to launch them in a public cloud. Public cloud environments are probably the easiest and cheapest option, because users pay only for the requested resources (VMs and storage) and then upload their data to launch applications via the Internet. Consequently, in the computing research community, some questions arise: Can we save money/time by using public cloud computing systems instead of buying
Overview of the SIMCAN simulation platform
The SIMCAN project was born in 2008 with the purpose of creating a scalable, flexible and efficient simulation platform for modeling and simulating distributed systems [12]. SIMCAN has been implemented in C++ using both INET and OMNeT++ frameworks.
Currently, simulation techniques are widely used by the research community to study and analyze performance in distributed systems. One of the main issues with simulation is that each researcher has their own objectives and needs; therefore, simulators are
Optimizing the trade-offs between cost and performance in cluster systems
This section provides a detailed description of the model for analyzing the cost and performance of Map-Reduce applications in cluster systems. This model is divided into three parts: modeling of the scientific cluster architecture, modeling of the Map-Reduce application, and the architecture of the complete model.
Validation and evaluation
In this section we present both validation and evaluation experiments. Note that a thorough validation of the SIMCAN simulation platform has already been carried out [13]. In this work, several experiments have been performed to show the level of accuracy of the simulation models. Basically, these experiments consist of comparing the behavior of executing a given application in a real system with the behavior observed when simulating the application on an analogous simulated system. These
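Such a comparison can be summarized by a simple accuracy metric. The sketch below, with made-up timing values rather than measurements from the paper, computes the mean relative error between real and simulated execution times:

```python
# Illustrative accuracy check: relative error between measured and
# simulated execution times (values are invented for the example).
real_times      = [120.0, 245.0, 480.0]   # seconds, real cluster
simulated_times = [116.0, 252.0, 470.0]   # seconds, simulated runs

# Per-experiment relative error, then the mean as a percentage.
errors = [abs(s - r) / r for r, s in zip(real_times, simulated_times)]
mean_error_pct = 100.0 * sum(errors) / len(errors)
# mean_error_pct is roughly 2.8% for these sample values
```

A low mean relative error across a representative workload set is what justifies using the simulated configurations as stand-ins for real hardware in the cost/performance search.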
Related work
Myriad studies analyzing the performance of high-performance computing systems can be found in the literature. Depending on its main objective, the corresponding performance analysis focuses on the hardware system, on the application to be executed, or on both.
For example, the performance of the Columbia cluster has been evaluated [26]. The results obtained from this study demonstrate that several features concerning configurations have a direct impact on the overall system performance for the
Conclusions and future work
In this paper we have proposed a model to automatically compute a performance/cost ratio by simulating scientific clusters under different configurations. This model is platform-independent and does not require any investment before purchasing the final system. We have presented both validation and evaluation tests. Initially, the accuracy of our proposed model for simulating Map-Reduce applications in a Hadoop cluster is presented. Also, a set of experiments has been
Acknowledgements
Research partially supported by the Spanish Projects TESIS and ESTuDIo (TIN2009-14312-C02 and TIN2012-36812-C02). We would also like to thank Cesar Andrés for his assistance and suggesting improvements in the performed experiments.
References (40)
- et al., A scalable parallel Poisson solver for three-dimensional problems with one periodic direction, Computers and Fluids (2010)
- et al., SIMCAN: a flexible, scalable and expandable simulation platform for modelling and simulating distributed architectures and applications, Simulation Modelling Practice and Theory (2012)
- et al., Optimizing the trade-offs between cost and performance in scientific computing, Procedia Computer Science (2012)
- et al., Composable cost estimation and monitoring for computational applications in cloud computing environments
- I. Foster, Y. Zhao, I. Raicu, S. Lu, Cloud computing and grid computing 360-degree compared, in: Proc. Grid Computing...
- et al., Learning to decode cognitive states from brain images, Machine Learning (2004)
- E. Deelman, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, S. Patil, M.-H. Su, K. Vahi, M. Livny, Pegasus: mapping...
- et al., High resolution forward and inverse earthquake modeling on terascale computers
- et al., Top500 Supercomputer Sites (2012)
- et al., Horizon: efficient deadline-driven disk I/O management for distributed storage systems
- MRAP: a novel MapReduce-based framework to support HPC analytics applications with access patterns
- MapReduce: simplified data processing on large clusters, Communications of the ACM
- MapReduce for data intensive scientific analyses
- The SIMCAN Simulation Platform
- New techniques for simulating high performance MPI applications on large storage networks, Journal of Supercomputing
- Using architectural simulation models to aid the design of data intensive application
- Analyzing scalable high-performance I/O architectures
- Map-reduce-merge: simplified relational data processing on large clusters
Alberto Núñez received the M.Sc. degree in Computer Science in 2005 at the Universidad Carlos III de Madrid, and the Ph.D. degree in Computer Science in 2011 at the same university. He is currently Teaching Assistant at the Universidad Complutense de Madrid, teaching Distributed Systems, Discrete Mathematics and Data Bases. He won the IBM Ph.D. Fellowship award in 2009. His research interests are focused on formal testing and high performance computing architectures and applications, especially on how to perform models and simulations.
Mercedes G. Merayo received her Ph.D. in Computer Science from Universidad Complutense de Madrid, Spain, in 2009. She holds an Associate Professor position in the Computer Systems and Computation Department at the same University. She has published more than 40 papers in refereed journals and international venues. Her research interests are formal methods in general, with a focus on probabilistic/timed/stochastic extensions in formal testing.