A formal framework to analyze cost and performance in Map-Reduce based applications

https://doi.org/10.1016/j.jocs.2013.04.003

Abstract

Satisfying the global throughput targets of scientific applications is an important challenge in high performance computing (HPC) systems. The main difficulty lies in the large number of parameters that have a significant impact on overall system performance. These include the number of storage servers, the features of the communication links, and the number of CPU cores per node, among many others.

In this paper we present a model that computes a performance/cost ratio for different hardware configurations, focusing on scientific computing. The main goal of this approach is to balance the trade-off between cost and performance using different combinations of components for building the entire system. The main advantage of our approach is that the different configurations are simulated on a simulation platform. Therefore, no investment needs to be made until the different alternatives have been computed and the best solutions suggested. In order to achieve this goal, both the system architecture and the Map-Reduce applications are modeled. The proposed model has been evaluated by building complex systems in a simulated environment using the SIMCAN simulation platform.

Introduction

During the last decade, most efforts in scientific applications have focused on supercomputers, grids, and commodity clusters [1], [2]. Generally, scientific applications are computation and data intensive, which means that they require long execution times. However, the execution time can be reduced by exploiting parallelism in CPU and I/O, especially in cluster systems. Examples of such applications can be found in many fields, like MRI scan data [3], astronomy [4], and earthquake modeling [5]. Due to their high cost-effectiveness, commodity clusters have been the main resource for high performance computing (HPC). As an example, in the most recent survey of the 500 fastest computers in the world [6], 81.4% are clusters.

Scientific high-performance applications have reached a turning point where computing power is no longer the most important concern [7], [8]. Processing power has been doubling every 18 months for many years, enabling HPC applications to create, analyze and use increasingly large volumes of data. The impact of these trends on system architectures has caused a subsequent increase in data-set sizes, which have to be stored and accessed efficiently. Therefore, programming applications that combine the exploitation of CPU parallelism with massive data operations is hard, and this problem has not yet been resolved. Today, however, the emphasis is shifting from optimizing only the computing system to optimizing other systems as well, such as the storage and networking systems.

In late 2007, a new dimension was added to the scientific community: cloud computing. Although the concept of cloud computing emerged in 2007, there is not yet an accurate definition for it [1]. Cloud computing can be seen as a paradigm that provides access to a flexible and on-demand computing infrastructure by allowing the user to rent a set of virtual machines to solve a computational problem.

In order to make good use of resources, both cluster and cloud systems need a flexible and scalable method for developing and executing high performance scientific applications. This common piece is the Map-Reduce scheme [9], a programming model for processing large data-sets in distributed systems. Although the Map-Reduce model was originally developed by Google, it is currently used by important companies like Amazon, Facebook, Baidu, Yahoo!, Last.fm, the New York Times and AOL, becoming the de facto standard for large-scale data processing [10], [11]. From now on, the term scientific cluster is used to refer to internal data centers not made available to the general public, while public cloud refers to services that can be outsourced by paying for each deployed virtual machine (VM) on a per-unit-of-time basis.
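To make the Map-Reduce programming model concrete, the following minimal Python sketch (not taken from the paper) reproduces the canonical word-count example: the map phase emits (word, 1) pairs, the framework groups them by key, and the reduce phase aggregates the counts. The in-memory shuffle and sequential execution are simplifications of what a real runtime such as Hadoop performs in a distributed fashion.

    from collections import defaultdict

    def map_phase(document):
        # Map: emit a (word, 1) pair for every word in the input split.
        for word in document.split():
            yield word, 1

    def shuffle(pairs):
        # Shuffle: group intermediate values by key (done by the framework).
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(key, values):
        # Reduce: aggregate all values emitted for the same key.
        return key, sum(values)

    documents = ["map reduce on large data", "large data on clusters"]
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
    print(counts)  # {'map': 1, 'reduce': 1, 'on': 2, 'large': 2, 'data': 2, 'clusters': 1}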

This paper describes a model that splits the cluster configuration into four independent systems: CPU, storage, memory and network. Basically, this model takes as input a list of components for each system and produces, as output, a performance/cost ratio for each configuration. Accordingly, a set of detailed parameters related to the underlying characteristics (including cost) of each component must be specified. Thus, the model uses the information provided as input by the user to automatically generate a combination of system models that simulates the underlying hardware features. This model has been implemented in a simulated environment using the SIMCAN simulation platform [12], an open-source project for modeling and simulating distributed systems and applications. This platform also allows users to develop parallel and distributed applications, which makes it a powerful tool for calculating the performance of scientific computations under different hardware configurations [13], [14], [15], [16].
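As an illustration of this input/output relationship, the sketch below enumerates hypothetical component catalogues for the four systems and ranks the resulting configurations by their performance/cost ratio. All component names, prices and the toy performance estimator are assumptions made for the example; in the actual model the performance figure for each configuration comes from a SIMCAN simulation of the Map-Reduce application rather than an analytical formula.

    from itertools import product

    # Hypothetical component catalogues; names, prices and throughput
    # figures are illustrative, not the paper's actual parameters.
    cpus     = [{"name": "8-core",  "cost": 900,  "gflops": 80},
                {"name": "16-core", "cost": 1600, "gflops": 150}]
    memories = [{"name": "16GB",    "cost": 150,  "gb": 16},
                {"name": "64GB",    "cost": 500,  "gb": 64}]
    disks    = [{"name": "HDD",     "cost": 120,  "mb_s": 150},
                {"name": "SSD",     "cost": 400,  "mb_s": 500}]
    networks = [{"name": "1GbE",    "cost": 200,  "mb_s": 110},
                {"name": "10GbE",   "cost": 1000, "mb_s": 1100}]

    def estimate_performance(cpu, mem, disk, net):
        # Toy stand-in for a simulation run: assume the job is bound by the
        # slowest I/O path plus CPU and memory terms (purely illustrative).
        io = min(disk["mb_s"], net["mb_s"])
        return io * 0.7 + cpu["gflops"] * 2.0 + mem["gb"] * 0.5

    ranked = []
    for cpu, mem, disk, net in product(cpus, memories, disks, networks):
        cost = cpu["cost"] + mem["cost"] + disk["cost"] + net["cost"]
        perf = estimate_performance(cpu, mem, disk, net)  # simulated in the paper
        names = (cpu["name"], mem["name"], disk["name"], net["name"])
        ranked.append((perf / cost, perf, cost, names))

    # Best performance/cost ratio first.
    for ratio, perf, cost, names in sorted(ranked, reverse=True):
        print(f"{names}: performance={perf:.1f}, cost={cost}, ratio={ratio:.4f}")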

The results obtained from the application of our methodology are reported in this paper, supported by the SIMCAN simulation platform, to optimize the balance between cost and performance in scientific systems. The main contributions of this work can be summarized as follows:

  • In contrast with existing work, which mainly focuses on optimizations and performance evaluation of Map-Reduce based applications [17], [18], our proposed model focuses on the relationship between the hardware cost, the configuration of the system, the overall application performance, and the user requirements. Accordingly, our work presents the design and development of an accurate model for optimizing the trade-offs between cost and performance of scientific clusters based on the requirements defined by the user.

  • In order to show the accuracy of the proposed model, a large set of validation experiments has been carried out in real environments.

  • A complete performance evaluation model allows different hardware configurations to be compared. Thus, this model computes the best choice among the possible hardware alternatives for the specified requirements.

This paper is an extended and enhanced version of previous work on the optimization of the trade-offs between cost and performance in cluster systems [19]. We have included a detailed description of the simulation platform used to perform the experiments described in this paper. We have also extended the evaluation process of our previous work with new experiments. Additionally, several charts that show the performance of the simulated scenarios have been included. These charts provide the reader with relevant information about the cost of obtaining the expected results.

The rest of the paper is structured as follows. Section 2 describes the motivation for this work. Section 3 presents an overview of SIMCAN. Section 4 describes a model to optimize the trade-offs between cost and performance in cluster systems, which is divided into three parts: modeling of the system architecture, modeling of the Map-Reduce application, and the architecture of the complete model. Section 5 shows both validation and performance experiments using the proposed model. Section 6 presents related work. Finally, Section 7 presents our conclusions and some directions for future work.

Section snippets

Motivation

Currently, users can choose either to execute their applications in a scientific cluster or to launch them in a public cloud. Public cloud environments are probably the easiest and cheapest option because users pay only for the requested resources (VMs and storage) and then upload their data to launch applications via the Internet. Consequently, in the computing research community, some questions arise: Can we save money/time by using public cloud computing systems instead of buying…

Overview of the SIMCAN simulation platform

The SIMCAN project was born in 2008 with the purpose of creating a scalable, flexible and efficient simulation platform for modeling and simulating distributed systems [12]. SIMCAN has been implemented in C++ using both the INET and OMNeT++ frameworks.

Currently, simulation techniques are widely used by the research community to study and analyze the performance of distributed systems. One of the main issues of simulation is that each researcher has their own objectives and needs; therefore, simulators are…

Optimizing the trade-offs between cost and performance in cluster systems

This section provides a detailed description of the model for analyzing the cost and performance of Map-Reduce applications in cluster systems. This model is divided into three parts: modeling of the scientific cluster architecture, modeling of the Map-Reduce application, and the architecture of the complete model.

Validation and evaluation

In this section we present both validation and evaluation experiments. However, note that a thorough validation of the SIMCAN simulation platform has already been carried out [13]. In this work, several experiments have been performed to show the level of accuracy of the simulation models. Basically, these experiments consist of comparing the behavior of executing a given application in a real system with the behavior observed when simulating the application in an analogous simulated system. These…
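The comparison described above boils down to measuring the deviation between real and simulated execution times. A minimal sketch of that kind of accuracy check is shown below; the workload labels and timing values are placeholders, not the paper's actual measurements.

    # Compare measured execution times on a real cluster with the times
    # reported by the simulator. All numbers below are placeholders.
    real_times      = {"64MB": 41.2, "256MB": 158.7, "1GB": 640.3}   # seconds
    simulated_times = {"64MB": 39.8, "256MB": 165.1, "1GB": 612.9}   # seconds

    for workload, real in real_times.items():
        sim = simulated_times[workload]
        rel_error = abs(real - sim) / real * 100.0
        print(f"{workload}: real={real:.1f}s  simulated={sim:.1f}s  error={rel_error:.1f}%")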

Related work

Myriad studies analyzing the performance of high performance computing systems can be found in the literature. Depending on its main objective, the corresponding performance analysis focuses on the hardware system, on the application to be executed, or on both.

For example, the performance of the Columbia cluster has been evaluated [26]. The results obtained from this study demonstrate that several configuration features have a direct impact on the overall system performance for the…

Conclusions and future work

In this paper we have proposed a model to automatically compute a performance/cost ratio by simulating scientific clusters with different configurations. This model is platform-independent and does not require any investment before purchasing the final system. We have presented both validation and evaluation tests. Initially, the accuracy of our proposed model for simulating Map-Reduce applications in a Hadoop cluster is presented. Also, a set of experiments has been…

Acknowledgements

Research partially supported by the Spanish Projects TESIS and ESTuDIo (TIN2009-14312-C02 and TIN2012-36812-C02). We would also like to thank Cesar Andrés for his assistance and suggesting improvements in the performed experiments.

References (40)

  • S. Sehrish et al.

    MRAP: a novel MapReduce-based framework to support HPC analytics applications with access patterns

  • J. Dean et al.

    MapReduce: simplified data processing on large clusters

    Communications of the ACM

    (2008)
  • J. Ekanayake et al.

    MapReduce for data intensive scientific analyses

  • Amazon Elastic MapReduce, 2012,...
  • A. Núñez et al.

    The SIMCAN Simulation Platform

    (2012)
  • A. Núñez et al.

    New techniques for simulating high performance MPI applications on large storage networks

    Journal of Supercomputing

    (2010)
  • J. Fernández et al.

    Using architectural simulation models to aid the design of data intensive application

  • A. Núñez et al.

    Analyzing scalable high-performance I/O architectures

  • G. Wang, A. Butt, P. Pandey, K. Gupta, A simulation approach to evaluating design decisions in MapReduce setups, in:...
  • H.-c. Yang et al.

    Map-reduce-merge: simplified relational data processing on large clusters

Alberto Núñez received the M.Sc. degree in Computer Science in 2005 at the Universidad Carlos III de Madrid, and the Ph.D. degree in Computer Science in 2011 at the same university. He is currently a Teaching Assistant at the Universidad Complutense de Madrid, teaching Distributed Systems, Discrete Mathematics and Databases. He won the IBM Ph.D. Fellowship award in 2009. His research interests are focused on formal testing and high performance computing architectures and applications, especially on how to perform models and simulations.

Mercedes G. Merayo received her Ph.D. in Computer Science from the Universidad Complutense de Madrid, Spain, in 2009. She holds an Associate Professor position in the Computer Systems and Computation Department at the same university. She has published more than 40 papers in refereed journals and international venues. Her research interests are formal methods in general, with a focus on probabilistic/timed/stochastic extensions in formal testing.
