A framework for building hypercubes using MapReduce

https://doi.org/10.1016/j.cpc.2014.02.010

Abstract

The European Space Agency’s Gaia mission will create the largest and most precise three-dimensional chart of our galaxy (the Milky Way), by providing unprecedented position, parallax, proper motion, and radial velocity measurements for about one billion stars. The resulting catalog will be made available to the scientific community and will be analyzed in many different ways, including the production of a variety of statistics. The latter will often entail the generation of multidimensional histograms and hypercubes as part of the precomputed statistics for each data release, or for scientific analysis involving either the final data products or the raw data coming from the satellite instruments.

In this paper we present and analyze a generic framework that allows the hypercube generation to be easily done within a MapReduce infrastructure, providing all the advantages of the new Big Data analysis paradigm but without dealing with any specific interface to the lower level distributed system implementation (Hadoop). Furthermore, we show how executing the framework for different data storage model configurations (i.e. row or column oriented) and compression techniques can considerably improve the response time of this type of workload for the currently available simulated data of the mission.

In addition, we discuss the advantages and shortcomings of deploying the framework on a public cloud provider, benchmark it against other popular available solutions (which are not always the best fit for such ad hoc applications), and describe some user experiences with the framework, which was employed in a number of dedicated astronomical data analysis techniques workshops.

Introduction

Computer processing capabilities have been growing at a fast pace, following Moore’s law, i.e. roughly doubling every two years over the last decades. The amount of data under management has also been growing at the same time, as disk storage becomes cheaper. Companies like Google, Facebook, Twitter and LinkedIn nowadays deal with ever larger data sets, which need to be queried on-line by users and must also answer business-related questions for decision making. As instrumentation and sensors are built from essentially the same technology as computing hardware, the same growth has taken place in science, as can be seen in projects like the Human Genome Project, meteorological data collection, and astronomical missions and telescopes such as Gaia [1], Euclid [2], the Large Synoptic Survey Telescope (LSST) [3] or the Square Kilometre Array (SKA) [4], which will produce data sets ranging from a petabyte for the entire mission in the case of Gaia to 10 petabytes of reduced data per day for the SKA.

Furthermore, raw data (re-)analysis is becoming an asset for scientific research, as it opens up new possibilities for scientists that may lead to more accurate results, enlarging the scientific return of every mission. To cope with such large amounts of data, the approach has to differ from the traditional one, in which the data are first retrieved and only afterwards analyzed (even remotely). One option is to move to Cloud environments, where data analysis workflows can be uploaded so that they run in a low-latency environment and can access every single bit of information.

A great deal of research has addressed these challenges, and new computing paradigms have lately appeared, such as NoSQL databases, which relax transactional constraints, and other Massively Parallel Processing (MPP) techniques such as MapReduce [5]. These architectures emphasize the scalability and availability of the system over the structure of the information and the savings in storage hardware that structure may bring. In this way the scale-up of problems is kept reasonably close to the theoretical linear case, allowing more complex problems to be tackled by investing in hardware rather than in new software developments, which are always far more expensive. An interesting feature of this new type of data management system (MapReduce) is that it does not impose a declarative language (i.e. SQL); instead, users can plug in their algorithms, whatever programming language they are written in, and let them run over every single record of the data set (always by brute force in MapReduce, although this can be worked around if needed by grouping the input data into different paths using certain constraints). This can also be accomplished to some extent in traditional SQL databases through User Defined Functions (UDFs), although code porting is always an issue, as it depends heavily on the peculiarities of the database, and debugging is not straightforward [6]. However, scientists and many application developers are more experienced at, or feel more comfortable with, embedding their algorithms in a piece of software (i.e. a framework) that sits on top of the distributed system, without caring much about what goes on behind the scenes or about the details of the underlying system.
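To make the plug-in model described above more concrete, the minimal Hadoop (Java) sketch below shows a user-supplied Mapper and Reducer visiting every record of a text input and aggregating counts per key, while the infrastructure takes care of distribution, shuffling and fault tolerance. The class names and the CSV-style key extraction are illustrative assumptions and are not code from the framework presented in this paper.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Minimal Hadoop MapReduce job: user code is plugged into the map and
    // reduce phases and visits every input record by brute force.
    public class RecordCount {

        // The map phase runs arbitrary user logic on each record and emits
        // (key, 1) pairs.
        public static class CountMapper
                extends Mapper<LongWritable, Text, Text, LongWritable> {
            private static final LongWritable ONE = new LongWritable(1);
            private final Text outKey = new Text();

            @Override
            protected void map(LongWritable offset, Text record, Context context)
                    throws IOException, InterruptedException {
                // Illustrative choice: group records by their first CSV field.
                outKey.set(record.toString().split(",")[0]);
                context.write(outKey, ONE);
            }
        }

        // The reduce phase aggregates all counts emitted for the same key.
        public static class CountReducer
                extends Reducer<Text, LongWritable, Text, LongWritable> {
            @Override
            protected void reduce(Text key, Iterable<LongWritable> counts, Context context)
                    throws IOException, InterruptedException {
                long total = 0;
                for (LongWritable c : counts) {
                    total += c.get();
                }
                context.write(key, new LongWritable(total));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "record count");
            job.setJarByClass(RecordCount.class);
            job.setMapperClass(CountMapper.class);
            job.setCombinerClass(CountReducer.class);
            job.setReducerClass(CountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The same pattern applies regardless of what the map function actually computes: any user-defined Java code can run on each record, which is the flexibility that higher level layers, including the hypercube framework, build on.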

Furthermore, multidimensional hypercubes and histograms are among the most widely used tools in data mining and statistics, as they can summarize diverse and complex phenomena (at a coarser or finer granularity) through a graphical representation of the data being analyzed, no matter how large the data set is. These tools are useful in a wide range of disciplines, in particular in science and astronomy, as they allow the study of certain features and their variation as a function of other factors, as well as aggregations for data classification, pivot tables comparing two dimensions, etc. They also help scientists validate the generated data sets and check whether they fall within the values expected from the model, or vice versa (which is also applicable to simulations).

As a multidimensional histogram can be considered a very simple hypercube, normally with one, two or three dimensions (often for visualization purposes) and whose only measure is the count of objects for given concrete values (or ranges) of its dimensions, we will generally refer to hypercubes throughout the paper and will only mention histograms when the above conditions apply (hypercubes with one to three dimensions whose only measure is the object count).
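As a self-contained illustration of this relationship (independent of the framework’s actual API), the sketch below implements a one-dimensional histogram, i.e. the simplest hypercube, whose single measure is the object count per bin; the class name and the uniform binning scheme are assumptions made for the example.

    import java.util.Map;
    import java.util.TreeMap;

    // Minimal 1-D histogram: the simplest hypercube, whose single measure is
    // the object count per bin. Uniform bin edges between min and max are an
    // assumption made for this example.
    public class SimpleHistogram {
        private final double min;
        private final double binWidth;
        private final int numBins;
        private final Map<Integer, Long> counts = new TreeMap<>();

        public SimpleHistogram(double min, double max, int numBins) {
            this.min = min;
            this.numBins = numBins;
            this.binWidth = (max - min) / numBins;
        }

        // Assign a value to its bin (the "category" along the single
        // dimension) and increment the count, i.e. the only measure here.
        public void add(double value) {
            int bin = (int) Math.floor((value - min) / binWidth);
            if (bin < 0 || bin >= numBins) {
                return;  // values outside [min, max) are simply ignored here
            }
            counts.merge(bin, 1L, Long::sum);
        }

        public Map<Integer, Long> getCounts() {
            return counts;
        }
    }

In a MapReduce setting the bin index would be emitted as the map key and the partial counts summed in the reduce phase, which is essentially the aggregation pattern that the framework generalizes to an arbitrary number of dimensions and measures.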

Previous work applying MapReduce to scientific data includes [7], where a High Energy Physics data analysis framework is embedded into the MapReduce system by means of wrappers (in the Map and Reduce phases) and external storage. The wrappers ensure that the analytical algorithms (implemented in a different programming language) can natively read the data in the framework-specific format by copying it to the local file system or to other content distribution infrastructures outside the MapReduce platform. Furthermore, [8] and [9] examine some of the current public Cloud computing infrastructures for MapReduce and study the effects and limitations of porting parallel applications to the Cloud, respectively, both from a scientific data analysis perspective. In addition, [10] shows that novel storage techniques currently used in commercial parallel DBMS (i.e. column-oriented storage) can also be applied to MapReduce workflows, producing significant improvements in the response time of the data processing as well as in the compression ratios achieved for randomly generated data sets. Last but not least, several general-purpose layers on top of Hadoop (e.g. Pig [11] and Hive [12]) have lately appeared, aiming at processing and querying large data sets without dealing directly with the lower level API of Hadoop, but instead using a declarative language that gets translated into MapReduce jobs.

This paper is structured as follows. In Section 2, we present the simulated data set that will be used throughout the paper and some simple but useful examples that can be built with the framework. Section 3 describes the framework internals. In Section 4, we show the experiments carried out, analyzing the deployment on a public Cloud provider, examining the data storage models (including the column-oriented approach) and compression techniques, and benchmarking against two other well known approaches. Section 5 puts forward some user experiences from astronomical data analysis techniques workshops. Finally, Sections 6 and 7 present the conclusions and future work, respectively.

Section snippets

Data analysis in the Gaia mission

In the case of the Gaia mission, many histograms will be produced for each data release in order to summarize and document the resulting catalogs. Furthermore, a lot of density maps will have to be computed, e.g. for visualization purposes, as otherwise it would be impossible to plot such a large number of objects. All these histograms and plots (see [13] for examples), the so-called precomputed statistics, will have to be (re)generated in the shortest period of time and this will imply a load

Framework description

The framework (implemented in Java) has been conceived considering the following features:

  • Thin layer on top of Hadoop that allows users or external tools to focus only on the definition of the hypercubes to compute.

  • Hide all the complexity of this novel computing paradigm and of the distributed system on which it runs, thus providing a way to use a cutting-edge distributed system (Hadoop) without any knowledge of Big Data internals.

  • Possibility to process as many hypercubes as

Cloud deployment

Recently there has been a blossoming of commercial Cloud computing service providers, for example Amazon Web Services (AWS), Google Compute Engine, Rackspace Cloud, Microsoft Azure and several other companies or products sometimes focused on different needs (Dropbox, Google Drive, etc.). AWS has become one of the main actors in this Cloud market, offering a wide range of services such as the ones that have been used for this work: Amazon Elastic MapReduce (Amazon EMR

User experience

The hypercube generation framework is packaged as a JAR (Java Archive) file and has a few dependencies on other packages (mainly on those of the Hadoop distribution). To make use of the framework, the user has to write some code that sets which hypercubes to compute (see Listing 1), as well as any extra code that will be executed by the framework, e.g. when computing the concrete categories or values for each hypercube and input record in the data set being processed. Furthermore, the user is
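Listing 1 itself is not reproduced in this excerpt. Purely to illustrate the kind of user code described above, the hypothetical sketch below declares a two-dimensional hypercube (a sky density map) over a simulated star record; every class, field and method name in it is invented for the example and does not correspond to the framework’s real API.

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Hypothetical sketch: the user only declares which hypercubes to compute
    // and how each input record maps onto their dimensions. All names below
    // are invented for illustration, not taken from the framework.
    public class HypercubeJobSketch {

        // A star record as it might appear in the simulated Gaia data set.
        static class Star {
            final double galacticLongitude;
            final double galacticLatitude;
            Star(double l, double b) {
                this.galacticLongitude = l;
                this.galacticLatitude = b;
            }
        }

        // Extracts one dimension value (category) from an input record.
        interface DimensionExtractor {
            double extract(Star star);
        }

        // A hypercube definition: ordered, named dimensions with their
        // extractors; the measure is implicitly the object count.
        static class HypercubeDefinition {
            final String name;
            final Map<String, DimensionExtractor> dimensions;
            HypercubeDefinition(String name, Map<String, DimensionExtractor> dimensions) {
                this.name = name;
                this.dimensions = dimensions;
            }
        }

        public static void main(String[] args) {
            // A sky density map: two dimensions, object count as the only measure.
            Map<String, DimensionExtractor> dims = new LinkedHashMap<>();
            dims.put("l", new DimensionExtractor() {
                public double extract(Star star) { return star.galacticLongitude; }
            });
            dims.put("b", new DimensionExtractor() {
                public double extract(Star star) { return star.galacticLatitude; }
            });
            HypercubeDefinition densityMap = new HypercubeDefinition("sky_density", dims);

            // In the real framework such a definition would be handed over to a
            // driver that turns it into a Hadoop MapReduce job; here we just
            // print its name to keep the sketch self-contained.
            System.out.println("Would compute hypercube: " + densityMap.name);
        }
    }

In the actual framework, the equivalent definition (Listing 1 in the full text) is handed to Hadoop through the framework’s own classes, so the user never touches the MapReduce API directly.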

Conclusions

In this paper, we have presented a framework that allows us to easily build data mining hypercubes. This framework fills the gap between the computer science and scientific (e.g. astrophysical) communities, easing the adoption of cutting-edge technologies for tackling scientific research challenges not addressed before. In this respect, the framework adds a layer on top of the Hadoop MapReduce infrastructure so that scientific software engineers can focus on the algorithms themselves

Future work

There are several lines of work that have been opened by this research, including:

  • Extensions or internal optimizations of the framework.

  • Benchmarking against other possible solutions such as the ones provided in the data mining extensions of commercial (parallel) DBMS, or other solutions being currently developed within the Hadoop ecosystem which aim at providing near-real time responses for queries that return small data sets.

  • Use other already existing implementations of the column-based

Acknowledgments

This research was partially supported by the Ministerio de Ciencia e Innovación (Spanish Ministry of Science and Innovation), through the research grants TIN2012-31518, AYA2009-14648-C02-01 and CONSOLIDER CSD2007-00050. The GUMS simulations were run on the MareNostrum supercomputer at the Barcelona Supercomputing Center – Centro Nacional de Supercomputación.

References (18)

  • F. Mignard, Overall science goals of the Gaia mission.
  • Euclid: Mapping the geometry of the dark Universe, Definition Study Report, Tech. rep., European Space Agency/SRE, July 2011.
  • Z. Ivezic, J.A. Tyson, LSST: from science drivers to reference design and anticipated data products, ArXiv e-prints...
  • P. Dewdney et al., The square kilometre array, Proc. IEEE (2009).
  • J. Dean et al., MapReduce: simplified data processing on large clusters, Commun. ACM (2008).
  • A. Pavlo et al., A comparison of approaches to large-scale data analysis.
  • J. Ekanayake et al., MapReduce for data intensive scientific analyses.
  • T. Gunarathne, T.-L. Wu, J. Qiu, G. Fox, MapReduce in the Clouds for science, in: 2010 IEEE Second International...
  • S.N. Srirama et al., Scalability of parallel scientific applications on the Cloud, Sci. Program. (2011).
There are more references available in the full text version of this article.
