A framework for building hypercubes using MapReduce

https://doi.org/10.1016/j.cpc.2014.02.010

Abstract

The European Space Agency’s Gaia mission will create the largest and most precise three-dimensional chart of our galaxy (the Milky Way), by providing unprecedented position, parallax, proper motion, and radial velocity measurements for about one billion stars. The resulting catalog will be made available to the scientific community and will be analyzed in many different ways, including the production of a variety of statistics. The latter will often entail the generation of multidimensional histograms and hypercubes as part of the precomputed statistics for each data release, or for scientific analysis involving either the final data products or the raw data coming from the satellite instruments.

In this paper we present and analyze a generic framework that allows the hypercube generation to be easily done within a MapReduce infrastructure, providing all the advantages of the new Big Data analysis paradigm but without dealing with any specific interface to the lower level distributed system implementation (Hadoop). Furthermore, we show how executing the framework for different data storage model configurations (i.e. row or column oriented) and compression techniques can considerably improve the response time of this type of workload for the currently available simulated data of the mission.

In addition, we discuss the advantages and shortcomings of deploying the framework on a public cloud provider, benchmark it against other popular available solutions (which are not always the best fit for such ad hoc applications), and describe some user experiences with the framework, which was employed in a number of dedicated astronomical data analysis techniques workshops.

Introduction

Computer processing capabilities have been growing at a fast pace, following Moore’s law, i.e. roughly doubling every two years over the last decades. The amount of data under management has also been growing at the same time, as disk storage becomes cheaper. Companies like Google, Facebook, Twitter and LinkedIn nowadays deal with ever larger data sets, which need to be queried on-line by users and must also answer business-related questions for decision making. As instrumentation and sensors are built from essentially the same technology as computing hardware, the same growth has taken place in science, as can be seen in projects like the Human Genome Project, meteorological data collection, and astronomical missions and telescopes such as Gaia [1], Euclid [2], the Large Synoptic Survey Telescope (LSST) [3] or the Square Kilometre Array (SKA) [4], which will produce data sets ranging from a petabyte for the entire mission in the case of Gaia to 10 petabytes of reduced data per day for the SKA.

Furthermore, raw data (re-)analysis is becoming an asset for scientific research, as it opens up new possibilities for scientists that may lead to more accurate results, enlarging the scientific return of every mission. To cope with such large amounts of data, the approach has to differ from the traditional one, in which the data are first retrieved and only afterwards analyzed (even remotely). One option is to move to Cloud environments, where data analysis workflows can be uploaded so that they run in a low-latency environment and can access every single bit of information.

A great deal of research has addressed these challenges, and new computing paradigms have lately appeared, such as NoSQL databases, which relax transactional constraints, and other Massively Parallel Processing (MPP) techniques such as MapReduce [5]. These architectures emphasize the scalability and availability of the system over the structure of the information and the savings in storage hardware that structure may bring. In this way the scale-up of problems is kept reasonably close to the theoretical linear case, allowing more complex problems to be tackled by investing in hardware rather than in new software developments, which are always far more expensive. An interesting feature of this new type of data management system (MapReduce) is that it does not impose a declarative language (i.e. SQL); instead, users can plug in their algorithms, whatever programming language they are written in, and let them run over every single record of the data set (always by brute force in MapReduce, although this can be worked around if needed by grouping the input data into different paths using certain constraints). This can also be accomplished to some extent in traditional SQL databases through User Defined Functions (UDFs), although code porting is always an issue, as it depends heavily on the peculiarities of the database, and debugging is not straightforward [6]. However, scientists and many application developers are more experienced at, or feel more comfortable with, embedding their algorithms in a piece of software (i.e. a framework) that sits on top of the distributed system, without caring much about what goes on behind the scenes or about the details of the underlying system.
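To make the plug-in model described above more concrete, the minimal Hadoop (Java) sketch below shows a user-supplied Mapper and Reducer visiting every record of a text input and aggregating counts per key, while the infrastructure takes care of distribution, shuffling and fault tolerance. The class names and the CSV-style key extraction are illustrative assumptions and are not code from the framework presented in this paper.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Minimal Hadoop MapReduce job: user code is plugged into the map and
    // reduce phases and visits every input record by brute force.
    public class RecordCount {

        // The map phase runs arbitrary user logic on each record and emits
        // (key, 1) pairs.
        public static class CountMapper
                extends Mapper<LongWritable, Text, Text, LongWritable> {
            private static final LongWritable ONE = new LongWritable(1);
            private final Text outKey = new Text();

            @Override
            protected void map(LongWritable offset, Text record, Context context)
                    throws IOException, InterruptedException {
                // Illustrative choice: group records by their first CSV field.
                outKey.set(record.toString().split(",")[0]);
                context.write(outKey, ONE);
            }
        }

        // The reduce phase aggregates all counts emitted for the same key.
        public static class CountReducer
                extends Reducer<Text, LongWritable, Text, LongWritable> {
            @Override
            protected void reduce(Text key, Iterable<LongWritable> counts, Context context)
                    throws IOException, InterruptedException {
                long total = 0;
                for (LongWritable c : counts) {
                    total += c.get();
                }
                context.write(key, new LongWritable(total));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "record count");
            job.setJarByClass(RecordCount.class);
            job.setMapperClass(CountMapper.class);
            job.setCombinerClass(CountReducer.class);
            job.setReducerClass(CountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The same pattern applies regardless of what the map function actually computes: any user-defined Java code can run on each record, which is the flexibility that higher level layers, including the hypercube framework, build on.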

Furthermore, multidimensional hypercubes and histograms are among the most widely used tools in data mining and statistics, as they can summarize diverse and complex phenomena (at a coarser or finer granularity) through a graphical representation of the data being analyzed, no matter how large the data set is. These tools are useful in a wide range of disciplines, in particular in science and astronomy, as they allow the study of certain features and their variation as a function of other factors, as well as aggregations for data classification, pivot tables comparing two dimensions, etc. They also help scientists validate the generated data sets and check whether they fall within the values expected from the model, or vice versa (which is also applicable to simulations).

As a multidimensional histogram can be considered a very simple hypercube, normally with one, two or three dimensions (often for visualization purposes) and whose only measure is the count of objects for given concrete values (or ranges) of its dimensions, we will generally refer to hypercubes throughout the paper and will only mention histograms when the above conditions apply (hypercubes with one to three dimensions whose only measure is the object count).
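As a self-contained illustration of this relationship (independent of the framework’s actual API), the sketch below implements a one-dimensional histogram, i.e. the simplest hypercube, whose single measure is the object count per bin; the class name and the uniform binning scheme are assumptions made for the example.

    import java.util.Map;
    import java.util.TreeMap;

    // Minimal 1-D histogram: the simplest hypercube, whose single measure is
    // the object count per bin. Uniform bin edges between min and max are an
    // assumption made for this example.
    public class SimpleHistogram {
        private final double min;
        private final double binWidth;
        private final int numBins;
        private final Map<Integer, Long> counts = new TreeMap<>();

        public SimpleHistogram(double min, double max, int numBins) {
            this.min = min;
            this.numBins = numBins;
            this.binWidth = (max - min) / numBins;
        }

        // Assign a value to its bin (the "category" along the single
        // dimension) and increment the count, i.e. the only measure here.
        public void add(double value) {
            int bin = (int) Math.floor((value - min) / binWidth);
            if (bin < 0 || bin >= numBins) {
                return;  // values outside [min, max) are simply ignored here
            }
            counts.merge(bin, 1L, Long::sum);
        }

        public Map<Integer, Long> getCounts() {
            return counts;
        }
    }

In a MapReduce setting the bin index would be emitted as the map key and the partial counts summed in the reduce phase, which is essentially the aggregation pattern that the framework generalizes to an arbitrary number of dimensions and measures.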

Previous work applying MapReduce to scientific data includes [7], where a High Energy Physics data analysis framework is embedded into the MapReduce system by means of wrappers (in the Map and Reduce phases) and external storage. The wrappers ensure that the analytical algorithms (implemented in a different programming language) can natively read the data in the framework-specific format by copying it to the local file system or to other content distribution infrastructures outside the MapReduce platform. Furthermore, [8] and [9] examine some of the current public Cloud computing infrastructures for MapReduce and study the effects and limitations of porting parallel applications to the Cloud, respectively, both from a scientific data analysis perspective. In addition, [10] shows that novel storage techniques currently used in commercial parallel DBMS (i.e. column-oriented storage) can also be applied to MapReduce workflows, producing significant improvements in the response time of the data processing as well as in the compression ratios achieved for randomly generated data sets. Last but not least, several general-purpose layers on top of Hadoop (e.g. Pig [11] and Hive [12]) have lately appeared, aiming at processing and querying large data sets without dealing directly with the lower level API of Hadoop, but instead using a declarative language that gets translated into MapReduce jobs.

This paper is structured as follows. In Section 2, we present the simulated data set that will be used throughout the paper and some simple but useful examples that can be built with the framework. Section 3 describes the framework internals. In Section 4, we show the experiments carried out, analyzing the deployment on a public Cloud provider, examining the data storage models (including the column-oriented approach) and compression techniques, and benchmarking against two other well known approaches. Section 5 puts forward some user experiences from astronomical data analysis techniques workshops. Finally, Sections 6 and 7 present the conclusions and future work, respectively.

Section snippets

Data analysis in the Gaia mission

In the case of the Gaia mission, many histograms will be produced for each data release in order to summarize and document the resulting catalogs. Furthermore, a lot of density maps will have to be computed, e.g. for visualization purposes, as otherwise it would be impossible to plot such a large number of objects. All these histograms and plots (see [13] for examples), the so-called precomputed statistics, will have to be (re)generated in the shortest period of time and this will imply a load

Framework description

The framework (implemented in Java) has been conceived considering the following features:

  • Thin layer on top of Hadoop that allows users or external tools to focus only on the definition of the hypercubes to compute.

  • Hide all the complexity of this novel computing paradigm and of the distributed system on which it runs, thus providing a way to use a cutting-edge distributed system (Hadoop) without any knowledge of Big Data internals.

  • Possibility to process as many hypercubes as

Cloud deployment

Recently there has been a blossoming of commercial Cloud computing service providers, for example Amazon Web Services (AWS), Google Compute Engine, Rackspace Cloud, Microsoft Azure and several other companies or products sometimes focused on different needs (Dropbox, Google Drive, etc.). AWS has become one of the main actors in this Cloud market, offering a wide range of services such as the ones that have been used for this work: Amazon Elastic MapReduce (Amazon EMR

User experience

The hypercube generation framework is packaged as a JAR (Java Archive) file and has a few dependencies on other packages (mainly on those of the Hadoop distribution). To make use of the framework, the user has to write some code that sets which hypercubes to compute (see Listing 1), as well as any extra code that will be executed by the framework, e.g. when computing the concrete categories or values for each hypercube and input record in the data set being processed. Furthermore, the user is
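Listing 1 itself is not reproduced in this excerpt. Purely to illustrate the kind of user code described above, the hypothetical sketch below declares a two-dimensional hypercube (a sky density map) over a simulated star record; every class, field and method name in it is invented for the example and does not correspond to the framework’s real API.

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Hypothetical sketch: the user only declares which hypercubes to compute
    // and how each input record maps onto their dimensions. All names below
    // are invented for illustration, not taken from the framework.
    public class HypercubeJobSketch {

        // A star record as it might appear in the simulated Gaia data set.
        static class Star {
            final double galacticLongitude;
            final double galacticLatitude;
            Star(double l, double b) {
                this.galacticLongitude = l;
                this.galacticLatitude = b;
            }
        }

        // Extracts one dimension value (category) from an input record.
        interface DimensionExtractor {
            double extract(Star star);
        }

        // A hypercube definition: ordered, named dimensions with their
        // extractors; the measure is implicitly the object count.
        static class HypercubeDefinition {
            final String name;
            final Map<String, DimensionExtractor> dimensions;
            HypercubeDefinition(String name, Map<String, DimensionExtractor> dimensions) {
                this.name = name;
                this.dimensions = dimensions;
            }
        }

        public static void main(String[] args) {
            // A sky density map: two dimensions, object count as the only measure.
            Map<String, DimensionExtractor> dims = new LinkedHashMap<>();
            dims.put("l", new DimensionExtractor() {
                public double extract(Star star) { return star.galacticLongitude; }
            });
            dims.put("b", new DimensionExtractor() {
                public double extract(Star star) { return star.galacticLatitude; }
            });
            HypercubeDefinition densityMap = new HypercubeDefinition("sky_density", dims);

            // In the real framework such a definition would be handed over to a
            // driver that turns it into a Hadoop MapReduce job; here we just
            // print its name to keep the sketch self-contained.
            System.out.println("Would compute hypercube: " + densityMap.name);
        }
    }

In the actual framework, the equivalent definition (Listing 1 in the full text) is handed to Hadoop through the framework’s own classes, so the user never touches the MapReduce API directly.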

Conclusions

In this paper, we have presented a framework that allows us to easily build data mining hypercubes. This framework fills the gap between the computer science and scientific (e.g. astrophysical) communities, easing the adoption of cutting-edge technologies for tackling scientific research challenges not addressed before. In this respect, the framework adds a layer on top of the Hadoop MapReduce infrastructure so that scientific software engineers can focus on the algorithms themselves

Future work

There are several lines of work that have been opened by this research, including:

  • Extensions or internal optimizations of the framework.

  • Benchmarking against other possible solutions such as the ones provided in the data mining extensions of commercial (parallel) DBMS, or other solutions being currently developed within the Hadoop ecosystem which aim at providing near-real time responses for queries that return small data sets.

  • Use other already existing implementations of the column-based

Acknowledgments

This research was partially supported by the Ministerio de Ciencia e Innovación (Spanish Ministry of Science and Innovation), through the research grants TIN2012-31518, AYA2009-14648-C02-01 and CONSOLIDER CSD2007-00050. The GUMS simulations were run on the MareNostrum supercomputer at the Barcelona Supercomputing Center – Centro Nacional de Supercomputación.

References (18)

  • F. Mignard, Overall science goals of the Gaia mission.
  • Euclid: Mapping the geometry of the dark Universe, Definition Study Report, Tech. rep., European Space Agency/SRE, July 2011.
  • Z. Ivezic, J.A. Tyson, LSST: from science drivers to reference design and anticipated data products, ArXiv e-prints...
  • P. Dewdney et al., The square kilometre array, Proc. IEEE (2009).
  • J. Dean et al., MapReduce: simplified data processing on large clusters, Commun. ACM (2008).
  • A. Pavlo et al., A comparison of approaches to large-scale data analysis.
  • J. Ekanayake et al., MapReduce for data intensive scientific analyses.
  • T. Gunarathne, T.-L. Wu, J. Qiu, G. Fox, MapReduce in the Clouds for science, in: 2010 IEEE Second International...
  • S.N. Srirama et al., Scalability of parallel scientific applications on the Cloud, Sci. Program. (2011).
There are more references available in the full text version of this article.
