
2016 | Book

Big Data Benchmarking

6th International Workshop, WBDB 2015, Toronto, ON, Canada, June 16-17, 2015 and 7th International Workshop, WBDB 2015.in, New Delhi, India, December 14-15, 2015, Revised Selected Papers

Edited by: Tilmann Rabl, Raghunath Nambiar, Chaitanya Baru, Milind Bhandarkar, Meikel Poess, Saumyadipta Pyne

Publisher: Springer International Publishing

Book series: Lecture Notes in Computer Science


About this book

This book constitutes the thoroughly refereed post-workshop proceedings of the 6th International Workshop on Big Data Benchmarking, WBDB 2015, held in Toronto, ON, Canada, in June 2015, and the 7th International Workshop, WBDB 2015.in, held in New Delhi, India, in December 2015.

The 8 full papers presented in this book were carefully reviewed and selected from 22 submissions. They cover recent trends in big data and HPC convergence, new proposals for big data benchmarking, and supporting tooling and performance results.

Table of Contents

Frontmatter

Future Challenges

Frontmatter
Big Data, Simulations and HPC Convergence
Abstract
Two major trends in computing systems are the growth of high performance computing (HPC), marked in particular by an international exascale initiative, and the growth of big data, with an accompanying cloud infrastructure of dramatic and increasing size and sophistication. In this paper, we study an approach to convergence for software and applications/algorithms and show what hardware architectures it suggests. We start by dividing applications into data and model components and classifying each component (whether from Big Data or Big Compute) in the same way. This leads to 64 properties divided into four views: Problem Architecture (macro patterns); Execution Features (micro patterns); Data Source and Style; and the Processing (runtime) view. We discuss convergence software built around HPC-ABDS (High Performance Computing enhanced Apache Big Data Stack), show how Big Data and HPC (Big Simulation) concepts can be merged into a single stack, and discuss appropriate hardware.
Geoffrey Fox, Judy Qiu, Shantenu Jha, Saliya Ekanayake, Supun Kamburugamuve

New Benchmark Proposals

Frontmatter
Benchmarking Fast-Data Platforms for the Aadhaar Biometric Database
Abstract
Aadhaar is the world’s largest biometric database with a billion records, being compiled as an identity platform to deliver social services to residents of India. Aadhaar processes streams of biometric data as residents are enrolled and updated. Besides ~1 million enrollments and updates per day, up to 100 million daily biometric authentications are expected during delivery of various public services. These form critical Big Data applications, with large volumes and high velocity of data. Here, we propose a stream processing workload, based on the Aadhaar enrollment and authentication applications, as a Big Data benchmark for distributed stream processing systems. We describe the composition of these applications and characterize their task latencies, selectivities, and data rate and size distributions, based on real observations. We also validate this benchmark on Apache Storm using synthetic streams and simulated application logic. This paper offers a unique glimpse into an operational national identity infrastructure, and proposes a benchmark for “fast data” platforms to support such eGovernance applications.
Yogesh Simmhan, Anshu Shukla, Arun Verma
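To make the validation approach concrete, the following is a minimal plain-Python sketch, not the authors’ Storm topology: it replays a synthetic event stream at a fixed arrival rate through simulated task logic (a deduplicating filter followed by a fixed-delay stage standing in for a biometric matcher) and reports the selectivity and mean end-to-end latency. All names, rates, and delays are illustrative assumptions, not values from the paper.

```python
import os
import random
import time

def synthetic_stream(rate_per_sec, n_events):
    """Emit synthetic enrollment events at a fixed arrival rate."""
    interval = 1.0 / rate_per_sec
    for _ in range(n_events):
        # Duplicate ids on purpose so the dedup task drops some events.
        yield {"id": random.randrange(n_events // 2),
               "ts": time.time(),
               "payload": os.urandom(64)}
        time.sleep(interval)

seen_ids = set()

def dedup_task(event):
    """Simulated filter task: selectivity < 1 because duplicates are dropped."""
    if event["id"] in seen_ids:
        return None
    seen_ids.add(event["id"])
    return event

def match_task(event):
    """Stand-in for a biometric-match stage: a fixed service delay."""
    time.sleep(0.002)  # placeholder matcher latency, not a measured value
    return event

emitted, passed, latencies = 0, 0, []
for ev in synthetic_stream(rate_per_sec=100, n_events=500):
    emitted += 1
    ev = dedup_task(ev)
    if ev is None:
        continue
    ev = match_task(ev)
    passed += 1
    latencies.append(time.time() - ev["ts"])

print(f"selectivity: {passed / emitted:.2f}")
print(f"mean end-to-end latency: {1000 * sum(latencies) / len(latencies):.1f} ms")
```

In a real Storm deployment each task would run as a bolt with its own parallelism, but the same three quantities (arrival rate, per-task selectivity, end-to-end latency) are what the benchmark characterizes.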
Towards a General Array Database Benchmark: Measuring Storage Access
Abstract
Array databases have set out to close an important gap in data management, as multi-dimensional arrays play a key role in science and engineering data and beyond. Even more, arrays regularly contribute to the “Big Data” deluge, such as satellite images, climate simulation output, medical image modalities, cosmological simulation data, and datacubes in statistics. Array databases have proven advantageous for flexible access to massive arrays, and an increasing number of research prototypes are emerging. With the advent of more implementations, a systematic comparison becomes a worthwhile endeavor.
In this paper, we present a systematic benchmark of the storage access component of an Array DBMS. It is designed so that comparable results are produced regardless of the specific architecture and tuning. We apply this benchmark, which is available in the public domain, to three main proponents: rasdaman, SciQL, and SciDB. We present the benchmark and its design rationales, show the benchmark results, and comment on them.
George Merticariu, Dimitar Misev, Peter Baumann
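To illustrate the kind of measurement such a storage-access benchmark performs, here is a small numpy sketch, not the paper’s actual benchmark code: it stores a 2-D array on disk in a single memory-mapped file (a stand-in for an array DBMS storage layer, which would store tiles individually) and times subarray reads that are either aligned or misaligned with respect to an assumed tile size. The array and tile dimensions are illustrative assumptions.

```python
import time
import numpy as np

TILE = 256  # assumed tile (chunk) edge length
N = 4096    # array edge length

# Write a 2-D float32 array to disk via a memory map.
data = np.memmap("array.dat", dtype=np.float32, mode="w+", shape=(N, N))
data[:] = np.random.rand(N, N).astype(np.float32)
data.flush()

def timed_read(x0, y0, size):
    """Time a size x size subarray read starting at (x0, y0)."""
    arr = np.memmap("array.dat", dtype=np.float32, mode="r", shape=(N, N))
    t0 = time.perf_counter()
    block = np.array(arr[x0:x0 + size, y0:y0 + size])  # force the actual copy
    return time.perf_counter() - t0, block.nbytes

# Tile-aligned vs. deliberately misaligned accesses of the same size.
for label, origin in [("aligned", (TILE, TILE)),
                      ("misaligned", (TILE // 2, TILE // 2))]:
    elapsed, nbytes = timed_read(*origin, TILE)
    print(f"{label}: {nbytes / 1e6:.1f} MB in {elapsed * 1e3:.2f} ms")
```

In a real array DBMS a misaligned window touches several stored tiles, so varying window size, position, and dimensionality exposes how the storage layer’s tiling strategy behaves.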

Supporting Technology

Frontmatter
ALOJA: A Benchmarking and Predictive Platform for Big Data Performance Analysis
Abstract
The main goals of the ALOJA research project from BSC-MSR are to explore and automate the characterization of the cost-effectiveness of Big Data deployments. The development of the project over its first year has resulted in an open source benchmarking platform, an online public repository of results with over 42,000 Hadoop job runs, and web-based analytic tools to gather insights about systems’ cost-performance (ALOJA’s Web application, tools, and sources are available at http://aloja.bsc.es). This article describes the evolution of the project’s focus and research lines over a year of continuously benchmarking Hadoop under different configuration and deployment options, presents results, and discusses the technical and market-based motivations for these changes. During this time, ALOJA’s target has evolved from low-level profiling of the Hadoop runtime, through extensive benchmarking and aggregate evaluation of a large body of results, to leveraging Predictive Analytics (PA) techniques. Modeling benchmark executions allows us to estimate the results of new or untested configurations or hardware set-ups automatically, by learning from past observations, saving benchmarking time and costs.
Nicolas Poggi, Josep Ll. Berral, David Carrera
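The predictive-analytics idea can be sketched in a few lines: fit a model of execution time over configuration features from observed runs, then query it for an untested configuration instead of benchmarking it. The sketch below uses a plain least-squares linear model on a toy data set; the feature names, values, and runtimes are invented for illustration and are not ALOJA’s actual schema or results.

```python
import numpy as np

# Toy training set: (network, disk, mappers) -> runtime in seconds.
# Entirely illustrative; real ALOJA runs have many more dimensions.
runs = [
    ("eth", "hdd", 4, 820.0),
    ("eth", "ssd", 4, 610.0),
    ("ib",  "hdd", 8, 540.0),
    ("ib",  "ssd", 8, 390.0),
    ("eth", "hdd", 8, 700.0),
    ("ib",  "ssd", 4, 470.0),
]

def encode(net, disk, mappers):
    """Bias term, one-hot categorical features, and a numeric feature."""
    return [1.0, float(net == "ib"), float(disk == "ssd"), float(mappers)]

X = np.array([encode(n, d, m) for n, d, m, _ in runs])
y = np.array([t for *_, t in runs])

# Least-squares fit: a minimal stand-in for the learning techniques
# applied to the repository of past benchmark executions.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predict an untested configuration without running the benchmark.
untested = np.array(encode("ib", "hdd", 4))
print(f"predicted runtime: {untested @ w:.0f} s")
```

With tens of thousands of observed runs, richer models (trees, ensembles) can be fitted the same way, but the workflow is identical: encode configuration, fit on observations, predict the untested point.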

Experimental Results

Frontmatter
Benchmarking the Availability and Fault Tolerance of Cassandra
Abstract
To be able to handle big data workloads, modern NoSQL database management systems like Cassandra are designed to scale well over multiple machines. However, with each additional machine in a cluster, the likelihood of hardware failure increases. In order to still achieve high availability and fault tolerance, the data needs to be replicated within the cluster. Predictable and stable response times are required by many applications even in the case of a node failure. While Cassandra guarantees high availability, the influence of a node failure on system performance is still unclear.
In this paper, we therefore focus on the availability and fault tolerance of Cassandra. We analyze the impact of a node outage within a Cassandra cluster on the throughput and latency for different workloads. Our results show that Cassandra is well suited to achieve high availability while preserving stable response times in case of a node failure. Especially for read-intensive applications that require high availability, Cassandra is a good choice.
Marten Rosselli, Raik Niemann, Todor Ivanov, Karsten Tolle, Roberto V. Zicari
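A minimal version of this style of experiment can be scripted with the DataStax Python driver: sample read latency at a fixed interval while a node is taken down externally, and observe how the latency curve reacts. The cluster addresses, keyspace `bench`, and table `kv` below are assumptions for the sketch, not the paper’s setup; the table is assumed to have a replication factor of at least 2 so reads survive one node outage.

```python
import time
from cassandra.cluster import Cluster

# Assumed three-node local test cluster.
cluster = Cluster(["127.0.0.1", "127.0.0.2", "127.0.0.3"])
session = cluster.connect("bench")
stmt = session.prepare("SELECT value FROM kv WHERE key = ?")

# Sample read latency once per second for two minutes; stop one node
# externally mid-run (e.g. nodetool stopdaemon) and watch the curve.
start = time.time()
while time.time() - start < 120:
    t0 = time.perf_counter()
    session.execute(stmt, ["key-42"])
    latency_ms = (time.perf_counter() - t0) * 1000
    print(f"{time.time() - start:6.1f}s  {latency_ms:7.2f} ms")
    time.sleep(1)

cluster.shutdown()
```

A full benchmark would drive mixed read/write workloads at controlled rates and record throughput as well, but the core of the availability measurement is exactly this: latency over time across a deliberately injected node failure.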
Performance Evaluation of Spark SQL Using BigBench
Abstract
In this paper we present the initial results of our work to execute BigBench on Spark. First, we evaluated the scalability behavior of the existing MapReduce implementation of BigBench. Next, we executed the group of 14 pure HiveQL queries on Spark SQL and compared the results with the respective Hive ones. Our experiments show that: (1) for both Hive and Spark SQL, BigBench queries scale on average better than linearly as the data size increases, and (2) pure HiveQL queries run faster on Spark SQL than on Hive.
Todor Ivanov, Max-Georg Beer
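The mechanics of such a comparison are straightforward: because Spark SQL can read the same Hive metastore tables, the identical HiveQL text can be timed on both engines. The following PySpark sketch times one query; the query and the `store_sales` table are placeholders, not an actual BigBench query, and a real comparison would loop over all 14 pure-HiveQL queries at several scale factors.

```python
import time
from pyspark.sql import SparkSession

# Spark SQL with Hive support reads the same Hive metastore tables
# that the MapReduce/Hive run of the benchmark uses.
spark = (SparkSession.builder
         .appName("hiveql-on-sparksql")
         .enableHiveSupport()
         .getOrCreate())

# Placeholder HiveQL; substitute a benchmark query verbatim.
query = """
SELECT category, COUNT(*) AS sales
FROM store_sales
GROUP BY category
ORDER BY sales DESC
"""

t0 = time.perf_counter()
spark.sql(query).collect()  # collect() forces full execution
print(f"elapsed: {time.perf_counter() - t0:.1f} s")

spark.stop()
```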
Accelerating BigBench on Hadoop
Abstract
Benchmarking Big Data systems is an open challenge. Existing micro-benchmarks (e.g., TeraSort) do not represent real-world end-to-end scenarios. To address this, BigBench, a benchmark for Big Data analytics proposed as a candidate industry standard, has been developed. Around BigBench, we have been collaborating with the Apache open source community on performance tuning and optimization for the Hadoop ecosystem. In this paper, we share our contributions to BigBench and present our tuning and optimization experience along with benchmark results.
Yan Tang, Gowda Bhaskar, Jack Chen, Xin Hao, Yi Zhou, Yi Yao, Lifeng Wang
Backmatter
Metadata
Title
Big Data Benchmarking
Edited by
Tilmann Rabl
Raghunath Nambiar
Chaitanya Baru
Milind Bhandarkar
Meikel Poess
Saumyadipta Pyne
Copyright year
2016
Electronic ISBN
978-3-319-49748-8
Print ISBN
978-3-319-49747-1
DOI
https://doi.org/10.1007/978-3-319-49748-8