
2016 | Book

Big Data Benchmarks, Performance Optimization, and Emerging Hardware

6th Workshop, BPOE 2015, Kohala, HI, USA, August 31 - September 4, 2015. Revised Selected Papers


About this book

This book constitutes the thoroughly revised selected papers of the 6th Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware, BPOE 2015, held in Kohala Coast, HI, USA, in August/September 2015 as a satellite event of VLDB 2015, the 41st International Conference on Very Large Data Bases.

The 8 papers presented were carefully reviewed and selected from 10 submissions. The workshop focuses on architecture and system support for big data systems, aiming to bring researchers and practitioners from the data management, architecture, and systems research communities together to discuss research issues at the intersection of these areas. The book also includes three invited papers from industrial partners: two describing tools used in system benchmarking and monitoring, and one discussing principles and methodologies in existing big data benchmarks.

Table of Contents

Frontmatter

Benchmarking

Revisiting Benchmarking Principles and Methodologies for Big Data Benchmarking
Abstract
Benchmarking, as a yardstick for system design and evaluation, has a long history and plays a pivotal role in many domains, such as database systems and high-performance computing. Through prolonged and unremitting effort, benchmarks in these domains have gradually matured. However, the distinctive properties of emerging big data scenarios, in data volume, data types, data processing requirements, and techniques, make existing benchmarks rarely appropriate for big data systems, and raise the question of how to define a good big data benchmark. In this paper, we revisit successful benchmarks in other domains from two perspectives: benchmarking principles, which define fundamental rules, and methodologies, which guide benchmark construction. We then distill a benchmarking principle and methodology for big data benchmarking from a recent open-source effort, BigDataBench.
Liutao Zhao, Wanling Gao, Yi Jin
BigDataBench-MT: A Benchmark Tool for Generating Realistic Mixed Data Center Workloads
Abstract
Long-running service workloads (e.g. web search engines) and short-term data analysis workloads (e.g. Hadoop MapReduce jobs) are co-located in today's data centers. Developing realistic benchmarks that reflect this practical scenario of mixed workloads is key to producing trustworthy results when evaluating and comparing data center systems. This requires using actual workloads as well as guaranteeing that their submissions follow patterns hidden in real-world traces. However, existing benchmarks either generate actual workloads based on probability models, or replay real-world workload traces using basic I/O operations. To fill this gap, we propose a benchmark tool that is a first step towards generating a mix of actual service and data analysis workloads on the basis of real workload traces. Our tool includes a combiner that replays actual workloads according to the workload traces, and a multi-tenant generator that flexibly scales the workloads up and down according to users' requirements. Based on this, our demo illustrates the workload customization and generation process through a visual interface. The proposed tool, called BigDataBench-MT, is a multi-tenant version of our comprehensive benchmark suite BigDataBench and is publicly available from http://prof.ict.ac.cn/BigDataBench/multi-tenancyversion/.
Rui Han, Shulin Zhan, Chenrong Shao, Junwei Wang, Lizy K. John, Jiangtao Xu, Gang Lu, Lei Wang

Benchmarking and Workload Characterization

Towards a Big Data Benchmarking and Demonstration Suite for the Online Social Network Era with Realistic Workloads and Live Data
Abstract
The growing popularity of online social networks has taken big data analytics into uncharted territory. Newly developed platforms and analytics in these environments are in dire need of customized frameworks for evaluation and demonstration. This paper presents the first big data benchmark centering on online social network analytics and their underlying distributed platforms. The benchmark comprises a novel data generator rooted in live online social network feeds, a uniquely comprehensive set of online social network analytics workloads, and evaluation metrics that are both system-aware and analytics-aware. In addition, the benchmark provides application plug-ins that allow for compelling demonstrations of big data solutions. We describe the benchmark design challenges, an early prototype, and three use cases.
Rui Zhang, Irene Manotas, Min Li, Dean Hildebrand
On Statistical Characteristics of Real-Life Knowledge Graphs
Abstract
The success of open-access knowledge graphs, such as YAGO, and commercial products, such as the Google Knowledge Graph, has attracted much attention from both academic and industrial communities to building common-sense and domain-specific knowledge graphs. A natural question arises: how can a large-scale knowledge graph be managed effectively and efficiently? Although systems and technologies that use relational storage engines or native graph database management systems have been proposed, no widely accepted solution exists. Therefore, a benchmark for knowledge graph management is required.
In this paper, we analyze the requirements of benchmarking knowledge graph management from a specific yet important point of view: the characteristics of knowledge graph data. Seventeen statistical features of four knowledge graphs as well as two social networks are studied. We show that although these graphs exhibit similar structures, their small differences may lead to totally different storage and indexing strategies, and should not be overlooked. Finally, based on this study, we put forward requirements for seed datasets and synthetic data generators for benchmarking knowledge graph management.
Wenliang Cheng, Chengyu Wang, Bing Xiao, Weining Qian, Aoying Zhou
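As a rough illustration of the kind of statistical feature such a study measures, the sketch below computes a few degree statistics of a graph given as a directed edge list. The toy edge list and the `degree_stats` helper are hypothetical, not the paper's seventeen features:

```python
from collections import Counter

def degree_stats(edges):
    """Summary statistics of a directed graph given as (src, dst) pairs,
    e.g. subject/object pairs extracted from knowledge-graph triples."""
    out_deg = Counter(src for src, _ in edges)
    in_deg = Counter(dst for _, dst in edges)
    nodes = set(out_deg) | set(in_deg)
    return {
        "nodes": len(nodes),
        "edges": len(edges),
        "avg_out_degree": len(edges) / len(nodes),
        "max_out_degree": max(out_deg.values()),
        "max_in_degree": max(in_deg.values()),
    }

# A toy edge list standing in for subject->object links in a knowledge graph.
edges = [("alice", "bob"), ("alice", "carol"),
         ("bob", "carol"), ("carol", "alice")]
stats = degree_stats(edges)
```

Features like these are cheap to compute in one pass over the edge list, which is why small structural differences between otherwise similar graphs are easy to quantify.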
Mbench: Benchmarking a Multicore Operating System Using Mixed Workloads
Abstract
Existing multicore operating system (OS) benchmarks provide workloads that exercise individual OS components or run alone for performance and scalability evaluation. In practice, however, a multicore machine usually hosts a mix of concurrent programs for better resource utilization. In this paper, we show that evaluation without mixed workloads is inadequate for predicting real-world behavior, especially across the spectrum of big data and latency-critical workloads.
We present Mbench, a novel benchmark suite that helps reveal the performance isolation provided by an OS. It includes not only minimally redundant micro-benchmarks, but also real applications with dynamic workloads and large data sets. All the benchmarks are integrated into an experiment control tool with components for experiment setup, control, monitoring, and analysis, and the tool can be extended to support more benchmarks. Using the suite, we demonstrate the importance of considering mixed workloads in challenging problems ranging from workload consolidation to tail latency. We plan to release Mbench under a free software license.
Gang Lu, Xinlong Lin, Runlin Zhou

Performance Optimization and Evaluation

Evolution from Shark to Spark SQL: Preliminary Analysis and Qualitative Evaluation
Abstract
Spark is a general distributed framework built on an abstraction called resilient distributed datasets (RDDs). Database analysis is one of the main kinds of workloads supported on Spark. The SQL component on Spark has evolved from Shark to Spark SQL, while the core components of Spark have also evolved considerably from the original version. We analyze in which aspects Spark has been improved to support many workloads efficiently, and whether these changes bring better performance for SQL support.
Xinhui Tian, Gang Lu, Xiexuan Zhou, Jingwei Li
How Data Volume Affects Spark Based Data Analytics on a Scale-up Server
Abstract
The sheer increase in the volume of data over the last decade has triggered research into cluster computing frameworks that enable web enterprises to extract big insights from big data. While Apache Spark is gaining popularity for its superior scale-out performance on commodity machines, the impact of data volume on the performance of Spark-based data analytics in a scale-up configuration is not well understood. We present a deep-dive analysis of Spark-based applications on a large scale-up server machine. Our analysis reveals that Spark-based data analytics are DRAM bound and do not benefit from using more than 12 cores per executor. As input data size grows, application performance degrades significantly due to a substantial increase in wait time during I/O operations and garbage collection, despite a 10% better instruction retirement rate (due to lower L1 cache misses and higher core utilization). By matching memory behaviour with the garbage collector, we improve application performance by 1.6x to 3x.
Ahsan Javed Awan, Mats Brorsson, Vladimir Vlassov, Eduard Ayguade
An Optimal Reduce Placement Algorithm for Data Skew Based on Sampling
Abstract
Owing to frequent disk I/O and large data transfers among different racks and physical nodes, intermediate data communication has become the biggest performance bottleneck in most running Hadoop systems. This paper proposes a reduce placement algorithm called CORP that schedules related map and reduce tasks on nearby nodes, clusters, or racks for data locality. Since the number of keys cannot be counted until the input data are processed by map tasks, the paper first presents a sampling algorithm based on reservoir sampling to estimate the distribution of keys in the intermediate data. By calculating the distance and cost matrices for cross-node communication, related map and reduce tasks can be scheduled to relatively near physical nodes for data locality. Experimental results show that CORP not only effectively improves the balance of reduce tasks, but also decreases job execution time through lower internal data communication.
Zhuo Tang, Wen Ma, Rui Li, Kenli Li, Keqin Li
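The reservoir-sampling step the abstract relies on can be sketched with the classic Algorithm R, which keeps a uniform sample of a stream of unknown length. The synthetic key stream below is illustrative only; CORP's actual sampler may differ in detail:

```python
import random
from collections import Counter

def reservoir_sample(stream, k):
    """Algorithm R: a uniform random sample of k items from a stream
    whose length is unknown in advance, using O(k) memory."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = random.randint(0, i)  # each item survives with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

# Estimate a skewed key distribution of map output from a small sample.
random.seed(42)
keys = (random.choice("aaab") for _ in range(100_000))  # 'a' ~75%, 'b' ~25%
sample = reservoir_sample(keys, 1_000)
estimated = {key: n / len(sample) for key, n in Counter(sample).items()}
```

The estimated frequencies from the small reservoir closely track the true skew, which is what lets a placement algorithm reason about key distribution before all map output exists.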
AAA: A Massive Data Acquisition Approach in Large-Scale System Monitoring
Abstract
The rapid development of information systems places higher demands on monitoring. Usually, a data acquisition system collects a variety of metrics from each device for real-time anomaly detection, alerting, and analysis. Achieving real-time, reliable data collection and aggregation in a data acquisition system for a large-scale system is a great challenge. In this paper, we propose an Adaptive window Acquisition Algorithm (AAA) to support data acquisition from a large number of data sources. AAA dynamically adjusts its policy according to the number of data sources and the acquisition interval to achieve better performance. The algorithm has been applied in a large management system project. Experimental results show that, with the help of the dynamic adjustment mechanism, the proposed approach provides a reliable collection service for common data acquisition systems.
Runlin Zhou, Yanfei Lv, Dongjin Fan, Hong Zhang, Chunge Zhu

Emerging Hardware

A Plugin-Based Approach to Exploit RDMA Benefits for Apache and Enterprise HDFS
Abstract
The Hadoop Distributed File System (HDFS) is popularly used by many Big Data processing frameworks, such as Hadoop MapReduce, HBase, Hive, and Spark, as their underlying storage engine. This makes the performance of HDFS a primary concern in the Big Data community. Recent studies have shown that HDFS cannot fully exploit the performance benefits of RDMA-enabled high-performance interconnects like InfiniBand. To address these performance issues, RDMA-enabled HDFS designs have been proposed in the literature that show better performance with RDMA-enabled networks. But these designs are tightly integrated with specific versions of the Apache Hadoop distribution and cannot easily be used with other Hadoop distributions. In this paper, we propose an efficient RDMA-based plugin for HDFS that can be easily integrated with various Hadoop distributions and versions, such as Apache Hadoop 2.5 and 2.6, Hortonworks HDP, and Cloudera CDH. Performance evaluations show that our plugin delivers the expected performance of the hybrid RDMA-enhanced design, up to a 3.7x improvement in TestDFSIO write, to all these distributions. We also demonstrate that our RDMA-based plugin can achieve up to a 4.6x improvement over the Mellanox R4H (RDMA for HDFS) plugin.
Adithya Bhat, Nusrat Sharmin Islam, Xiaoyi Lu, Md. Wasi-ur-Rahman, Dipti Shankar, Dhabaleswar K. (DK) Panda
Stream-Based Lossless Data Compression Hardware Using Adaptive Frequency Table Management
Abstract
To handle big data efficiently, the inter- and intra-system data paths of high-performance computing systems must sustain very high communication speeds. However, implementing such high-speed data paths has become complex due to electrical difficulties such as noise, crosstalk, and reflection on a single copper-based physical wire. This paper proposes a novel hardware solution that applies a stream-based data compression algorithm called LCA-DLT. The algorithm compresses a continuous data stream without exchanging the symbol lookup table between the compressor and the decompressor. It includes dynamic frequency management of data patterns, implemented as dynamic histogram creation optimized for hardware. When combined with a dedicated communication protocol, LCA-DLT supports remote data migration among computing systems. This paper describes the algorithm design and hardware implementation of LCA-DLT, and reports its compression performance, including the required hardware resources.
Shinichi Yamagiwa, Koichi Marumo, Hiroshi Sakamoto
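The key property the abstract describes, that compressor and decompressor maintain identical adaptive tables and therefore never transmit them, can be illustrated in software with a simple move-to-front transform. This is a generic sketch of the principle, not the LCA-DLT algorithm itself:

```python
def mtf_encode(data, alphabet):
    """Move-to-front coding: both ends start from the same table and
    update it identically on every symbol, so the table itself is
    never sent alongside the compressed stream."""
    table = list(alphabet)
    codes = []
    for sym in data:
        idx = table.index(sym)
        codes.append(idx)
        table.insert(0, table.pop(idx))  # promote recent symbol
    return codes

def mtf_decode(codes, alphabet):
    """Mirror of the encoder: replays the same table updates."""
    table = list(alphabet)
    out = []
    for idx in codes:
        sym = table[idx]
        out.append(sym)
        table.insert(0, table.pop(idx))
    return "".join(out)

codes = mtf_encode("banana", "abn")  # frequent symbols drift to low indices
text = mtf_decode(codes, "abn")
```

Because frequent symbols migrate toward index 0, the code stream is biased toward small values, which a later entropy stage can exploit; hardware schemes like the one in the paper apply the same synchronized-table idea to wider data patterns.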
Backmatter
Metadata
Title
Big Data Benchmarks, Performance Optimization, and Emerging Hardware
Editors
Jianfeng Zhan
Rui Han
Roberto V. Zicari
Copyright Year
2016
Electronic ISBN
978-3-319-29006-5
Print ISBN
978-3-319-29005-8
DOI
https://doi.org/10.1007/978-3-319-29006-5
