
About this book

This book constitutes the thoroughly refereed post-conference proceedings of the 7th TPC Technology Conference on Performance Evaluation and Benchmarking, TPCTC 2015, held in conjunction with the 41st International Conference on Very Large Data Bases (VLDB 2015) in Kohala Coast, Hawaii, USA, in August/September 2015.

The 8 papers presented here, together with 1 keynote and 1 vision paper, were carefully reviewed and selected from 24 submissions. Many buyers use TPC benchmark results as points of comparison when purchasing new computing systems. The information technology landscape is evolving at a rapid pace, challenging industry experts and researchers to develop innovative techniques for the evaluation, measurement, and characterization of complex systems. The TPC remains committed to developing new benchmark standards to keep pace, and one vehicle for achieving this objective is the sponsorship of the Technology Conference on Performance Evaluation and Benchmarking (TPCTC).

Table of Contents


Reinventing the TPC: From Traditional to Big Data to Internet of Things

The Transaction Processing Performance Council (TPC) has made significant contributions to industry and research with standards that encourage fair competition to accelerate product development and enhancement. Technology disruptions are changing the industry landscape faster than ever. This paper provides a high-level summary of the history of the TPC and of recent initiatives to ensure that it remains a relevant organization in the age of digital transformation fueled by Big Data and the Internet of Things.
Raghunath Nambiar, Meikel Poess

Pocket Data: The Need for TPC-MOBILE

Embedded database engines such as SQLite provide a convenient data persistence layer and have spread along with the applications using them to many types of systems, including interactive devices such as smartphones. Android, the most widely-distributed smartphone platform, both uses SQLite internally and provides interfaces encouraging apps to use SQLite to store their own private structured data. As similar functionality appears in all major mobile operating systems, embedded database performance affects the response times and resource consumption of billions of smartphones and the millions of apps that run on them—making it more important than ever to characterize smartphone embedded database workloads. To do so, we present results from an experiment which recorded SQLite activity on 11 Android smartphones during one month of typical usage. Our analysis shows that Android SQLite usage produces queries and access patterns quite different from canonical server workloads. We argue that evaluating smartphone embedded databases will require a new benchmarking suite and we use our results to outline some of its characteristics.
Oliver Kennedy, Jerry Ajay, Geoffrey Challen, Lukasz Ziarek
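The kind of workload recording the paper describes can be illustrated, purely as a sketch (this is not the authors' instrumentation; the table and statements below are invented examples), with Python's built-in sqlite3 trace hook, which invokes a callback for every SQL statement the engine runs:

```python
import sqlite3

log = []
# isolation_level=None puts the connection in autocommit mode, so only
# our own statements (no implicit BEGINs) appear in the trace.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.set_trace_callback(log.append)  # record every executed statement

conn.execute("CREATE TABLE prefs (key TEXT PRIMARY KEY, value TEXT)")
conn.execute("INSERT INTO prefs VALUES (?, ?)", ("theme", "dark"))
rows = conn.execute("SELECT value FROM prefs WHERE key = ?", ("theme",)).fetchall()
conn.close()

# Classify the recorded workload, as the study does at much larger scale.
reads = sum(1 for s in log if s.lstrip().upper().startswith("SELECT"))
writes = sum(1 for s in log
             if s.lstrip().upper().startswith(("INSERT", "UPDATE", "DELETE")))
```

A study such as the one described would gather statements like these across many apps and devices, then analyze the read/write mix and access patterns.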

SparkBench – A Spark Performance Testing Suite

Spark has emerged as an easy-to-use, scalable, robust, and fast system for analytics, with a rapidly growing and vibrant community of users and contributors. It is multipurpose, with extensive and modular infrastructure for machine learning, graph processing, SQL, streaming, statistical processing, and more. Its rapid adoption therefore calls for a performance assessment suite that supports agile development, measurement, validation, optimization, configuration, and deployment decisions across a broad range of platform environments and test cases.
Recognizing the need for such comprehensive and agile testing, this paper proposes going beyond existing performance tests for Spark and creating an expanded Spark performance testing suite. This proposal describes several desirable properties flowing from the larger scale, greater and evolving variety, and nuanced requirements of different applications of Spark. The paper identifies the major areas of performance characterization, and the key methodological aspects that should be factored into the design of the proposed suite. The objective is to capture insights from industry and academia on how to best characterize capabilities of Spark-based analytic platforms and provide cost-effective assessment of optimization opportunities in a timely manner.
Dakshi Agrawal, Ali Butt, Kshitij Doshi, Josep-L. Larriba-Pey, Min Li, Frederick R. Reiss, Francois Raab, Berni Schiefer, Toyotaro Suzumura, Yinglong Xia

NUMA-Aware Memory Management with In-Memory Databases

Writing enterprise-grade software for multi-processor systems is an interesting challenge, since such a system involves a multitude of hardware components that come into conflict under simultaneous access by uncoordinated software threads of user applications. The problem is particularly compounded in the in-memory paradigm that underpins applications such as modern data management. With the emergence of distributed hardware designs such as Non-Uniform Memory Access (NUMA), where access times to a system's physical address space depend on the location of memory relative to the CPU, it is crucial to rethink the placement of a user process' working memory with respect to its executing threads. We present a few novel techniques from our heap-management work on SAP HANA, as part of our goal of building strong NUMA awareness into in-memory databases. Our work primarily focuses on providing a robust and performant memory-management framework on the Linux OS by handling the complexity and challenges of enabling enterprise software to live on a distributed memory landscape. One important outcome of our approach is a rich set of kernel APIs that give fine-granular control to higher DBMS layers, such as Store and Query, for educated placement of their relational data structures. However, the generality of our techniques allows them to be readily applied to other domains that must deal with the NUMA performance penalty.
Mehul Wagle, Daniel Booss, Ivan Schreter, Daniel Egenolf

Big-SeqDB-Gen: A Formal and Scalable Approach for Parallel Generation of Big Synthetic Sequence Databases

The recognition that data is of great economic value, together with significant hardware advances in low-cost data storage, high-speed networks, and high-performance parallel computing, fosters new research directions in large-scale knowledge discovery from big sequence databases. There are many applications involving sequence databases, such as customer shopping sequences, web clickstreams, and biological sequences, and all of them face the big data problem. There is no doubt that fast mining of billions of sequences is a challenge. However, because big data sets are not publicly available, it is not possible to assess knowledge discovery algorithms over big sequence databases: for both privacy and security reasons, companies do not disclose their data. On the other hand, existing synthetic sequence generators are not up to the big data challenge.
In this paper, we first propose a formal and scalable approach for the parallel generation of big synthetic sequence databases. Based on Whitney numbers, the underlying Parallel Sequence Generator (i) creates billions of distinct sequences in parallel and (ii) ensures that injected sequential patterns satisfy user-specified sequence characteristics. Second, we report a scalability and scale-out performance study of the Parallel Sequence Generator for various sequence database sizes and various numbers of Sequence Generators in a shared-nothing cluster of nodes.
Rim Moussa

A Benchmark Framework for Data Compression Techniques

Lightweight data compression is frequently applied in main-memory database systems to improve query performance. The data processed by such systems is highly diverse, and there is a large number of existing lightweight compression techniques. Therefore, choosing the optimal technique for a given dataset is non-trivial. Existing approaches are based on simple rules, which do not suffice for such a complex decision. In contrast, our vision is a cost-based approach. However, this requires a detailed cost model, which can only be obtained from systematic benchmarking of many compression algorithms on many different datasets. A naïve benchmark evaluates every algorithm under consideration separately, which entails many redundant steps and is thus inefficient. We propose an efficient and extensible benchmark framework for compression techniques. Given an ensemble of algorithms, it minimizes the overall run time of the evaluation. We experimentally show that our approach outperforms the naïve approach.
Patrick Damme, Dirk Habich, Wolfgang Lehner
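The inefficiency the framework targets can be illustrated with a toy sketch (the algorithms and the shared step below are invented stand-ins, not the authors' framework): a naïve benchmark repeats shared work, such as data generation, for every algorithm, whereas factoring that work out yields identical results with less total effort:

```python
def generate_dataset(size):
    """Stand-in for an expensive shared step (e.g. dataset generation)."""
    return [i // 10 for i in range(size)]  # runs of repeated values

def rle_compress(data):
    """Toy run-length encoding: (value, run-length) pairs."""
    out = []
    for v in data:
        if out and out[-1][0] == v:
            out[-1][1] += 1
        else:
            out.append([v, 1])
    return out

def delta_compress(data):
    """Toy delta encoding: first value, then successive differences."""
    return data[:1] + [b - a for a, b in zip(data, data[1:])]

ALGORITHMS = {"rle": rle_compress, "delta": delta_compress}

def naive_benchmark(size):
    """Naive evaluation: the shared step is redone for every algorithm."""
    results = {}
    for name, algo in ALGORITHMS.items():
        data = generate_dataset(size)  # redundant repeated work
        results[name] = len(algo(data))
    return results

def shared_benchmark(size):
    """Framework-style evaluation: the shared step runs exactly once."""
    data = generate_dataset(size)
    return {name: len(algo(data)) for name, algo in ALGORITHMS.items()}
```

Both evaluations report the same compressed sizes, but the shared variant performs the expensive step once instead of once per algorithm, which is the kind of redundancy the proposed framework is designed to eliminate.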

Enhancing Data Generation in TPCx-HS with a Non-uniform Random Distribution

Developed by the Transaction Processing Performance Council, the TPC Express Benchmark™ HS (TPCx-HS) is the industry's first standard for benchmarking big data systems. It is designed to provide an objective measure of hardware, operating systems, and commercial Apache Hadoop File System API-compatible software distributions, and to provide the industry with verifiable performance, price-performance, and availability metrics [1, 2]. It can be used to compare a broad range of system topologies and implementation methodologies of big data systems in a technically rigorous, directly comparable, and vendor-neutral manner. The modeled application is simple, and the results are highly relevant to hardware and software dealing with big data systems in general. Data generation is derived from TeraGen [3], which uses a uniform distribution. In this paper the authors propose a normal (Gaussian) distribution, which may be more representative of real-life datasets. The modified TeraGen and the complete changes required to the TPCx-HS kit are included as part of this paper.
Raghunath Nambiar, Tilmann Rabl, Karthik Kulkarni, Michael Frank
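The proposed change can be sketched in miniature (the key space, distribution parameters, and resampling strategy here are illustrative assumptions, not the kit's actual TeraGen modification):

```python
import random

KEY_SPACE = 2 ** 32  # hypothetical key range; real TeraGen uses 10-byte keys

def uniform_key(rng):
    """TeraGen-style generation: every key in the space is equally likely."""
    return rng.randrange(KEY_SPACE)

def gaussian_key(rng, mean=KEY_SPACE / 2, stddev=KEY_SPACE / 8):
    """Proposed alternative: keys cluster around the mean, as values in
    real-life datasets often cluster around popular values."""
    while True:  # resample until the value falls inside the key space
        k = int(rng.gauss(mean, stddev))
        if 0 <= k < KEY_SPACE:
            return k

def generate(n, keyfunc, seed=42):
    """Generate n record keys with the given key-generation strategy."""
    rng = random.Random(seed)
    return [keyfunc(rng) for _ in range(n)]
```

With these illustrative parameters, roughly 95% of Gaussian keys land within two standard deviations of the mean (the middle half of the key space), while uniform keys spread evenly, so sorting and shuffling behave differently under the two distributions.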

Rethinking Benchmarking for Data

Benchmarking has been critical to progress in the field of data, as it provides the community with a crucial mechanism for measuring and accelerating that progress.
Jignesh M. Patel

Big Data Benchmark Compendium

The field of Big Data and related technologies is rapidly evolving. Consequently, many benchmarks are emerging, driven by academia and industry alike. As these benchmarks emphasize different aspects of Big Data and, in many cases, cover different technical platforms and use cases, it is extremely difficult to keep up with the pace of benchmark creation. Moreover, given the combination of large data volumes, heterogeneous data formats, and varying processing velocity, it is hard to specify an architecture that best suits all application requirements. This makes the investigation and standardization of such systems very difficult. Therefore, the traditional way of specifying a standardized benchmark with pre-defined workloads, which has been in use for years for transaction and analytical processing systems, is not trivial to employ for Big Data systems. This paper provides a summary of existing benchmarks and those under development, gives a side-by-side comparison of their characteristics, and discusses their pros and cons. The goal is to understand the current state of Big Data benchmarking and to guide practitioners in their approaches and use cases.
Todor Ivanov, Tilmann Rabl, Meikel Poess, Anna Queralt, John Poelman, Nicolas Poggi, Jeffrey Buell

Profiling the Performance of Virtualized Databases with the TPCx-V Benchmark

The proliferation of virtualized servers in data centers has conquered the last frontier of bare-iron servers: back-end databases. The multi-tenancy issues of elasticity, capacity planning, and load variation in cloud data centers now coincide with the heavy demands of database workloads, which in turn calls for a benchmark specifically intended for this environment.
The TPCx-V benchmark will fill this need with a publicly available, end-to-end benchmark kit. Using a prototype of the kit, we profiled the performance of a server running 60 virtual machines with 48 databases of different sizes, load levels, and workloads. We will show that virtualized servers can indeed handle the elasticity and multi-tenancy requirements of the cloud, but only after careful tuning of the system configuration to avoid bottlenecks.
In this paper, we will provide a brief description of the benchmark, discuss the results and the conclusions drawn from the experiments, and propose future directions for analyzing the performance of cloud data centers by augmenting the capabilities of the TPCx-V benchmark kit.
Andrew Bond, Doug Johnson, Greg Kopczynski, H. Reza Taheri

