
About this book

This book constitutes the refereed post-conference proceedings of the 6th TPC Technology Conference, TPCTC 2014, held in Hangzhou, China, in September 2014. It contains 12 selected peer-reviewed papers and a report from the TPC Public Relations Committee.

Many buyers use TPC benchmark results as points of comparison when purchasing new computing systems. The information technology landscape is evolving at a rapid pace, challenging industry experts and researchers to develop innovative techniques for the evaluation, measurement, and characterization of complex systems. The TPC remains committed to developing new benchmark standards to keep pace, and one vehicle for achieving this objective is the sponsorship of the Technology Conference on Performance Evaluation and Benchmarking (TPCTC). Over the last five years, TPCTC has been held successfully in conjunction with VLDB.



Introducing TPCx-HS: The First Industry Standard for Benchmarking Big Data Systems

The designation Big Data has become a mainstream buzz phrase across many industries as well as research circles. Today many companies are making performance claims that are not easily verifiable and comparable in the absence of a neutral industry benchmark. Instead, one of the test suites used to compare the performance of Hadoop-based Big Data systems is TeraSort. While it nicely defines the data set and tasks to measure Big Data Hadoop systems, it lacks a formal specification and enforcement rules that enable the comparison of results across systems. In this paper we introduce TPCx-HS, the first industry standard benchmark designed to stress both the hardware and the software of systems based on Apache HDFS API compatible distributions. TPCx-HS extends the workload defined in TeraSort with formal rules for implementation, execution, metric, result verification, publication, and pricing. It can be used to assess a broad range of system topologies and implementation methodologies of Big Data Hadoop systems in a technically rigorous, directly comparable, and vendor-neutral manner.
Raghunath Nambiar, Meikel Poess, Akon Dey, Paul Cao, Tariq Magdon-Ismail, Da Qi Ren, Andrew Bond

An Evaluation of Alternative Physical Graph Data Designs for Processing Interactive Social Networking Actions

This study quantifies the tradeoff associated with alternative physical representations of a social graph for processing interactive social networking actions. We conduct this evaluation using a graph data store named Neo4j deployed in a client-server (REST) architecture using the BG benchmark. In addition to the average response time of a design, we quantify its SoAR defined as the highest observed throughput given the following service level agreement: 95 % of actions to observe a response time of 100 ms or faster. For an action such as computing the shortest distance between two members, we observe a tradeoff between speed and accuracy of the computed result. With this action, a relational data design provides a significantly faster response time than a graph design. The graph designs provide a higher SoAR than a relational one when the social graph includes large member profile images stored in the data store.
Shahram Ghandeharizadeh, Reihane Boghrati, Sumita Barahmand
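
The SoAR definition in the abstract above can be illustrated with a minimal sketch. The function names, the input layout (a mapping from observed throughput to the response times measured at that throughput), and the sample numbers in the usage note are assumptions made for illustration; they are not taken from the BG benchmark itself.

```python
def meets_sla(response_times_ms, percentile=95.0, limit_ms=100.0):
    """Return True if at least `percentile` % of actions finished within `limit_ms`."""
    if not response_times_ms:
        return False
    within = sum(1 for t in response_times_ms if t <= limit_ms)
    return within * 100.0 >= percentile * len(response_times_ms)


def soar(runs, percentile=95.0, limit_ms=100.0):
    """Highest observed throughput whose run satisfies the SLA.

    `runs` maps an observed throughput (actions per second) to the list of
    response times (ms) recorded during that run. Returns 0 if no run qualifies.
    """
    qualifying = [
        throughput
        for throughput, times in runs.items()
        if meets_sla(times, percentile, limit_ms)
    ]
    return max(qualifying, default=0)
```

With hypothetical runs at 100, 200, and 300 actions per second, where only the first two keep 95 % of actions at or below 100 ms, `soar` would report 200.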

On Characterizing the Performance of Distributed Graph Computation Platforms

Graphs are widely used for modeling complicated data in different application domains such as social networks, protein networks, transportation networks, bibliographical networks, knowledge bases and many more. Currently, graphs with millions and billions of nodes and edges have become very common. Therefore, designing scalable systems for processing and analyzing large scale graphs has become one of the most timely problems facing the big data research community. In practice, distributed processing of large scale graphs is a challenging task due to their size in addition to their inherent irregular structure and the iterative nature of graph processing and computation algorithms. In recent years, several distributed graph processing systems have been presented, most notably Pregel and GraphLab, to tackle this challenge. In particular, both systems use a vertex-centric computation model which enables the user to design a program that is executed locally for each vertex in parallel. In this paper, we analyze the performance characteristics of distributed graph processing systems and provide an experimental comparison on the performance of two popular systems in this area.
Ahmed Barnawi, Omar Batarfi, Seyed-Mehdi-Reza Behteshi, Radwa Elshawi, Ayman Fayoumi, Reza Nouri, Sherif Sakr
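
The vertex-centric model mentioned in the abstract above can be sketched as a single-process simulation. This is not the Pregel or GraphLab API; the function name, the superstep loop, and the choice of connected-component labeling as the example program are assumptions chosen only to show how a per-vertex program exchanges messages between supersteps.

```python
def vertex_centric_min_label(edges, max_supersteps=50):
    """Label every vertex with the smallest vertex id in its connected component."""
    # Build an undirected adjacency structure.
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    label = {v: v for v in neighbors}

    # Superstep 0: every vertex sends its own id to its neighbors.
    inbox = {v: [] for v in neighbors}
    for v in neighbors:
        for n in neighbors[v]:
            inbox[n].append(label[v])

    for _ in range(max_supersteps):
        outbox = {v: [] for v in neighbors}
        active = False
        # Each iteration of this loop is the per-vertex program; it reads only
        # the vertex's own state and inbox, so it could run in parallel.
        for v in neighbors:
            if inbox[v]:
                best = min(inbox[v])
                if best < label[v]:
                    label[v] = best
                    for n in neighbors[v]:
                        outbox[n].append(best)
                    active = True
        if not active:
            break  # no vertex changed state, so the computation halts
        inbox = outbox  # messages become visible in the next superstep
    return label
```

The per-vertex body never touches another vertex's state directly, which is what lets a distributed platform partition the vertices across machines and synchronize only at superstep boundaries.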

Discussion of BigBench: A Proposed Industry Standard Performance Benchmark for Big Data

Enterprises perceive a huge opportunity in mining information that can be found in big data. New storage systems and processing paradigms are allowing for ever larger data sets to be collected and analyzed. The high demand for data analytics and rapid development in technologies have led to a sizable ecosystem of big data processing systems. However, the lack of established, standardized benchmarks makes it difficult for users to choose the appropriate systems that suit their requirements. To address this problem, we have developed the BigBench benchmark specification. BigBench is the first end-to-end big data analytics benchmark suite. In this paper, we present the BigBench benchmark and analyze the workload from a technical as well as a business point of view. We characterize the queries in the workload along different dimensions, according to their functional characteristics, and also analyze their runtime behavior. Finally, we evaluate the suitability and relevance of the workload from the point of view of enterprise applications, and discuss potential extensions to the proposed specification in order to cover typical big data processing use cases.
Chaitanya Baru, Milind Bhandarkar, Carlo Curino, Manuel Danisch, Michael Frank, Bhaskar Gowda, Hans-Arno Jacobsen, Huang Jie, Dileep Kumar, Raghunath Nambiar, Meikel Poess, Francois Raab, Tilmann Rabl, Nishkam Ravi, Kai Sachs, Saptak Sen, Lan Yi, Choonhan Youn

A Scalable Framework for Universal Data Generation in Parallel

Nowadays, more and more companies, such as Amazon and Twitter, are facing the big data problem, which requires higher performance to manage tremendously large data sets. Data management systems with new architectures that take full advantage of computer hardware are emerging, with the aim of maximizing system performance and fulfilling customers' current and even future requirements. Testing the performance and confirming the suitability of a new data management system therefore becomes a primary task for these companies. Hence, effectively generating a scaled data set with the desired volume and velocity becomes an imperative problem, together with the goal of preserving as many characteristics of the real data set as possible (realism). In this paper, we propose PSUG to generate a realistic database with the required volume and velocity in a scalable, parallel manner. Our extensive experimental studies confirm the efficiency and effectiveness of the proposed method.
Ling Gu, Minqi Zhou, Qiangqiang Kang, Aoying Zhou

Towards an Extensible Middleware for Database Benchmarking

Today’s database benchmarks are designed to evaluate a particular type of database. Furthermore, popular benchmarks, like those from TPC, come without a ready-to-use implementation, requiring database benchmark users to implement the benchmarking tool from scratch. As a result, there is no single framework that can be used to compare arbitrary database systems. The primary reason for this, among others, is the complexity of designing and implementing distributed benchmarking tools.
In this paper, we describe our vision of a middleware for database benchmarking which eliminates the complexity and difficulty of designing and running arbitrary benchmarks: workload specification and interface mappers for the system under test should be nothing but configuration properties of the middleware. We also sketch out an architecture for this benchmarking middleware and describe the main components and their requirements.
David Bermbach, Jörn Kuhlenkamp, Akon Dey, Sherif Sakr, Raghunath Nambiar

Scaling Up Mixed Workloads: A Battle of Data Freshness, Flexibility, and Scheduling

The common “one size does not fit all” paradigm isolates transactional and analytical workloads into separate, specialized database systems. Operational data is periodically replicated to a data warehouse for analytics. Competitiveness of enterprises today, however, depends on real-time reporting on operational data, necessitating an integration of transactional and analytical processing in a single database system. The mixed workload should be able to query and modify common data in a shared schema. The database needs to provide performance guarantees for transactional workloads, and, at the same time, efficiently evaluate complex analytical queries. In this paper, we share our analysis of the performance of two main-memory databases that support mixed workloads, SAP HANA and HyPer, while evaluating the mixed workload CH-benCHmark. By examining their similarities and differences, we identify the factors that affect performance while scaling the number of concurrent transactional and analytical clients. The three main factors are (a) data freshness, i.e., how recent is the data processed by analytical queries, (b) flexibility, i.e., restricting transactional features in order to increase optimization choices and enhance performance, and (c) scheduling, i.e., how the mixed workload utilizes resources. Specifically for scheduling, we show that the absence of workload management under cases of high concurrency leads to analytical workloads overwhelming the system and severely hurting the performance of transactional workloads.
Iraklis Psaroudakis, Florian Wolf, Norman May, Thomas Neumann, Alexander Böhm, Anastasia Ailamaki, Kai-Uwe Sattler

Parameter Curation for Benchmark Queries

In this paper we consider the problem of generating parameters for benchmark queries so these have stable behavior despite being executed on datasets (real-world or synthetic) with skewed data distributions and value correlations. We show that uniform random sampling of the substitution parameters is not well suited for such benchmarks, since it results in unpredictable runtime behavior of queries. We present our approach of Parameter Curation with the goal of selecting parameter bindings that have consistently low-variance intermediate query result sizes throughout the query plan. Our solution is illustrated with IMDB data and the recently proposed LDBC Social Network Benchmark (SNB).
Andrey Gubichev, Peter Boncz
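
The idea of selecting parameter bindings with low-variance intermediate result sizes can be sketched as follows. This is a deliberate simplification, not the authors' algorithm: it assumes a single, precomputed intermediate result size per candidate parameter, whereas the abstract above speaks of sizes throughout the query plan. The function name and input layout are hypothetical.

```python
from statistics import pvariance


def curate_parameters(result_sizes, k):
    """Pick the k parameter values whose result sizes have the lowest variance.

    `result_sizes` maps each candidate parameter value to the (precomputed)
    intermediate result size it produces. For one-dimensional sizes, the
    minimum-variance subset of size k is contiguous in sorted order, so a
    sliding window over the sorted candidates is sufficient.
    """
    if k <= 0 or k > len(result_sizes):
        raise ValueError("k must be between 1 and the number of candidates")
    ranked = sorted(result_sizes.items(), key=lambda item: item[1])
    best_window, best_var = None, None
    for start in range(len(ranked) - k + 1):
        window = ranked[start:start + k]
        var = pvariance([size for _, size in window])
        if best_var is None or var < best_var:
            best_window, best_var = window, var
    return [param for param, _ in best_window]
```

Uniform random sampling over the same candidates could pick bindings whose result sizes differ by orders of magnitude; the window selection keeps only bindings that behave alike.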

Downtime-Free Live Migration in a Multitenant Database

Multitenant databases provide database services to a large number of users, called tenants. In such environments, an efficient management of resources is essential for providers of these services in order to minimize their capital as well as operational costs. This is typically achieved by dynamic sharing of resources between tenants depending on their current demand, which allows providers to oversubscribe their infrastructure and increase the density (the number of supported tenants) of their database deployment. In order to react quickly to variability in demand and provide consistent quality of service to all tenants, a multitenant database must be very elastic and able to reallocate resources between tenants at a low cost and with minimal disruption. While some existing database and virtualization technologies accomplish this fairly well for resources within a server, the cost of migrating a tenant to a different server often remains high. We present an efficient technique for live migration of database tenants in a shared-disk architecture which imposes no downtime on the migrated tenant and reduces the amount of data to be copied to a minimum. We achieve this by gradually migrating database connections from the source to the target node of a database cluster using a self-adapting algorithm that minimizes performance impact for the migrated tenant. As part of the migration, only frequently accessed cache content is transferred from the source to the target server, while database integrity is guaranteed at all times. We thoroughly analyze the performance characteristics of this technique through experimental evaluation using various database workloads and parameters, and demonstrate that even databases with a size of 100 GB executing 2500 transactions per second can be migrated at a minimal cost with no downtime or failed transactions.
Nicolas Michael, Yixiao Shen

Performance Analysis of Database Virtualization with the TPC-VMS Benchmark

TPC-VMS is a benchmark designed to measure the performance of virtualized databases using existing, time-tested TPC workloads. In this paper, we will present our experience in using the TPC-E workload under the TPC-VMS rules to measure the performance of 3 OLTP databases consolidated onto a single server. We will describe the tuning steps employed to more than double the performance and reach 98.6 % of the performance of a non-virtualized server – if we aggregate the throughputs of the 3 VMs for quantifying the tuning process. The paper will detail lessons learned in optimizing performance by tuning the application, the database manager, the guest operating system, the hypervisor, and the hardware on both AMD and Intel processors.
Since TPC-E results have been disclosed with non-virtualized databases on both platforms, we can analyze the performance overheads of virtualization for database workloads. With a native-virtual performance gap of just a few percentage points, we will show that virtualized servers make excellent platforms for the most demanding database workloads.
Eric Deehr, Wen-Qi Fang, H. Reza Taheri, Hai-Fang Yun

A Query, a Minute: Evaluating Performance Isolation in Cloud Databases

Several cloud providers offer relational databases as part of their portfolio. It is, however, not obvious how resource virtualization and sharing, which are inherent to cloud computing, influence the performance and predictability of these cloud databases.
Cloud providers give little to no guarantees for consistent execution or isolation from other users. To evaluate the performance isolation capabilities of two commercial cloud databases, we ran a series of experiments over the course of a week (a query, a minute) and report variations in query response times. As a baseline, we ran the same experiments on a dedicated server in our data center. The results show that in the cloud single outliers are up to 31 times slower than the average. Additionally, one can see a point in time after which the average performance of all executed queries improves by 38 %.
Tim Kiefer, Hendrik Schön, Dirk Habich, Wolfgang Lehner

Composite Key Generation on a Shared-Nothing Architecture

Generating synthetic data sets is integral to benchmarking, debugging, and simulating future scenarios. As data sets become larger, real data characteristics thereby become necessary for the success of new algorithms. Recently introduced software systems allow for synthetic data generation that is truly parallel. These systems use fast pseudorandom number generators and can handle complex schemas and uniqueness constraints on single attributes. Uniqueness is essential for forming keys, which identify single entries in a database instance. The uniqueness property is usually guaranteed by sampling from a uniform distribution and adjusting the sample size to the output size of the table such that there are no collisions. However, when it comes to real composite keys, where only the combination of the key attributes has the uniqueness property, a different strategy needs to be employed. In this paper, we present a novel approach to generating composite keys within a parallel data generation framework. We compute a joint probability distribution that incorporates the distributions of the key attributes and use the unique sequence positions of entries to address distinct values in the key domain.
Marie Hoffmann, Alexander Alexandrov, Periklis Andritsos, Juan Soto, Volker Markl
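
The principle of using a unique sequence position to address a distinct value in the key domain can be sketched with a mixed-radix decomposition. This covers only the simplest case and is an assumption made for illustration: it ignores the joint probability distribution described in the abstract above, and the function name and domain sizes are hypothetical.

```python
def composite_key(position, domain_sizes):
    """Map a unique sequence position to a unique tuple of attribute values.

    `domain_sizes[i]` is the number of distinct values of key attribute i.
    Distinct positions always yield distinct tuples, so parallel workers can
    generate keys for disjoint position ranges without coordination.
    """
    total = 1
    for size in domain_sizes:
        total *= size
    if not 0 <= position < total:
        raise ValueError("position is outside the composite key domain")
    key = []
    # Peel off one attribute value at a time, last attribute first.
    for size in reversed(domain_sizes):
        position, digit = divmod(position, size)
        key.append(digit)
    return tuple(reversed(key))
```

Because the mapping is a bijection between positions and tuples, no collision check or shared state between workers is needed to guarantee uniqueness of the combined key.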

