2017 | Book

Data Management on New Hardware

7th International Workshop on Accelerating Data Analysis and Data Management Systems Using Modern Processor and Storage Architectures, ADMS 2016 and 4th International Workshop on In-Memory Data Management and Analytics, IMDM 2016, New Delhi, India, September 1, 2016, Revised Selected Papers

Editors: Spyros Blanas, Rajesh Bordawekar, Tirthankar Lahiri, Justin Levandoski, Andrew Pavlo

Publisher: Springer International Publishing

Book Series: Lecture Notes in Computer Science


About this book

This book contains selected papers from the 7th International Workshop on Accelerating Data Analysis and Data Management Systems Using Modern Processor and Storage Architectures, ADMS 2016, and the 4th International Workshop on In-Memory Data Management and Analytics, IMDM 2016, held in New Delhi, India, in September 2016. The joint workshops were co-located with VLDB 2016. The 9 papers presented were carefully reviewed and selected from 18 submissions. They investigate opportunities for accelerating analytics/data management systems and workloads (including traditional OLTP, data warehousing/OLAP, ETL, streaming/real-time, business analytics, and XML/RDF processing) running in memory-only environments, using processors (e.g., commodity and specialized multi-core, GPUs, and FPGAs), storage systems (e.g., storage-class memories like SSDs and phase-change memory), and hybrid programming models like CUDA, OpenCL, and OpenACC. The papers also explore the interplay between overall system design, core algorithms, query optimization strategies, programming approaches, and performance modeling and evaluation, from the perspective of data management applications.

Table of Contents

Frontmatter
Cache-Sensitive Skip List: Efficient Range Queries on Modern CPUs
Abstract
Due to ever falling prices and advancements in chip technologies, many of today’s databases can be kept entirely in main memory. However, reusing existing disk-based index structures for managing data in memory leads to suboptimal performance due to inefficient cache usage and negligence of the capabilities of modern CPUs. Accordingly, a number of main-memory optimized index structures have been proposed, yet most of them focus entirely on single-key lookups, neglecting the equally important range queries. We present the Cache-Sensitive Skip List (CSSL), a novel index structure that is optimized for range queries and exploits modern CPUs. CSSL is based on a cache-friendly data layout and a traversal algorithm that minimizes cache misses and branch mispredictions and exploits SIMD instructions for search. In our experiments, CSSL’s range query performance surpasses all competitors significantly. Even for lookups, it is only surpassed by the recently presented ART index structure. We therefore see CSSL as a serious alternative for mixed key/range workloads on main-memory databases.
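The following minimal C++/AVX2 sketch (my own illustration, not the paper’s code; names like lane_rank are invented) shows the flavor of branch-free SIMD search that a layout like CSSL’s enables: ranking a key within one contiguous, sorted fast-lane segment with a single compare-and-popcount, with no branches and no pointer chasing. Compile with -mavx2.

```cpp
// Hypothetical sketch: branch-free SIMD search within one sorted "fast lane"
// segment, in the spirit of CSSL's layout (names are mine, not the paper's).
#include <immintrin.h>  // AVX2 intrinsics
#include <cstdint>
#include <cstdio>

// Count how many of the 8 sorted keys in `lane` are smaller than `key`.
// In a sorted segment this count is exactly the descend position, so the
// search needs no branches and no pointer chasing.
static int lane_rank(const int32_t* lane, int32_t key) {
    __m256i keys  = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(lane));
    __m256i probe = _mm256_set1_epi32(key);
    __m256i lt    = _mm256_cmpgt_epi32(probe, keys);        // lane[i] < key ?
    int mask = _mm256_movemask_ps(_mm256_castsi256_ps(lt)); // one bit per lane
    return __builtin_popcount(mask);                        // rank of key
}

int main() {
    // One cache-line-sized segment of a fast lane: sorted and contiguous.
    alignas(32) int32_t lane[8] = {3, 7, 12, 19, 25, 31, 40, 57};
    printf("descend at slot %d\n", lane_rank(lane, 26));    // prints 5
    return 0;
}
```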
Stefan Sprenger, Steffen Zeuch, Ulf Leser
Exploit Every Cycle: Vectorized Time Series Algorithms on Modern Commodity CPUs
Abstract
Many time series algorithms reduce computation cost by pruning unpromising candidates with lower-bound distance functions. In this paper, we focus on an orthogonal research direction that further boosts performance by unlocking the potential of modern commodity CPUs. First, we profile existing algorithms to understand where the time goes. Second, we design vectorized implementations of lower-bound and distance functions that exploit characteristics of modern CPUs (e.g., data parallelism, caching, branch prediction). Third, our vectorized methods are general and applicable to many time series problems such as subsequence search, motif discovery, and kNN classification. Our experimental study on real datasets shows that our proposal achieves speedups of up to 6×.
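As a hedged illustration of the trade-off the abstract describes (not the authors’ implementation; function names are mine), the sketch below contrasts the classic early-abandoning Euclidean distance, whose data-dependent branch blocks SIMD, with a branch-free variant the compiler can auto-vectorize.

```cpp
// Illustrative sketch: early abandoning vs. vectorizable distance computation.
#include <cstddef>
#include <vector>
#include <cstdio>

// Early abandoning: cheap in the scalar world, but the `if` inside the loop
// is a data-dependent branch that prevents vectorization.
double dist_early_abandon(const double* x, const double* y, size_t n, double best) {
    double d = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double diff = x[i] - y[i];
        d += diff * diff;
        if (d >= best) return d;   // abandon: kills SIMD
    }
    return d;
}

// Vectorizable: no branches, a single reduction the compiler can spread
// across SIMD lanes (e.g., -O3 -march=native plus -ffast-math or an
// OpenMP simd reduction to allow reassociating the floating-point sum).
double dist_vectorized(const double* x, const double* y, size_t n) {
    double d = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double diff = x[i] - y[i];
        d += diff * diff;
    }
    return d;
}

int main() {
    std::vector<double> a(1024, 1.0), b(1024, 3.0);
    printf("%.1f\n", dist_vectorized(a.data(), b.data(), a.size())); // 4096.0
}
```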
Bo Tang, Man Lung Yiu, Yuhong Li, Leong Hou U
Compression-Aware In-Memory Query Processing: Vision, System Design and Beyond
Abstract
In-memory database systems have to keep base data as well as intermediate results generated during query processing in main memory. Moreover, accessing intermediate results is as expensive as accessing the base data, so optimizing intermediate results has a high impact on query execution performance. For this domain, we propose the continuous use of lightweight compression methods for intermediate results, with the aim of developing a balanced query processing approach based on compressed intermediates. To minimize overall query execution time, it is important to find a balance between the reduced transfer times and the increased computational effort. This paper provides an overview and presents a system design for our vision. Our system design addresses the challenge of integrating a large and evolving corpus of lightweight data compression algorithms into an in-memory column store. In detail, we present our model-driven approach and describe ongoing research topics toward realizing our compression-aware query processing vision.
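To make “lightweight compression of intermediates” concrete, here is a minimal run-length encoding sketch (purely illustrative; the paper’s corpus of algorithms is much broader, and the Run type is invented): an intermediate column is compressed into (value, run) pairs, and a downstream aggregation consumes the runs directly, trading a little CPU for less memory traffic.

```cpp
// Hedged sketch of one lightweight scheme: run-length encoding an
// intermediate column so the next operator moves (value, run) pairs
// instead of raw tuples. Names and types are illustrative.
#include <cstdint>
#include <vector>
#include <cstdio>

struct Run { int32_t value; uint32_t length; };

// Compress a materialized intermediate result; pays a little CPU to
// shrink what the next operator has to read from memory.
std::vector<Run> rle_compress(const std::vector<int32_t>& col) {
    std::vector<Run> out;
    for (int32_t v : col) {
        if (!out.empty() && out.back().value == v) ++out.back().length;
        else out.push_back({v, 1});
    }
    return out;
}

// Downstream operators can often consume runs directly, e.g. a SUM:
int64_t sum_runs(const std::vector<Run>& runs) {
    int64_t s = 0;
    for (const Run& r : runs) s += (int64_t)r.value * r.length;
    return s;
}

int main() {
    std::vector<int32_t> intermediate = {5, 5, 5, 9, 9, 2, 2, 2, 2};
    auto runs = rle_compress(intermediate);
    printf("%zu runs, sum=%lld\n", runs.size(), (long long)sum_runs(runs));
}
```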
Juliana Hildebrandt, Dirk Habich, Patrick Damme, Wolfgang Lehner
Overtaking CPU DBMSes with a GPU in Whole-Query Analytic Processing with Parallelism-Friendly Execution Plan Optimization
Abstract
Existing work on accelerating analytic DB query processing with (discrete) GPUs fails to fully realize their potential for speedup through parallelism: Published results do not achieve significant speedup over more performant CPU-only DBMSes when processing complete queries.
This paper presents a successful effort to better meet this challenge, in the form of a proof-of-concept query processing framework. The framework constitutes a graft onto an existing DBMS, altering some parts of it and replacing its execution engine entirely. It intensively refactors query execution plans, making them better-parallelizable, before executing them on either a CPU or on GPU. This results in a significant speedup even on a CPU, and a further speedup when using a GPU, over the chosen host DBMS (MonetDB) — which itself already bests most published results utilizing a GPU for query processing.
Finally, we outline some concrete future improvements on our results which can cut processing time by half and possibly much more.
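As a toy example of what “parallelism-friendly” plan refactoring can mean (invented for illustration; the paper’s rewrites are far richer), the sketch below fuses a chain of selection operators into one branch-free pass in which every element is independent, so the loop maps cleanly onto SIMD lanes or GPU threads.

```cpp
// Illustrative sketch only: fusing a chain of selections into one
// data-parallel pass, a simple example of a parallelism-friendly rewrite.
#include <cstdint>
#include <vector>
#include <cstdio>

// Before: operator-at-a-time, one materialized bitmap per predicate.
// After: one fused pass; each element is independent, so the loop
// parallelizes trivially (OpenMP, SIMD, or a CUDA grid).
std::vector<uint8_t> fused_selection(const std::vector<int32_t>& a,
                                     const std::vector<int32_t>& b) {
    std::vector<uint8_t> sel(a.size());
    for (size_t i = 0; i < a.size(); ++i) {
        // Branch-free conjunction of both predicates.
        sel[i] = (uint8_t)((a[i] > 10) & (b[i] < 100));
    }
    return sel;
}

int main() {
    std::vector<int32_t> a = {5, 20, 30}, b = {50, 200, 60};
    auto sel = fused_selection(a, b);
    for (auto s : sel) printf("%d ", s);  // 0 0 1
    printf("\n");
}
```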
Adnan Agbaria, David Minor, Natan Peterfreund, Eyal Rozenberg, Ofer Rosenberg
To Copy or Not to Copy: Making In-Memory Databases Fast on Modern NICs
Abstract
When databases resided primarily on disks, the problem of data layout was focused on structures that enabled efficient reads and writes from that medium, as well as the effective use of main memory as a cache. With in-memory databases, this part of the I/O problem is largely gone, but that does not mean that I/O considerations can be completely ignored. We argue for the importance of considering the “other” side of the I/O equation, namely network communication with clients, when designing in-memory databases.
Modern NICs include a number of optimizations intended to improve I/O performance, including kernel bypass, zero-copy, and scatter-gather DMA. Applied carefully, these features can reduce the involvement of the CPU in network transfers and can save memory bandwidth. However, our experiments show that some optimizations do not always provide a benefit in a database context, and using them can be tricky, as they affect strategies for processing updates and managing data lifetimes. In this paper, we explore the application of zero-copy NIC DMA to in-memory databases and examine how the NIC can influence and explicitly leverage data layout and concurrency control. We apply these results to the Bw-Tree structure by proposing a client-assisted design for transmitting large range scan results. Overall, the approach boosts throughput by 1.7× and reduces CPU overhead by 75% compared to simple zero-copy DMA.
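The NIC- and kernel-bypass-specific details aside, standard POSIX writev() already demonstrates the scatter-gather idea the abstract builds on. This hedged sketch (my own, with an invented ScanHeader type) transmits a response header plus records that stay in place in database memory, with no copy into a contiguous staging buffer.

```cpp
// Hedged analogy: POSIX writev() shows the same scatter-gather idea that
// NIC gather DMA provides, without the NIC-specific kernel-bypass setup.
#include <sys/uio.h>   // writev, struct iovec
#include <unistd.h>
#include <cstdint>

struct ScanHeader { uint32_t record_count; uint32_t record_size; };

// Send a range-scan result of fixed-size records without copying them.
// With zero-copy NIC DMA the records must stay live (not reused or freed)
// until the transfer completes -- the lifetime issue the paper studies.
ssize_t send_scan_result(int fd, const ScanHeader& hdr,
                         const void* records, size_t bytes) {
    iovec iov[2];
    iov[0].iov_base = const_cast<ScanHeader*>(&hdr);
    iov[0].iov_len  = sizeof(hdr);
    iov[1].iov_base = const_cast<void*>(records);  // in-place payload
    iov[1].iov_len  = bytes;
    return writev(fd, iov, 2);   // kernel gathers both pieces
}

int main() {
    uint64_t records[4] = {1, 2, 3, 4};
    ScanHeader hdr{4, sizeof(uint64_t)};
    send_scan_result(STDOUT_FILENO, hdr, records, sizeof(records));
    return 0;
}
```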
Aniraj Kesavan, Robert Ricci, Ryan Stutsman
DBMS Data Loading: An Analysis on Modern Hardware
Abstract
Data loading has traditionally been considered a “one-time deal”, an offline process out of the critical path of query execution, and the architecture of DBMSs is aligned with this assumption. Nevertheless, the rate at which data is produced and gathered nowadays has nullified the “one-off” assumption and has turned data loading into a major bottleneck of the data analysis pipeline.

This paper analyzes the behavior of modern DBMSs in order to quantify their ability to fully exploit multicore processors and modern storage hardware during data loading. We examine multiple state-of-the-art DBMSs, a variety of hardware configurations, and a combination of synthetic and real-world datasets to identify bottlenecks in the data loading process and to provide guidelines on how to accelerate data loading. Our findings show that modern DBMSs are unable to saturate the available hardware resources, and we identify opportunities to accelerate data loading.
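A minimal sketch of the parallelism the paper finds underexploited (illustrative only): split the input into byte ranges and parse them on all cores instead of streaming through a single loader thread. Chunking is simplified here; a real loader must realign chunk borders to record boundaries.

```cpp
// Toy parallel loader: byte-range chunks parsed by one thread per core.
// Records straddling chunk borders are ignored here for brevity.
#include <thread>
#include <vector>
#include <atomic>
#include <cstdio>

std::atomic<long> tuples_loaded{0};

void parse_chunk(const char* data, size_t begin, size_t end) {
    long local = 0;
    for (size_t i = begin; i < end; ++i)
        if (data[i] == '\n') ++local;       // stand-in for real tuple parsing
    tuples_loaded += local;
}

int main() {
    static const char input[] = "a,1\nb,2\nc,3\nd,4\n";
    size_t n = sizeof(input) - 1;
    unsigned workers = std::thread::hardware_concurrency();
    if (workers == 0) workers = 4;
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        size_t begin = n * w / workers, end = n * (w + 1) / workers;
        pool.emplace_back(parse_chunk, input, begin, end);
    }
    for (auto& t : pool) t.join();
    printf("loaded %ld tuples\n", tuples_loaded.load());  // 4
}
```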
Adam Dziedzic, Manos Karpathiotakis, Ioannis Alagiannis, Raja Appuswamy, Anastasia Ailamaki
Locality-Adaptive Parallel Hash Joins Using Hardware Transactional Memory
Abstract
Previous work [1] has claimed that the best performing implementation of in-memory hash joins is based on (radix-)partitioning of the build-side input. Indeed, despite the overhead of partitioning, the benefits from increased cache-locality and synchronization free parallelism in the build-phase outweigh the costs when the input data is randomly ordered. However, many datasets already exhibit significant spatial locality (i.e., non-randomness) due to the way data items enter the database: through periodic ETL or trickle loaded in the form of transactions. In such cases, the first benefit of partitioning — increased locality — is largely irrelevant. In this paper, we demonstrate how hardware transactional memory (HTM) can render the other benefit, freedom from synchronization, irrelevant as well.
Specifically, using careful analysis and engineering, we develop an adaptive hash join implementation that outperforms parallel radix-partitioned hash joins as well as sort-merge joins on data with high spatial locality. In addition, we show how, through lightweight (less than 1% overhead) runtime monitoring of the transaction abort rate, our implementation can detect inputs with low spatial locality and dynamically fall back to radix-partitioning of the build-side input. The result is a hash join implementation that is more than 3 times faster than the state-of-the-art on high-locality data and never more than 1% slower.
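The sketch below illustrates the HTM mechanism with Intel RTM intrinsics (compile with -mrtm; the hash table and constants are invented for the example, not the paper’s code): each insert runs as a hardware transaction with a spinlock fallback, and counting aborts provides exactly the kind of cheap runtime locality signal the abstract describes.

```cpp
// Hedged sketch: synchronization-free hash-table inserts via hardware
// transactions (Intel RTM), with abort counting as a locality monitor.
#include <immintrin.h>   // _xbegin, _xend, _xabort, _XBEGIN_STARTED (-mrtm)
#include <atomic>
#include <cstdint>
#include <cstdio>

constexpr size_t SLOTS = 1 << 10;
int32_t keys[SLOTS];                      // 0 = empty; tiny fixed-size table
std::atomic<bool> fallback_lock{false};
std::atomic<long> aborts{0};              // the lightweight locality signal

void insert(int32_t key) {
    size_t h = (uint32_t)key * 2654435761u % SLOTS;
    for (int attempt = 0; attempt < 3; ++attempt) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            if (fallback_lock.load(std::memory_order_relaxed))
                _xabort(0xff);            // subscribe to the fallback lock
            while (keys[h] != 0) h = (h + 1) % SLOTS;  // linear probing
            keys[h] = key;                // conflicting writers abort here
            _xend();
            return;
        }
        ++aborts;  // high abort rate => low locality => fall back to radix
    }
    // Rare slow path: plain spinlock.
    while (fallback_lock.exchange(true, std::memory_order_acquire)) {}
    while (keys[h] != 0) h = (h + 1) % SLOTS;
    keys[h] = key;
    fallback_lock.store(false, std::memory_order_release);
}

int main() {
    for (int32_t k = 1; k <= 100; ++k) insert(k);
    printf("aborts observed: %ld\n", aborts.load());
}
```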
Anil Shanbhag, Holger Pirk, Sam Madden
SwingDB: An Embedded In-memory DBMS Enabling Instant Snapshot Sharing
Abstract
Data transmission between an in-memory DBMS and a data analysis program is usually slow, partially due to the inadequate IPC support of modern operating systems. In this paper, we present SWING, a novel OS-level inter-process data sharing mechanism that allows processes to share physical memory through an instant system call. Based on SWING, we develop an embedded in-memory DBMS called SwingDB, which enables data analysis applications to access databases in their own memory space instead of resorting to traditional inter-process communication. Extensive experiments demonstrate the advantage of such a DBMS-OS co-design.
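SWING itself is a new system call, so as a rough analogy only, this sketch uses standard POSIX shared memory to show the underlying idea: the same physical pages are mapped into a second address space, letting an analytics process read a snapshot in place rather than receiving a copy over a socket or pipe. (Illustrative, not SwingDB code; the snapshot name is invented.)

```cpp
// POSIX shared-memory analogy for in-place snapshot sharing. Both "sides"
// run in one process here for brevity; normally they are separate processes.
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

int main() {
    const size_t SIZE = 4096;
    // DBMS side: publish a snapshot under a well-known name.
    int fd = shm_open("/swing_demo_snapshot", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, SIZE);
    char* db_view = (char*)mmap(nullptr, SIZE, PROT_READ | PROT_WRITE,
                                MAP_SHARED, fd, 0);
    strcpy(db_view, "snapshot: 42 tuples");

    // Analytics side: map the same physical pages and read them directly,
    // with no serialization and no copy.
    char* app_view = (char*)mmap(nullptr, SIZE, PROT_READ, MAP_SHARED, fd, 0);
    printf("analytics sees: %s\n", app_view);

    munmap(db_view, SIZE); munmap(app_view, SIZE);
    close(fd); shm_unlink("/swing_demo_snapshot");
}
```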
Qingzhong Meng, Xuan Zhou, Shiping Chen, Shan Wang
Runtime Fragility in Main Memory
Abstract
In this paper we investigate the following problem: Given a database workload (tables and queries), which data layout (row, column or a suitable PAX-layout) should we choose in order to get the best possible performance? We show that this is not an easy problem. We explore careful combinations of various parameters that have an impact on the performance including: (1) the schema, (2) the CPU architecture, (3) the compiler, and (4) the optimization level. We include a CPU from each of the past 4 generations of Intel CPUs.
In addition, we demonstrate the importance of taking variance into account when deciding on the optimal storage layout. We observe considerable variance throughout our measurements which makes it difficult to argue along means over different runs of an experiment. Therefore, we compute confidence intervals for all measurements and exploit this to detect outliers and define classes of methods that we are not allowed to distinguish statistically. The variance of different performance measurements can be so significant that the optimal solution may not be the best one in practice.
Our results also indicate that a carefully chosen or ill-chosen compilation setup can trigger a performance gain or loss ranging from a factor of 1.1 up to a factor of 25, even in the simplest workloads: a table with four attributes and a simple query reading those attributes. This observation is not caused by variance in the measured runtimes, but by the use of different compiler setups.
Besides the compilation setup, the data layout is another source of query time fragility. Various size metrics of the memory subsystem are round numbers in binary, or put more simply: powers of 2 in decimal. System engineers have followed this tradition over time. Surprisingly, there exists a use-case in query processing where using powers of 2 is always a suboptimal choice, leading to one more cause of fragile query times. Using this finding, we will show how to improve tuple-reconstruction costs by using a novel main-memory data-layout.
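A toy illustration of the powers-of-2 pitfall (my own, not the paper’s layout): with a power-of-two row stride, the same attribute of consecutive tuples maps to the same few cache sets, so scanning it thrashes the cache; padding the stride by one cache line breaks the conflict pattern at the cost of some space.

```cpp
// Toy demo: power-of-two vs. padded strides during attribute scans.
#include <vector>
#include <cstdint>
#include <cstdio>

int64_t scan_attribute(const std::vector<int32_t>& heap,
                       size_t stride, size_t tuples) {
    int64_t sum = 0;
    for (size_t t = 0; t < tuples; ++t)
        sum += heap[t * stride];              // attribute 0 of each tuple
    return sum;
}

int main() {
    const size_t TUPLES = 1 << 16;
    const size_t POW2_STRIDE   = 1024;        // 4 KiB apart: conflict-prone
    const size_t PADDED_STRIDE = 1024 + 16;   // +1 cache line of padding
    std::vector<int32_t> heap(TUPLES * PADDED_STRIDE, 1);
    // Time these two calls (e.g., with perf or std::chrono): the padded
    // stride typically scans measurably faster despite the wasted space.
    printf("%lld %lld\n",
           (long long)scan_attribute(heap, POW2_STRIDE, TUPLES),
           (long long)scan_attribute(heap, PADDED_STRIDE, TUPLES));
}
```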
Endre Palatinus, Jens Dittrich
Backmatter
Metadata
Title
Data Management on New Hardware
Editors
Spyros Blanas
Rajesh Bordawekar
Tirthankar Lahiri
Justin Levandoski
Andrew Pavlo
Copyright Year
2017
Electronic ISBN
978-3-319-56111-0
Print ISBN
978-3-319-56110-3
DOI
https://doi.org/10.1007/978-3-319-56111-0
