
2013 | Book

A Course in In-Memory Data Management

The Inner Mechanics of In-Memory Databases


About this Book

Recent achievements in hardware and software development, such as multi-core CPUs and DRAM capacities of multiple terabytes per server, enabled the introduction of a revolutionary technology: in-memory data management. This technology supports the flexible and extremely fast analysis of massive amounts of enterprise data. Professor Hasso Plattner and his research group at the Hasso Plattner Institute in Potsdam, Germany, have been investigating and teaching the corresponding concepts and their adoption in the software industry for years.

This book is based on the first online course on the openHPI e-learning platform, which was launched in autumn 2012 with more than 13,000 learners. The book is designed for students of computer science, software engineering, and IT-related subjects. However, it addresses business experts, decision makers, software developers, technology experts, and IT analysts alike. Plattner and his group focus on exploring the inner mechanics of a column-oriented dictionary-encoded in-memory database. Covered topics include, among others, physical data storage and access, basic database operators, compression mechanisms, and parallel join algorithms. Beyond that, implications for future enterprise applications and their development are discussed. Readers are led to understand the radical differences and advantages of the new technology over traditional row-oriented disk-based databases.

Table of Contents

Frontmatter
Chapter 1. Introduction

This book, A Course in In-Memory Data Management, focuses on the technical details of in-memory columnar databases. In-memory databases, and especially column-oriented databases, have been an intensively researched topic in recent years [BMK09, KNF+12, Pla09]. With modern hardware technologies and increasing main memory capacities, groundbreaking new applications are becoming viable.

Hasso Plattner

The Future of Enterprise Computing

Frontmatter
Chapter 2. New Requirements for Enterprise Computing

When thinking about developing a completely new database management system for enterprise computing, the question arises whether such a system is needed at all. And the answer is yes! Modern companies have changed dramatically.

Hasso Plattner
Chapter 3. Enterprise Application Characteristics

An enterprise data management system should be able to handle data coming from several different source types.

Hasso Plattner
Chapter 4. Changes in Hardware

This chapter deals with hardware and lays the foundation for understanding how changing hardware impacts software and application development; it is partly taken from [SKP12].

Hasso Plattner
Chapter 5. A Blueprint of SanssouciDB

SanssouciDB is a prototypical database system for unified analytical and transactional processing. The concepts of SanssouciDB build on prototypes developed at the HPI and an existing SAP database system. SanssouciDB is an SQL database and contains components similar to those of other databases, such as a query builder, a plan executor, metadata, a transaction manager, etc.

Hasso Plattner

Foundations of Database Storage Techniques

Frontmatter
Chapter 6. Dictionary Encoding

Since memory is the new bottleneck, access to it must be minimized. On the one hand, this can be achieved by accessing fewer columns, so that only the required attributes are queried. On the other hand, decreasing the number of bits used for data representation reduces both memory consumption and memory access times.
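The idea can be illustrated with a minimal Python sketch of dictionary encoding, assuming a sorted dictionary and an attribute vector of integer positions (the function names and sample data are illustrative, not taken from the book):

```python
def dictionary_encode(values):
    """Encode a column as a sorted dictionary plus an attribute vector.

    Each distinct value is stored once in the sorted dictionary; the
    attribute vector holds only small integer positions into it.
    """
    dictionary = sorted(set(values))
    position = {v: i for i, v in enumerate(dictionary)}
    attribute_vector = [position[v] for v in values]
    return dictionary, attribute_vector

def decode(dictionary, attribute_vector):
    """Reconstruct the original column from its encoded form."""
    return [dictionary[i] for i in attribute_vector]

cities = ["Berlin", "Potsdam", "Berlin", "Hamburg", "Berlin"]
d, av = dictionary_encode(cities)
# d  == ["Berlin", "Hamburg", "Potsdam"]
# av == [0, 2, 0, 1, 0]
```

With only three distinct values, each attribute-vector entry needs just 2 bits instead of a full string, which is exactly the bit-width reduction the chapter refers to.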

Hasso Plattner
Chapter 7. Compression

As discussed in Chap. 5, SanssouciDB is a database architecture designed to run transactional and analytical workloads in enterprise computing. The underlying data set can easily reach a size of several terabytes in large companies. Although memory capacities of commodity servers are growing, it is still expensive to process those huge data sets entirely in main memory. Therefore, SanssouciDB and most modern in-memory storage engines use compression techniques on top of the initial dictionary encoding to decrease the total memory requirements. The columnar storage of data, as applied in SanssouciDB, is well suited for compression techniques, as data of the same type and domain is stored consecutively.
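As one example of a technique that benefits from consecutively stored values of the same domain, a run-length encoder over an attribute vector can be sketched as follows (a simplified illustration, not the book's implementation):

```python
def run_length_encode(attribute_vector):
    """Compress a sequence as (value, run_length) pairs.

    Effective when equal values are stored consecutively, e.g. in a
    column sorted by this attribute.
    """
    runs = []
    for v in attribute_vector:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([v, 1])       # start a new run
    return [(v, n) for v, n in runs]

def run_length_decode(runs):
    """Expand the (value, run_length) pairs back to the full sequence."""
    return [v for v, n in runs for _ in range(n)]

av = [0, 0, 0, 1, 1, 2, 2, 2, 2]
# run_length_encode(av) == [(0, 3), (1, 2), (2, 4)]
```

Nine entries shrink to three pairs; on an unsorted column with short runs, the same scheme can grow the data instead, which is why compression schemes are chosen per column.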

Hasso Plattner
Chapter 8. Data Layout in Main Memory

In this chapter, we address the question of how data is organized in memory. Relational database tables have a two-dimensional structure, but main memory is organized one-dimensionally, providing memory addresses that start at zero and increase serially to the highest available location. The database storage layer has to decide how to map the two-dimensional table structures to the linear memory address space.
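The two possible mappings can be sketched with a flat Python list standing in for linear memory (the table contents are made up for illustration):

```python
table = [
    ["Jane", "Berlin"],
    ["John", "Potsdam"],
    ["Mary", "Berlin"],
]
n_rows, n_cols = len(table), len(table[0])

# Row layout: tuples are stored one after another.
row_layout = [cell for row in table for cell in row]

# Columnar layout: each attribute is stored contiguously.
col_layout = [table[r][c] for c in range(n_cols) for r in range(n_rows)]

# Simple address arithmetic recovers any cell from either layout.
def cell_row_layout(r, c):
    return row_layout[r * n_cols + c]

def cell_col_layout(r, c):
    return col_layout[c * n_rows + r]
```

Both layouts hold the same cells; they differ only in which accesses are sequential, which is what makes the choice performance-critical.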

Hasso Plattner
Chapter 9. Partitioning

Partitioning is the process of dividing a logical database into distinct independent datasets. Partitions are database objects themselves and can be managed independently. The main reason to apply data partitioning is to achieve data-level parallelism.
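One common scheme is range partitioning, sketched here in Python with hypothetical order data (boundary values and row shapes are illustrative assumptions):

```python
import bisect

def range_partition(rows, key, boundaries):
    """Assign each row to a partition by comparing its key against
    sorted boundary values; each partition can then be processed
    independently, giving data-level parallelism."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        partitions[bisect.bisect_right(boundaries, key(row))].append(row)
    return partitions

orders = [("o1", 2010), ("o2", 2012), ("o3", 2011), ("o4", 2013)]
parts = range_partition(orders, key=lambda r: r[1], boundaries=[2011, 2012])
# parts[0] holds orders before 2011, parts[1] the year 2011,
# parts[2] everything from 2012 onwards.
```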

Hasso Plattner

In-Memory Database Operators

Frontmatter
Chapter 10. Delete

The delete operation terminates the validity of a given tuple. It stores the information in the database that a certain item is no longer valid. This operation can be of either physical or logical nature. A physical delete operation removes an item from the database so that it is no longer physically accessible. In contrast, a logical delete operation only terminates the validity of an item in the dataset, but keeps the tuple available for temporal queries [Pla09].
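A logical delete can be modeled with a validity interval per tuple, as in this simplified Python sketch (field names and the integer timestamps are illustrative assumptions):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class VersionedTuple:
    data: dict
    valid_from: int
    valid_to: Optional[int] = None   # None means: still valid

def logical_delete(t, now):
    """Terminate validity without removing the tuple; it remains
    available for temporal queries."""
    t.valid_to = now

def snapshot(table, at):
    """Return all tuples that were valid at a given point in time."""
    return [t for t in table
            if t.valid_from <= at and (t.valid_to is None or at < t.valid_to)]

table = [VersionedTuple({"id": 1}, valid_from=0),
         VersionedTuple({"id": 2}, valid_from=0)]
logical_delete(table[0], now=5)
```

A snapshot at time 3 still sees both tuples; a snapshot at time 7 sees only the second one, while the first stays queryable for its historical interval.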

Hasso Plattner
Chapter 11. Insert

This chapter outlines what happens when inserting a new tuple into a table (execution of an insert statement). Compared to a row-based database, the insert in a column store is a bit more complicated. In a row-oriented database, the new tuple is simply appended to the end of the table, i.e., the tuple is stored as one piece. SanssouciDB uses column orientation to store the data physically. A detailed description of the differences between row store and column store is given in Chap. 8. So, adding a new tuple to the database means adding a new entry to every column that the table comprises. Internally, every column consists of a dictionary and an attribute vector (see Chap. 6). Adding a new entry to a column means checking the dictionary and adding a new value if necessary. Afterwards, the position of the respective dictionary entry is added to the attribute vector of the column. Since the dictionary is sorted, adding a new entry to a column results in three different scenarios:
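The three scenarios for a single column can be sketched in Python, assuming the sorted-dictionary representation from Chap. 6 (the function name and data are illustrative):

```python
import bisect

def insert_value(dictionary, attribute_vector, value):
    """Insert one value into a column with a sorted dictionary.

    Three cases: the value already exists; it sorts after all existing
    entries (a simple append to the dictionary); or it falls in between,
    so existing positions in the attribute vector must be rewritten.
    """
    pos = bisect.bisect_left(dictionary, value)
    if pos < len(dictionary) and dictionary[pos] == value:
        attribute_vector.append(pos)            # case 1: value already known
    else:
        dictionary.insert(pos, value)           # cases 2 and 3: new entry
        if pos < len(dictionary) - 1:           # case 3: mid-dictionary insert
            # every reference to a shifted entry must be incremented
            for i, v in enumerate(attribute_vector):
                if v >= pos:
                    attribute_vector[i] = v + 1
        attribute_vector.append(pos)
```

Case 3 is the expensive one: a single new value can force a rewrite of the entire attribute vector, which motivates the differential buffer discussed later in the book.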

Hasso Plattner
Chapter 12. Update

The “UPDATE” is part of SQL’s data manipulation language (DML) and is used for changing one or more tuples in a table.

Hasso Plattner
Chapter 13. Tuple Reconstruction

As mentioned earlier, data with matrix characteristics can be stored in linear memory either column by column (columnar layout) or row by row (row layout). The impacts were already discussed in Chap. 8 in more detail. The columnar layout is optimized for analytical set-based operations that work on many rows but for a notably smaller subset of all columns of data. The row layout shows a better performance for select operations on few single tuples. In this chapter, we discuss the operations needed for tuple reconstruction in detail and explain the influence of the different layouts on the performance of these operations. Tuple reconstruction is a typical functionality in OLTP applications. It is executed whenever more than one column is requested from the database, for example when the user in an ERP system calls the “show” or “edit” transactions for the master data object or for a document.
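In a dictionary-encoded column store, reconstructing one tuple means one lookup per column, which this small Python sketch makes explicit (column contents are hypothetical):

```python
def reconstruct_tuple(columns, row_id):
    """Rebuild one logical row from dictionary-encoded columns.

    Each column is a (dictionary, attribute_vector) pair. One lookup
    per column is needed, which is why reading many attributes of a
    single tuple favours the row layout, while set-based reads of few
    columns favour the columnar layout.
    """
    return [dictionary[attribute_vector[row_id]]
            for dictionary, attribute_vector in columns]

# Two illustrative columns of the same table: name and city.
name_col = (["Jane", "John"], [1, 0, 1])
city_col = (["Berlin", "Potsdam"], [0, 0, 1])
# reconstruct_tuple([name_col, city_col], 2) -> ["John", "Potsdam"]
```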

Hasso Plattner
Chapter 14. Scan Performance

In this chapter, we discuss the performance of the scan operation. Scan operations require the values of a single attribute or a small set of attributes but go through the whole dataset. Scan operations search one or more attributes for a certain value.

Hasso Plattner
Chapter 15. Select

In this chapter, we describe how an application can extract data stored in the database (execution of the SELECT statement).

Hasso Plattner
Chapter 16. Materialization Strategies

SQL is the most common language to interact with databases. Users are accustomed to the table-oriented output format of SQL. To provide the same data interfaces as known from row stores in column stores, the returned results have to be transformed into tuples in row format. The process of transforming encoded columnar data into row-oriented tuples is called materialization.

Hasso Plattner
Chapter 17. Parallel Data Processing

In the following, we discuss how to achieve parallelism in in-memory and traditional database management systems. Pipelined parallelism and data parallelism are two approaches to speed up query processing.

Hasso Plattner
Chapter 18. Indices

Usually, applications work only with a subset of records at a time. Therefore, before processing this subset, it must be located within the database. Hence, records should be stored in a manner that makes it possible to locate them efficiently whenever they are needed. The process of locating a specific set of records is determined by the predicates that are used to characterize these records.

Hasso Plattner
Chapter 19. Join

This chapter is about joins and their execution in in-memory database systems. In general, joins are a way to combine tuples of two or more tables. There are two general categories of joins, which can then be further specialized.

Hasso Plattner
Chapter 20. Aggregate Functions

This chapter discusses aggregate functions. It outlines what aggregate functions are, how they work, and how they can be executed in an in-memory database system.

Hasso Plattner
Chapter 21. Parallel Select

Chapter 18 introduced the concept of an inverted index to prevent the database system from performing a full column scan every time a query searches for a predicate in the column. However, maintaining an index is expensive and consumes additional memory. So the decision to use an index should be made carefully, balancing all the pros and cons an index would bring in the particular situation. This chapter discusses how to speed up a full column scan without adding an index to it. Chapter 17 introduced parallelism as a means to speed up the execution of database operations. In this chapter, we present a detailed description of how parallelism can be used to speed up the execution of a SELECT operation.

Hasso Plattner
Chapter 22. Workload Management and Scheduling

One of the most important factors that determine the usability of an application is response time. Psychological studies show that the acceptable maximum application response time for a human is about three seconds. After three seconds, the user might lose concentration and turn to something else. Once the application has finished its processing and is ready for the next human input, the user has to focus on the application again. These context switches are extremely expensive, as they constantly interrupt the user and make them unproductive.

Hasso Plattner
Chapter 23. Parallel Join

Join procedures are cost intensive tasks even for in-memory databases. As today’s systems introduce more and more parallelism, intra-operator parallelization moves into focus. This chapter discusses possible schemes to parallelize a hash-join algorithm, as described in Chap. 19.
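One such scheme, much simplified, keeps the build phase sequential and parallelizes only the probe phase over chunks of the larger input. This Python sketch illustrates the idea under that assumption (it is not the book's implementation, and real systems would also parallelize the build):

```python
from concurrent.futures import ThreadPoolExecutor
from collections import defaultdict

def parallel_hash_join(left, right, workers=4):
    """Hash join with a sequential build phase and a parallel probe phase."""
    # Build phase: hash the (assumed smaller) left input on its join key.
    buckets = defaultdict(list)
    for key, payload in left:
        buckets[key].append(payload)
    table = dict(buckets)  # frozen into a plain dict, read-only during probing

    def probe(chunk):
        return [(key, l, r) for key, r in chunk for l in table.get(key, ())]

    # Probe phase: split the larger input into chunks and probe in parallel.
    size = max(1, len(right) // workers)
    chunks = [right[i:i + size] for i in range(0, len(right), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return [row for part in pool.map(probe, chunks) for row in part]
```

Because the hash table is only read during probing, the worker threads need no synchronization, which is what makes this intra-operator parallelization attractive.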

Hasso Plattner
Chapter 24. Parallel Aggregation

Similar to the parallel join mechanism described in Chap. 23, aggregation operations can also be accelerated using parallelism and hash-based algorithms. In this chapter, we discuss how parallel aggregation is implemented in SanssouciDB. Note that multiple other ways to implement parallel aggregation are also conceivable. However, we focus on our parallel implementation using hashing and thread-local storage.
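The two-phase idea (a local hash table per worker, then a merge into a global table) can be sketched in Python; this is a deliberately simplified stand-in for SanssouciDB's thread-local-storage implementation, using a COUNT aggregate for brevity:

```python
from concurrent.futures import ThreadPoolExecutor
from collections import Counter

def parallel_count(values, workers=4):
    """Aggregation in two phases: each worker builds a local hash table
    (here a Counter) for its chunk, then the local tables are merged
    into one global result."""
    size = max(1, len(values) // workers)
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        local_tables = list(pool.map(Counter, chunks))  # phase 1: local
    total = Counter()
    for t in local_tables:
        total += t                                      # phase 2: merge
    return total
```

The local tables avoid contention on a shared structure during phase 1; only the comparatively cheap merge touches shared state.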

Hasso Plattner

Advanced Database Storage Techniques

Frontmatter
Chapter 25. Differential Buffer

The database architecture discussed so far was optimized for read operations. In the previously described approach, an insert of a single tuple can force a restructuring of the whole table if a new attribute value occurs and the dictionary has to be re-sorted. To overcome this problem, we introduce the differential buffer in this chapter.
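The core idea can be sketched in Python: writes go to a small write-optimized delta, and reads combine the main store and the delta (the class is a simplified illustration; the real main store is compressed and dictionary-encoded):

```python
class Column:
    """A column split into a read-optimized main store and a
    write-optimized differential buffer (delta)."""

    def __init__(self, main_values):
        self.main = list(main_values)   # stands in for the compressed main store
        self.delta = []                 # new values land here, untouched main

    def insert(self, value):
        self.delta.append(value)        # cheap append, no dictionary re-sort

    def scan(self, predicate):
        """Every read must consult both structures and combine the results."""
        return ([v for v in self.main if predicate(v)]
                + [v for v in self.delta if predicate(v)])

col = Column(["Berlin", "Potsdam"])
col.insert("Hamburg")                   # does not restructure the main store
```

The price is that reads now touch two structures, which is why the delta is periodically folded back into the main store by the merge process of Chap. 27.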

Hasso Plattner
Chapter 26. Insert-Only

Data stored in database tables changes over time, and these changes should be traceable, as enterprises need to access their historical data. Additionally, the ability to access any data that has ever been stored in the database is mandatory, e.g., for financial audits.

Hasso Plattner
Chapter 27. The Merge Process

Using a differential buffer as an additional data structure to improve write performance requires cyclically combining this data with the compressed main partition. This process is called the “merge”.

Hasso Plattner
Chapter 28. Logging

Databases need to provide durability guarantees (as part of ACID) to be used in productive enterprise applications. To provide these guarantees, fault tolerance and high availability have to be ensured. However, since hardware failures or power outages cannot be avoided or foreseen, measures have to be taken which allow the system to recover from failures.

Hasso Plattner
Chapter 29. Recovery

To handle steadily growing volumes of data and intensifying workloads, modern enterprise systems have to scale out, using multiple servers within the enterprise system landscape. With the growing number of servers—and consequently growing number of racks and CPUs—the probability of hardware-induced failures is rising.

Hasso Plattner
Chapter 30. On-the-Fly Database Reorganization

In typical enterprise applications, schema and data layout have to be changed from time to time. The main cases for such changes are software upgrades, software customization, or workload changes. Therefore, the option of database reorganization, such as adding an attribute to a table or changing attribute properties, is required.

Hasso Plattner

Foundations for a New Enterprise Application Development Era

Frontmatter
Chapter 31. Implications on Application Development

In the previous chapters, we introduced the ideas behind our new database architecture and its technical details. In addition, we showed that the in-memory approach can significantly improve the performance of existing database applications.

Hasso Plattner
Chapter 32. Database Views

Database views define a transformation rule that is processed when the underlying data item is accessed [PZ11]. Thus, views describe a structured subset of the complete data available in the database.

Hasso Plattner
Chapter 33. Handling Business Objects

Enterprise applications are typically developed in an object-oriented fashion: Real-world objects, such as production facilities or warehouses as well as artifacts like sales orders, are mapped to so-called business objects. A business object is an entity capable of storing information and state. It typically has a tree-like structure with leaves holding information about the object or connections to other business objects.

Hasso Plattner
Chapter 34. Bypass Solution

As illustrated throughout the course, in-memory data management can bring significant advantages to data processing within enterprises. However, the transition of enterprise applications to an in-memory database will require radical changes to data organization and processing, resulting in major adaptations throughout the entire stack of enterprise applications. Given the conservative upgrade policies of many ERP system customers, the adoption of in-memory technology is often delayed, because such radical changes do not align well with the evolutionary modification schemes of business-critical customer systems. Consequently, a risk-free approach is required to help enterprises immediately leverage in-memory data management technology without disrupting their existing enterprise systems.

Hasso Plattner
Backmatter
Metadata
Title
A Course in In-Memory Data Management
Author
Hasso Plattner
Copyright Year
2013
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-36524-9
Print ISBN
978-3-642-36523-2
DOI
https://doi.org/10.1007/978-3-642-36524-9