Introduction
-
Performance-oriented aspects. Such a DBMS should be able to load, store and query big data efficiently.
-
Management-oriented aspects. It should be easy for a database administrator to manage DBMS handling big data.
Database (self-)tuning basics
Related work
On-line tuning in row-stores
Study | Structures | Experiments | Notes |
---|---|---|---|
COLT [35] | Indexes | PostgreSQL | Uses storage budget constraint; employs “what-if” optimizer mode; manages active, hot and cold sets of indexes |
Alerter [36] | Indexes, materialized views | SQL server | Uses storage budget and mininum improvement constraints; Notifies DBA and provides a set of candidate structures |
Alerter [49] | Aggregation tables | Mondrian | Uses soft storage budget constraint |
AdaptPD [50] | Vertical partitioning | SQL server | Employs “what-if” optimizer mode with caching |
WFIT [51] | Indexes | IBM DB2 Express-C | Takes a workload and user feedback into account |
Materialized views, storage selection | Multiple | Tuning of multistore system physical design. Uses storage and transfer constraints | |
EVO [54] | Indexes | Multiple | Authors propose query plan transformations using genetic algorithm for index selection |
ARH [55] | Automatic re-indexing | PostgreSQL | A set of heuristics is used to decide when to trigger a re-indexing process to counter an index fragmentation |
Tuner [33] | Multiple | PostgreSQL | An on-line tool which tunes several physical structures—indexes, partitions, and is capable of tracking index interaction |
AutoStore [56] | Vertical partitioning | Custom | A comparison of on-line algorithms for vertical partitioning |
SMOPD [57] | Vertical partitioning | Custom | Closed itemset mining for on-line vertical paritioning |
Database tuning for column-stores
Study | Notes |
---|---|
SAP HANA [58] | In-memory, data reorganization |
H2O [60] | Query compilation and data reorganization |
Peloton [61] | In-memory system, can adapt data layout |
Vertica [62] | Projection recommender, not online |
C-store [63] | Recommender of materialized views, not online |
Cliffguard [64] | Robust configuration recommender, works with Vertica |
Snowflake [65] | Commercial, distributed, relies not on tuning, but on data pruning |
-
Currently, there is a heightened interest in on-line tuners in row-stores.
-
There is a shortage of on-line tuning tools for column-store systems, especially for distributed disk-based ones.
Column stores
-
A new approach to query processing: the problem of early/late materialization (when to perform tuple reconstruction), novel query plans (in some approaches a query plan is not a DAG anymore) and cost models, a new algebra of operations.
-
A new approach to operator design: new operators and their implementations are possible.
-
Compression is ubiquitous in this class of systems.
-
C-store is a disk-oriented column-store database. It features different sort orders, compression (different compression method may be used for each column), late materialization, special join operators and operations on compressed data.
-
MonetDB is a main memory-oriented column-store database. It puts a special emphasis on efficient hardware usage, e.g. minimization of CPU cache misses or utilization of hardware parallelism. It features a special algebra operating on columns, adaptive indexing technologies (index is a by-product of query execution) and operators designed for an efficient hardware usage.
-
To address the shortcomings of MonetDB, a new main-memory system called MonetDB/X100 was developed. The main novelties is the presence of block/vector of a column processing and a cooperative scans.
-
Supersonic [69] is an open-source in-memory columnar query engine which is oriented for efficient data processing. Its core features are cache consciousness, vectorized execution, instruction pipelining and SIMD usage.
-
Peloton [61] is an open-source in-memory DBMS designed for hybrid transaction-analytical processing. This system has a built-in on-line tuning component.
The design of a self-managed distributed column-store system
Problem, environment and queries
Environment
Queries
Alerter
-
The most common approach is to use a sliding window, which helps to keep track of the recently processed queries. These recent queries allow us to compute some aggregate characteristics, which are then used to decide whether to trigger reorganization or not.
-
Another approach is to use the knowledge of some predefined query patterns. For example, we might know the existence of query patterns like it is done in the reference [70]. We can detect these patterns or rely on some external knowledge (e.g. we know that data loading happens only at night).
-
A single query information can also be useful. Consider a query which takes a long time to complete. We can predict its performance and evaluate its plan. Then, we can start the evaluation and simultaneously start the physical reorganization. Later, we switch its evaluation to a new plan while reusing the partial results obtained earlier. The goal is to change it in such a way that the query would benefit from the reorganization on upcoming stages of processing.
Control of the reorganization
-
A kind of heuristic strategy which will guide the search behavior of a system. This “forgotten” approach was also used since earlier days of physical design tuning [37]. In this approach, no performance model is used; the process is guided by a rule set. It was employed as a separate algorithm or as a pre-filtering step in later cost-based studies [28] in order to lower computational complexity of the problem. Another prominent example is the group of affinity-based approaches [4, 37].
-
A combination of both.
Actions
Column relocation
-
Faster relocation;
-
Incremental relocation;
-
Cheap vertical partitioning;
-
Corrective query processing;
-
Additional candidate configurations for physical design.
Fast relocation
Incremental relocation
-
Perform column deletion when all columns of the relation were copied.
-
Perform column deletion right after it was copied. In this case, we immediately obtain a chunk of free space which can be used in a number of ways. For example, the node can start receiving columns of another relation.
-
Execute without taking relocation into account. This alternative would require too much of extra network communication.
-
Halt relocation, execute queries, resume relocation. This also requires too much network communication.
-
Finish relocation, then execute. In this case no inter-node parallelism is possible and query response time is large.
Cheap on-line vertical partitioning
New aspects of corrective query processing (CQP)
Additional candidate configurations for physical design
Column reorderings
Id | Name | Wage | Skill |
---|---|---|---|
1 | Ivan | 80 | C++ |
2 | Petr | 50 | C++ |
3 | Slava | 30 | Java |
4 | Vasya | 60 | Php |
5 | Sasha | 70 | Java |
Id | Name | Wage | Skill |
---|---|---|---|
3 | Slava | 30 | Java |
2 | Petr | 50 | C++ |
4 | Vasya | 60 | Php |
5 | Sasha | 70 | Java |
1 | Ivan | 80 | C++ |
Name | Wage | Skill |
---|---|---|
Ivan | 80 | C++ |
Petr | 50 | C++ |
Sasha | 70 | Java |
Slava | 30 | Java |
Vasya | 60 | Php |
-
Sort Order Selection Problem (SOSP): find a single good sort order for a given workload;
-
Multiple Sort Order Selection Problem (MSOSP): find a good set of sort orders for a given workload and a given storage budget;
-
Dynamization of SOSP and MSOSP: find a good set of sort orders for a dynamically changing workload and a given storage budget. This formulation considers the physical design problem in a dynamic environment, where query patterns change over time. The system should adapt stored sort orders to provide the best performance. While two previous formulations were already mentioned or even addressed in the past [63, 75], to the best of our knowledge, there are no studies which consider dynamic formulation for column-stores.
Column duplication
Horizontal partitioning
-
The partition can be constructed for a lesser cost in terms of the number of accessed pages. Thus, one can achieve higher repartitioning speed compared to row-stores. This speed will provide a substantial benefit to an on-line tuning system, allowing it to adapt faster. Compression can be a potential source of problems, but the lightweight compression methods [43, 44] employed by the column-stores should not present any difficulties.
-
The control of partitioning granularity—one can perform a partitioning of a subset of columns. For example, we can horizontally partition the “fat” column only without touching the rest. However, there could be queries which would benefit from partitioning columns together.
-
The enlargement of the configuration space and the increased number of useful configurations lead to improved performance of query processing.
Column-store specific actions
-
Adaptive indexing. Adaptive indexing is a technique which constructs indexes on-the-fly during query processing. This approach is extensively used in column-store systems [44, 79‐81]. To the best of our knowledge, there is only one study [82] related to selection and management of such indexes. Moreover, employing these indexes in a distributed environment looks promising because there are a lot of potential scenarios which may benefit from this structure. For example, one can keep two copies of a column on different nodes and perform a reorganization without any overhead for the processing. This can be achieved by alternating working and idle nodes. The application of these indexes in a distributed context also was not studied.
-
Join index [75, section 2]. This is a structure used to speed up the reconstruction of a tuple by linking parts of a single record in a different projections. The follow-up paper [43, section 3.2] stated that early experiments showed that this technique was inefficient due to several reasons. However, in case of an adaptive system these results should be reconsidered since there are several novel scenarios. Unfortunately, there are no published studies regarding these experiments.