1 Introduction

AQP focuses on aggregation queries with functions such as SUM, AVG, COUNT, MAX and MIN, because statistical tools can be used to give approximate results for numerical types of answers. Our example database contains the tables orders O, customers C, states ST, and the relations among the tables. We consider the following three use cases of AQP.
Pre-computation-based methods include pre-computed sample-based AQP (PSAQ), which needs to make assumptions on QCS or queries, as well as Histogram [88], Wavelet [46], and Sketch [14]. The advantages of these techniques are that the results are more accurate on skewed data and the query processing is fast (as they do not need to select samples on the fly), but they have some limitations. First, they cannot support general queries, especially complex nested queries. Second, they require too much storage to store the synopses. We will discuss more details in Sect. 3. PSAQ, Histogram, Wavelet and Sketch need to know the queries or data in advance and generate synopses offline. They can support skewed data well but involve large space to store the synopses.

|  | Query | Pre-computed information | Skewed data |
|---|---|---|---|
| \({\mathtt{OLA}}\) | No | Index or data distribution | \(\times \) |
| \({\mathtt{PSAQ}}\) | Queries or QCS | Synopses | \(\surd \) |
| \({\mathtt{Wavelet}}\) | Queries | Synopses | \(\surd \) |
| \({\mathtt{Histogram}}\) | Queries | Synopses | \(\surd \) |
| \({\mathtt{Sketch}}\) | Queries | Synopses | \(\surd \) |
An earlier survey covered four offline synopses, i.e., Sample, Histogram, Wavelet and Sketch [29]. However, it only surveyed offline synopses and did not cover online AQP techniques; besides, the new techniques after 2012 were not surveyed and analyzed. There were also three keynotes on AQP at SIGMOD 2017: Surajit Chaudhuri focused on online aggregation [22]; Mozafari [78] emphasized new challenges and opportunities, including interfaces, effective query planning and the theory of database learning; and Tim Kraska focused on their newly built interactive data exploration system IDEA [66].

2 Online Aggregation Methods
2.1 Online Aggregation
Pf-ola [93] aims to perform online aggregation in parallel, where the estimated results and corresponding confidence bounds are continuously refined based on the selected samples during query processing. These parallel techniques avoid spending extra time on error estimation during query processing. G-OLA [111] is an online aggregation architecture which can deal with arbitrarily nested aggregates using efficient delta maintenance techniques. G-OLA randomly partitions the dataset into smaller uniform batches and computes a “delta update” [111] on each mini-batch of data. Then, by carefully partitioning the intermediate results of nested queries, it can iteratively refine the query results.

2.2 Error Estimation
2.2.1 Error Estimation with Known Distribution
Take AVG(S) on a normal distribution as an example (see Fig. 3).
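To make the known-distribution case concrete, the following sketch (our own illustration with made-up data, not code from any surveyed system) computes a 95% confidence interval for AVG(S) under the assumption that the sampling distribution of the mean is normal:

```python
import math
import random

def avg_confidence_interval(sample, z=1.96):
    """95% CI for AVG(S), assuming the sampling distribution
    of the mean is (approximately) normal."""
    n = len(sample)
    mean = sum(sample) / n
    # Unbiased sample variance, then the standard error of the mean.
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    se = math.sqrt(var / n)
    return mean - z * se, mean + z * se

random.seed(0)
# Hypothetical population roughly matching the Profit column's scale.
population = [random.gauss(1500, 200) for _ in range(100_000)]
s = random.sample(population, 400)
lo, hi = avg_confidence_interval(s)
print(lo, hi)  # a narrow interval around the true mean (~1500)
```

The half-width z·σ/√n plays the role of the interval half-length: quadrupling the sample size halves it.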
2.2.2 Error Estimation without Known Distribution
To estimate the error of AGG(S), there are mainly three methods: Bootstrap, closed-form estimation, and deterministic bounds. If many samples are drawn, many values of AGG(S) can be computed, and these values compose a distribution which can be used to estimate the aggregation result. Then a confidence interval is computed by Pr([\(\texttt {AGG}({\mathbf S })-l, \texttt {AGG}({\mathbf S} )+l\)])=c to estimate the aggregation answer, where 2l is the length of the confidence interval, Pr is an estimation function (e.g., probability density function) of the distribution of AGG(S), and c is the confidence. The challenge is then how to estimate AGG(S) and the distribution of AGG(S).

To estimate the distribution of AGG(S), we could in principle sample many times (e.g., 10,000) from the population D, and the inferential statistics of the samples can be used to estimate the sampling distribution of AGG(S); e.g., the standard error is an estimate of the standard deviation of that distribution. However, sampling many times from the whole population D is too expensive, or even infeasible when the population is unknown and there is only one sample S. To address this problem, resampling techniques such as Bootstrap have been proposed to estimate the result error [82]. The concept of Bootstrap has been well known in statistics for more than half a century and has been used to estimate errors in relational databases [90]. Recently, Bootstrap has been borrowed to estimate the errors of AQP [113].

Bootstrap repeatedly draws resamples from S to estimate the distribution of AGG(S). However, if it draws samples from S too many times, it is still expensive; in practice, it draws a reasonable number of resamples, e.g., 100. For example, in Fig. 4a, we draw samples from S k times and compute a distribution, and then we can compute a confidence interval based on the bootstrap distribution. Interested readers are referred to [53] for more details on Bootstrap. Bootstrap may fail, however, when the aggregation is sensitive to extreme values (e.g., MAX) or the size of S is too small.
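A minimal sketch of the percentile Bootstrap (our own illustration, not code from the surveyed systems): resample S with replacement k times, compute AGG on each resample, and take quantiles of the resulting distribution as the confidence interval.

```python
import random

def bootstrap_ci(sample, agg, k=200, confidence=0.95):
    """Percentile-bootstrap confidence interval for agg(sample)."""
    n = len(sample)
    stats = []
    for _ in range(k):
        # Resample with replacement from the single sample S.
        resample = [random.choice(sample) for _ in range(n)]
        stats.append(agg(resample))
    stats.sort()
    alpha = (1 - confidence) / 2
    lo = stats[int(alpha * k)]
    hi = stats[min(k - 1, int((1 - alpha) * k))]
    return lo, hi

random.seed(1)
# Hypothetical sample S drawn from an unknown population.
s = [random.gauss(1500, 200) for _ in range(500)]
avg = lambda xs: sum(xs) / len(xs)
lo, hi = bootstrap_ci(s, avg)
print(lo, hi)  # an interval around the sample mean
```

Note that only S is touched; the population is never resampled, which is exactly why Bootstrap fits the single-sample setting described above.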
Based on the central limit theorem, the distribution of AGG(S) can be approximated as \(N(\texttt {AGG}({\mathbf S} ),\sigma ^2)\), where \(\sigma \) can be computed from the mean squared error Var(S) [20]. This method is known as closed-form estimation [6], as shown in Fig. 4b. Computing Var(S) for a small dataset S is faster than brute-force resampling. However, this method only works for COUNT, SUM, AVG and VARIANCE, and fails on queries whose variance is hard to compute, such as MAX, MIN or user-defined functions.

The third method computes a deterministic bound of AGG. For example, when computing SUM, the bound is determined by MAX and MIN, as shown in Fig. 4. In practice, it computes a bound much bigger than the real width of the sampling distribution.

2.3 Online Aggregation on Multiple Tables
2.4 Online AQP in Distributed Setting
EARL [69] (Early Accurate Result Library) generates online uniform samples from HDFS and uses Bootstrap to incrementally evaluate the accuracy of the results computed so far. It supports complex queries well; however, it does not consider the skewness of the dataset distribution. ApproxHadoop [44] assumes that the datasets are uniformly distributed across clusters, but if the sub-datasets are not uniformly distributed, it cannot work well, because random sampling on the cluster will generate a biased sample. To overcome these limitations, Sapprox [114] constructs a probabilistic SegMap (i.e., it collects the distribution of subsets) of segments offline and uses these results to generate an online sampling plan.

2.5 Other Online AQP Methods
2.5.1 Database Learning
2.5.2 Approximate Hardware
EnerJ [96] is developed as an extension of Java that adds approximate data types. EnerJ also proposes a hardware architecture that offers explicit approximate storage and computation. When programming with EnerJ, one can use approximate data types for computations that only need approximate instead of exact answers. Several applications of EnerJ show that these extensions are expressive and effective: a small number of annotations leads to significant potential energy savings at the expense of very little accuracy.

ApproxiDB [51] is the first hybrid data management system based on hybrid hardware, combining approximate hardware and precise hardware; thus, it not only supports approximate query processing but can also return an exact answer. When the system can answer the query exactly within the time constraints, it uses the precise hardware; otherwise, it uses the approximate hardware. ApproxiDB [51] is also the first work to propose the concept of “approximate hardware”; it summarizes existing approximate-hardware works and concludes their challenges and opportunities well.

2.5.3 Other Works
DAQ [92] is a variant of OLA which borrows ideas from probabilistic databases and iteratively uses the high-order bits of numerical data to compute the approximation. For example, a DAQ scheme stores the numbers in column PROFIT in Fig. 1 using a “Bitsliced Index” [92]. If we query MAX on the column PROFIT, DAQ checks the first bit of the numbers in the 16 tuples of O; if there is only one tuple whose first bit is ‘1’, we get the exact answer without traversing all the bits (e.g., 32 bits); otherwise, we check the next bit until finding the maximum one. Unfortunately, such a technique can only support simple queries over numerical columns (such as SUM and AVG) but cannot support general SQL queries.
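The bit-by-bit MAX procedure can be sketched as follows (our own illustration of the idea, not DAQ's actual implementation; the Profit values are the ones from the running example):

```python
def bitsliced_max(values, bits=32):
    """Find MAX by scanning bit positions from most to least
    significant, narrowing the candidate set at each step."""
    candidates = list(values)
    for pos in range(bits - 1, -1, -1):
        with_bit = [v for v in candidates if v >> pos & 1]
        if len(with_bit) == 1:
            # Only one candidate has this high bit set: it is the max,
            # so we stop early without examining the remaining bits.
            return with_bit[0]
        if with_bit:
            # Keep only candidates whose current bit is set.
            candidates = with_bit
    return candidates[0]

profits = [1120, 1170, 1230, 1250, 1290, 1350, 1417, 1460,
           1470, 1560, 1630, 1673, 1732, 1890, 1983, 2000]
print(bitsliced_max(profits))  # 2000
```

On this data the scan stops at bit 6, where 2000 is the only remaining candidate, so the low-order bits are never read.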
Sample+Seek [31] proposes to design different plans for big groups and rare groups, i.e., “sample” (uniform or measure-biased sampling) for big groups and “seek” (index lookup) for rare groups. This work introduces a new precision metric, called distribution precision, to provide error guarantees for queries. It also provides a measure-biased sampling method to support any aggregation that can be estimated from random samples within a user-given error bound.

3 Offline Methods
In this section, we first introduce the four offline methods: pre-computed sample-based AQP (PSAQ), Histogram, Wavelet and Sketch. Then, we introduce other recent offline methods with bounded guarantees.

3.1 Pre-computed Samples
This section introduces PSAQ, which generates samples offline and uses these samples to answer online queries. Note that PSAQ assumes the query workload is relatively stable, i.e., the queries will not change dynamically.

A naive, query-based PSAQ builds a synopsis for each query in the workload and uses the synopsis to answer future queries. For example, given the query in Case 1, which computes the aggregation result for customer \(c_1\), we can build a synopsis using the samples of this query, e.g., \(o_1, o_3\). Then we can use the synopsis to answer future queries that contain the same column but may use different aggregation functions. A challenge here is, given a space budget, how to select the queries for which offline synopses are built. One can also merge the synopses of different queries to reduce the synopsis size. This method has two limitations. First, if the query workload is large, it generates many synopses. Second, the synopsis of a query can only be used to answer this query but not other queries. For example, it can only answer queries for customer \(c_1\) but not for other customers.

To address the above problems, query column set (QCS)-based PSAQ was proposed [7]. The column set of a query is the set of all columns in the query (including the select, where and group clauses). This method groups the queries based on their column sets, and queries with the same column set fall into the same group. For each group, the method selects samples for the columns in the group. It can then use the synopsis to answer queries with the same QCS (or queries whose columns are contained in the QCS). For example, suppose many queries in the workload contain the columns CustomerID and Profit; it does not need to build samples for each of the queries. Instead, this method builds a sample for the column set (CustomerID, Profit), and thus saves a lot of space. BlinkDB [7] studies how to select samples for each QCS and uses the samples to answer online queries in a distributed file system. BlinkDB also tries to share samples among different QCSs. For example, a sample for (CustomerID, Profit) can be used to answer queries on column CustomerID.

Stratified sampling methods (e.g., AQUA [2], START [21], BlinkDB [7], Babcock [11]) have been proposed to deal with sparse data. The common idea of these works is to select more tuples for big groups while still selecting enough tuples in rare groups (which may be lost in a random sample).
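The stratified idea can be sketched as follows (our own illustration with hypothetical group sizes): guarantee a minimum number of tuples per group so that rare groups survive even on heavily skewed data.

```python
import random

def stratified_sample(rows, group_key, per_group=3):
    """Take up to `per_group` tuples from every group, so rare
    groups are represented even when the data is skewed."""
    groups = {}
    for row in rows:
        groups.setdefault(group_key(row), []).append(row)
    sample = []
    for members in groups.values():
        k = min(per_group, len(members))
        sample.extend(random.sample(members, k))
    return sample

random.seed(0)
# Skewed toy data: customer 'c1' dominates, 'c9' is rare.
rows = [("c1", i) for i in range(1000)] + [("c9", 7)]
s = stratified_sample(rows, group_key=lambda r: r[0])
print(len(s))  # 4: three tuples from 'c1', one from 'c9'
```

A uniform random sample of the same size would miss 'c9' almost every time, which is exactly the failure mode these methods address.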
PSAQ also uses closed-form estimation and Bootstrap [6, 90, 113] to diagnose the results using multiple samples obtained via resampling. The error analysis is similar to that of online sampling, but offline methods can spend more time analyzing the errors with thousands of resamples.

3.2 Histograms
A Histogram summarizes a dataset by dividing it into multiple buckets based on the values in a numerical column [88]. For each bucket, it stores the most representative statistics, which can be used to reconstruct the values of the dataset in this bucket, e.g., the lower and upper bounds of the bucket and the count of values in it. Histograms have been widely studied and incorporated into commercial relational databases, where they can be easily constructed and used for estimation [28, 91]. Histograms include equi-depth and equi-width histograms: the former has the same bucket size (number of values per bucket) and the latter has the same bucket width (i.e., the difference between the maximal and minimal value of each bucket). For example, the values of column Profit in O are {1120, 1170, 1230, 1250, 1290, 1350, 1417, 1460, 1470, 1560, 1630, 1673, 1732, 1890, 1983, 2000}. An equi-width Histogram [29] splits the values into buckets of the same width (e.g., 200), giving {(1100, 1300], 5}, {(1300, 1500], 4}, {(1500, 1700], 3}, {(1700, 1900], 2} and {(1900, 2100], 2}. For a query with the AVG function on this attribute, one can compute \(\frac{1100*5+1300*4+1500*3+1700*2+1900*2}{16}=1400.0\) from the bucket lower bounds to approximate the answer. An equi-depth histogram selects bucket boundaries so that each bucket contains the same number of data points. For example, if the bucket depth is 4, the column Profit in O is divided into {(1100, 1250]}, {(1250, 1460]}, {(1460, 1700]} and {(1700, 2000]}.
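The equi-width bucketing and the bucket-lower-bound estimate shown above can be sketched as follows (an illustration with our own function names, not code from any surveyed system):

```python
def equi_width_histogram(values, low, width):
    """Bucket values into right-closed (low+i*w, low+(i+1)*w] buckets,
    keeping only each bucket's lower bound and its count."""
    buckets = {}
    for v in values:
        i = (v - low - 1) // width  # right-closed bucket index
        lower = low + i * width
        buckets[lower] = buckets.get(lower, 0) + 1
    return buckets

def approx_avg(buckets):
    """Estimate the average from the histogram using lower bounds."""
    total = sum(lower * count for lower, count in buckets.items())
    n = sum(buckets.values())
    return total / n

profits = [1120, 1170, 1230, 1250, 1290, 1350, 1417, 1460,
           1470, 1560, 1630, 1673, 1732, 1890, 1983, 2000]
h = equi_width_histogram(profits, low=1100, width=200)
print(h)              # {1100: 5, 1300: 4, 1500: 3, 1700: 2, 1900: 2}
print(approx_avg(h))  # 1400.0
```

Using bucket midpoints instead of lower bounds would roughly halve the bias of this estimate; the lower-bound version is shown only to reproduce the number in the text.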
A challenge of the Histogram method is to find appropriate algorithms to decide the buckets. The bucketing strategy should consider both the number of buckets (the fewer the better) and the accuracy (the higher the better). Besides equi-width and equi-depth [29], many other types of Histograms, such as the Singleton-Bucket Histogram [54, 55] and the Maxdiff Histogram [91], have been widely studied. More complex methods have been designed to find bucketing schemes that trade off efficiency and accuracy. A recent work proposed near-optimal Histogram-based algorithms for describing the distribution of a dataset [1]. In addition, multi-dimensional Histograms have been proposed to support different applications. For example, Digithist [100] combines multi-dimensional Histograms, one-dimensional Histograms and grids to provide a tightly error-bounded Histogram for multi-dimensional data.

A limitation of the Histogram is that it only supports numerical columns and cannot accurately support complex SQL queries, e.g., range queries over multiple attributes. Moreover, storing a synopsis for each column costs too much space. The advantage is that a Histogram can process queries instantly and has quality guarantees.

3.3 Wavelets
Wavelet is conceptually close to the Histogram. Wavelet transforms the data and aims to keep the most expressive features in a Wavelet domain, while a Histogram simply produces buckets that are subsets of the original data. For example, if the numbers in column C of T are {1, 3, 4, 4}, a Haar-wavelet transform (HWT) decomposes them into the pairwise averages {2, 4} with detail coefficients {−1, 0}, and then compresses {2, 4} again into the average {3} with detail coefficient {−1}. By storing 3, {−1}, {−1, 0}, we can decompress to recover the original dataset exactly. By storing only 3, {−1}, we can approximately represent the original dataset as {2, 2, 4, 4} with loss {1, −1, 0, 0}. Then, if we query SUM over this numerical column, we decompress to {2, 2, 4, 4} and compute the value on it instead of on {1, 3, 4, 4}.
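The decomposition above can be sketched as follows (our own illustration of the standard Haar transform, assuming the input length is a power of two; not code from a surveyed system):

```python
def haar_decompose(data):
    """Full Haar-wavelet transform: returns [overall average,
    detail coefficients from coarsest to finest level]."""
    coeffs = []
    level = list(data)
    while len(level) > 1:
        avgs = [(a + b) / 2 for a, b in zip(level[::2], level[1::2])]
        details = [(a - b) / 2 for a, b in zip(level[::2], level[1::2])]
        coeffs = details + coeffs  # finer details go to the right
        level = avgs
    return level + coeffs

def haar_reconstruct(coeffs):
    """Inverse transform; zeroing out details gives a lossy approximation."""
    level = coeffs[:1]
    rest = coeffs[1:]
    while rest:
        details, rest = rest[:len(level)], rest[len(level):]
        level = [x for a, d in zip(level, details) for x in (a + d, a - d)]
    return level

c = [1, 3, 4, 4]
w = haar_decompose(c)
print(w)  # [3.0, -1.0, -1.0, 0.0]
# Keep the coarse coefficients, zero out the finest level:
print(haar_reconstruct([w[0], w[1], 0, 0]))  # [2.0, 2.0, 4.0, 4.0]
```

Storing all four coefficients reconstructs {1, 3, 4, 4} exactly; truncating to two gives the lossy {2, 2, 4, 4} of the example.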
Many variants of Wavelet have been widely studied in recent years. HWT is the most widely studied Wavelet; it selects the largest HWT coefficients as a synopsis, which provides an \(L_2\) error guarantee for data decompression [99]. Recent work focuses more on non-Euclidean Wavelets [46, 63-65]. Mytilinis et al. [80] developed parallel algorithms to generate Wavelets within an error bound.

3.4 Sketches
A Sketch [43] models a numerical column as a vector or matrix and transforms the data with a fixed matrix to construct the synopsis. For example, the well-known Bloom filter can be seen as a special case of a Sketch which maps data into a vector of bits. Sketch is not suitable for general relational databases but performs well on streaming data, where the sketch summary must be updated quickly and compactly. Sketch is not only fast but also easy to parallelize, and it can provide high approximation accuracy. Sketches have two main categories. The first is the “frequency-based sketch” [14], which focuses on the frequency distribution of the original dataset. The second is the “distinct-value-based sketch” [41], which counts the distinct values in a given multi-set. Different from other synopses, Sketch has also been used successfully to estimate the answers of COUNT and DISTINCT queries [29].
To support COUNT queries, a sketch may initialize a matrix C with \(d \times w\) zeros (d and w should be tuned to proper values); for each item t in the data stream and each integer j from 1 to d, it increases \({\mathbf C} [j,h_j(t)]\) by 1, where \(h_j\) is a hash function. Then, when a COUNT query asks for the number of occurrences of t in the data stream, it first sets the current answer a to the biggest number in matrix C, and then iteratively checks, for j from 1 to d, whether \({\mathbf C} [j,h_j(t)] \le a\); if so, it sets \(a={\mathbf C} [j,h_j(t)]\). After d iterations, a is the smallest of the d counters for t, which is an upper-bound estimate of the true count (exact only when no hash collisions inflate all of t's counters).
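This update/query procedure is essentially the Count-Min sketch; a minimal sketch of it (our own illustration, with small hand-picked parameters d and w):

```python
import hashlib

class CountMinSketch:
    """Frequency-based sketch: d hash rows of w counters each.
    Queries return an upper-bound estimate of an item's count."""
    def __init__(self, d=4, w=64):
        self.d, self.w = d, w
        self.C = [[0] * w for _ in range(d)]

    def _h(self, item, j):
        # One independent-ish hash per row, derived from SHA-256.
        digest = hashlib.sha256(f"{j}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.w

    def add(self, item):
        for j in range(self.d):
            self.C[j][self._h(item, j)] += 1

    def count(self, item):
        # Minimum over the d rows: never underestimates the true count.
        return min(self.C[j][self._h(item, j)] for j in range(self.d))

cms = CountMinSketch()
stream = ["a"] * 100 + ["b"] * 10 + ["c"]
for t in stream:
    cms.add(t)
print(cms.count("a"))  # >= 100; exactly 100 unless all rows collide
```

The memory is fixed at d·w counters regardless of stream length, which is what makes the structure attractive for streams.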
A recent Sketch technique [24] formulates a bias-aware linear sketching and recovery problem and proposes algorithms that generalize the widely used Count-Sketch and Count-Median algorithms. Due to its linearity, it can be easily implemented in streaming and distributed computation models. [89] proposes a Count-Min-Log Sketch method that improves the average relative error of the Count-Min Sketch within bounded storage via logarithm-based approximate counters instead of linear counters. [85] proposes the counting quotient filter (CQF) to support general counting operators for approximate membership queries. These techniques can be easily extended to database systems.

Sketch can also be used in many other fields, e.g., NLP [45], since it can count the frequency of words, conduct pseudo-word evaluation, find the semantic orientation of a word, and compute distributional similarity in the NLP domain. Sketch is also good at dealing with real-time systems, such as financial data streams, where the data changes dynamically. More details on different types of Sketch can be found in [27, 29].

3.5 Materialized Views
For example, suppose the aggregation function is SUM and g is the function for computing frequencies; then a data cube for SUM on table T is {1:22, 2:33, 3:18}.
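As a toy illustration of such a cube (the table contents below are hypothetical, chosen only to reproduce the figures above), a group-by SUM materialization can be precomputed once and then probed at query time:

```python
def build_cube(rows, key, value):
    """Materialize a SUM data cube: group key -> SUM(value)."""
    cube = {}
    for row in rows:
        k = key(row)
        cube[k] = cube.get(k, 0) + value(row)
    return cube

# Hypothetical table T: (group, measure) pairs summing to {1:22, 2:33, 3:18}.
T = [(1, 10), (1, 12), (2, 33), (3, 18)]
cube = build_cube(T, key=lambda r: r[0], value=lambda r: r[1])
print(cube)     # {1: 22, 2: 33, 3: 18}
print(cube[2])  # answers "SUM ... WHERE group = 2" without scanning T
```

Once built, any query over a grouped dimension becomes a dictionary lookup rather than a scan, at the cost of keeping the cube up to date.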
A cube can support many complex queries, e.g., queries with a WHERE clause. Many types of data cubes have been studied to support OLAP. A recent work [108] provides a complete set of techniques for probabilistic data cubes. Cubes can also be used for ad-hoc interactive analytics over large datasets in distributed clusters [56, 59] and for exploring machine learning results [9].

3.6 Other Offline Methods
3.6.1 Bounded Resources
3.6.2 Bounded Error and Bounded Time
4 AQP on Complex Data
4.1 AQP on Spatial Data
4.1.1 Online Spatial AQP
4.1.2 Offline Spatial AQP
Since Sketch has proved effective in geographic information systems, it has been further investigated in the context of spatial AQP. Based on these considerations, [12] provides new types of queries that rely on qualitative relationships and discusses how to define query processing algorithms in metric spaces to handle qualitative information. More recent discussions of spatial databases can be found in [34].

4.2 AQP on Trajectory Data
Random index sampling (RIS) has been used to estimate the answer of a trajectory COUNT query. To further improve the scalability of the system, [73] extended RIS to the parallel random index sampling (CRIS) algorithm, which handles multiple trajectory aggregation queries while reducing time and space costs. These techniques were applied to a real large-scale user trajectory database from the “China Mobile” service provider to verify the effectiveness of the proposed sampling and estimation methods. However, this method cannot support complex queries in trajectory database systems. Designing schemes for AQP on trajectory indexing and retrieval is still an open problem. More details about trajectory queries and trajectory mining can be found in [39] and [115].

5 AQP Applications
5.1 AQP on Data Cleaning
SampleClean [68, 105] aims to address this problem: it only requires users to clean a sample of the data and utilizes the cleaned sample to process aggregation queries.
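A toy sketch of the idea (our own illustration, not SampleClean's actual estimator, though it follows the same spirit): clean only a random sample, then use the per-row corrections observed in that sample to adjust the aggregate computed over the dirty data. The data and the cleaning function here are hypothetical.

```python
import random

def corrected_avg(dirty, clean_fn, sample_size):
    """Estimate AVG over the (hypothetically) cleaned data by
    correcting the dirty average with the mean per-row correction
    observed in a cleaned random sample."""
    dirty_avg = sum(dirty) / len(dirty)
    idx = random.sample(range(len(dirty)), sample_size)
    # Average difference between cleaned and dirty values in the sample.
    correction = sum(clean_fn(dirty[i]) - dirty[i] for i in idx) / sample_size
    return dirty_avg + correction

random.seed(0)
# Hypothetical dirty data: every value inflated by 10 units of noise.
dirty = [v + 10 for v in range(1000)]
estimate = corrected_avg(dirty, clean_fn=lambda v: v - 10, sample_size=100)
print(estimate)  # 499.5, the true cleaned mean
```

Only 100 of the 1000 rows are ever cleaned, yet the estimate matches the fully cleaned average because the aggregate only needs the average correction, not every corrected row.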
SampleClean also proposes a statistical method that uses a cleaned sample to correct the bias of the query results over the dirty data. Furthermore, it can use a cleaned sample to directly estimate the query results over the cleaned data. Along the same idea, a recent work [67] efficiently cleans a sample of rows from stale materialized views and uses the cleaned samples to estimate the query results.

5.2 AQP on Data Visualization
Visualization-aware sampling (VAS) [86] guarantees high-quality visualizations with a small subset of the entire dataset. VAS divides the whole map into small blocks and uses stratified sampling to choose, in each block, a set of tuples that minimizes a visualization-inspired loss function. While existing sampling approaches minimize the error of aggregation queries, VAS focuses on a loss function that maximizes the visual fidelity of scatter plots. It consequently selects more samples in each block and provides users with a clearer visualization.

ExploreSample [107] approximates scatter plots to solve “object-centric” exploration queries. In an object-centric exploration query, one may predict the performance of an NBA basketball player using the whole dataset of all NBA players, i.e., drawing a heatmap and judging whether the player is an outlier. However, drawing such a heatmap may involve many potentially expensive aggregation queries over the entire database. Thus, ExploreSample selects a sample of the whole dataset and draws a heatmap from it, which is verified to be helpful for predicting such claims.

|  | Online/offline | Distributed/standalone | Bounded | Platform | Algorithm | Skewed data |
|---|---|---|---|---|---|---|
| BlinkDB | Offline | Distributed | \(\times \) | Hive/Hadoop (Shark) | Stratified sampling | \(\surd \) |
| Sapprox | Online | Distributed | \(\times \) | Hadoop | Distribution-aware online sampling [44] | \(\times \) |
| Approxhadoop | Online | Distributed | \(\times \) | Hadoop | Approximation-enabled MapReduce [44] | \(\times \) |
| Quickr | Online | Distributed | \(\times \) | N/A | ASALQA algorithm [61] | \(\times \) |
| SnappyData | Online | Distributed | \(\times \) | Spark and GemFire | Spark as computational engine; GemFire as transactional store | \(\times \) |
| FluoDB | Online | Distributed | \(\times \) | Spark | Mini-batch execution [111], OLA model | \(\times \) |
| XDB | Online | Standalone | \(\times \) | PostgreSQL | Wander join [72] | \(\times \) |
| Verdict | Online | Standalone | \(\times \) | Spark SQL | Database learning | \(\times \) |
| IDEA | Online | Standalone | \(\times \) | N/A | Reuse answers of past overlapping queries for new queries | \(\times \) |
| BEAS | Online | Standalone | \(\surd \) | Commercial DBMS | Approximability theorem [17] | \(\times \) |
| ABS | Online | Standalone | \(\times \) | N/A | Bootstrap | \(\times \) |
SampleAction [40] allows a user to formulate a query, and the system responds with a partial result, displaying a bar chart with confidence bounds. As the analyst waits, the system increases its sample size, narrows the confidence intervals and produces more precise results. Using a different sampling strategy, Pangloss [76] adopts the idea of Sample+Seek [31], which “samples” (uniformly or with measure-biased sampling) for big groups and “seeks” (via an index) for rare groups. It incrementally loads more records into the sample to update the bar chart until either the confidence exceeds a predefined threshold or a timeout is reached. Pangloss can also be used for heatmaps. Different from Pangloss, IFocus [62] mainly focuses on whether the comparisons between different bars are correct, e.g., if bar A is higher than bar B on the entire data, then bar A should be higher than bar B on the sample. IFocus calculates the aggregation answers to draw a bar chart and computes the probability that bar A is higher than bar B; when bar A is higher than bar B and the probability of \(A >B\) exceeds a threshold, it stops. Besides, FastMatch [74] is an end-to-end approach for interactively retrieving, from a large collection of bar charts, the visualizations most similar to a user-specified target. The primary technical contribution underlying FastMatch is HistSim, a principled sampling-based probabilistic algorithm that samples many bar charts and identifies the top-k closest ones under the \(L_1\) distance.

SampleAction [40] and IFocus can also be used for line charts. Different from these two methods, in each sampling iteration INCVISAGE [94] uses the concept of “improvement potential” to select the samples that improve the line-chart visualization most. Thus, INCVISAGE can generate approximations that use as few samples as possible for trend-line visualizations.

6 AQP Systems
6.1 Distributed AQP Systems
BlinkDB is a well-known AQP system built on top of Hadoop; it devises effective strategies to select proper (offline-generated) samples in distributed clusters to answer newly arriving queries. BlinkDB is open-sourced and publicly available. As BlinkDB generates many offline samples under the assumption that the QCS is stable over time, it does not perform well on queries whose QCS is not covered by the query workload. Sapprox [114] and Approxhadoop [44] add online sample generation at run time to overcome these shortcomings of BlinkDB. Besides, Sapprox is more flexible than BlinkDB and more efficient than Approxhadoop.

Quickr [60, 61] is designed for executing ad-hoc queries on big-data clusters and performs even better than BlinkDB, Approxhadoop and Sapprox. It does not need any pre-computation over the whole dataset spread across the clusters. In fact, in modern distributed database systems, constructing synopses for a whole dataset spread over all the clusters is very hard. Although ad-hoc queries are complex, Quickr does not need to scan the whole data many times. It has been verified to be practical in Microsoft's search engine, Bing. What's more, Microsoft also developed Now! [19], a progressive data-parallel computation framework for Windows Azure, which provides progressive SQL support over big data in Azure.

SnappyData [79, 95] is a system built on Spark which uses an in-memory solution to deliver truly interactive analytics (i.e., within a couple of seconds) when faced with large data volumes or high-velocity streams. SnappyData can exploit state-of-the-art approximate query-processing techniques and a variety of data synopses. Besides, FluoDB [111], also built on top of Spark, generalizes OLA to support general OLAP queries with arbitrarily nested aggregates.

6.2 Other AQP Database Engines
XDB [72] is a system which supports online aggregation for complex queries, including the join operator, by integrating “wander join” into the latest version of PostgreSQL. It is the most recent AQP work supporting complex queries, and it outperforms the earlier online engine DBO [32, 57]. Verdict [77] uses the database learning method [87] to benefit query processing in databases (see the details of database learning in Sect. 2.5). IDEA [66] is an interactive data exploration accelerator. It allows data scientists to explore a data source immediately, without pre-computation or knowledge of the data distribution, and supports interactive queries during the data exploration process. IDEA reuses the observations seen so far and reformulates the AQP model based on probability theory. IDEA also proposes a new type of index that helps obtain low-error answers even in rare subgroups of a dataset, without an upfront known workload.

BEAS (Boundedly Evaluable SQL) [17] is a system which can evaluate the feasibility of each query plan and select a better one. Given a sampling ratio, it can either compute the exact answer or give an approximation by accessing no more than a bounded number of tuples, using bounded evaluability theory [37]. ABS (Analytical Bootstrap) [112, 113] is a system which models Bootstrap by a probabilistic relational model to predict the distributions of AQP results. It entails a very fast computation of bootstrap-based quality measures for a general class of SQL queries.

7 Emerging Challenges and Opportunities
7.1 AQP Model
7.2 Approximate Data Visualization
ExploreSample [107] is also promising in this direction.

7.3 Smarter Query Plan
Verdict, mentioned in Sect. 2.5, is one example; more details of this technique can be found in [87]. Third, as a single SQL query often corresponds to multiple query plans, a smart data engine needs a query optimizer to select the best plan. Traditionally, a query optimizer estimates the computation cost of each query plan and chooses the one with the minimum estimated cost. Future work could concentrate on the above three aspects to generate smart query plans and thus benefit the accuracy and speed of AQP systems.

7.4 Synopsis Generation in Distributed Settings
Synopses such as Histogram, Wavelet and Sketch cannot be easily merged, so effective algorithms to merge or approximately merge them are needed. In addition, it is important to devise new types of mergeable synopses [4, 5, 15].