1 Introduction
We define a data subspace \(\mathbb {D}(\mathbf {x}_{0},\theta )\) as the convex subset of d-dim. data points (row vectors) \(\mathbf {x} = [x_{1}, \ldots , x_{d}] \in \mathbb {R}^{d}\) lying within a hypersphere (ball) with center \(\mathbf {x}_{0}\) and scalar radius \(\theta \); i.e., \(\mathbb {D}(\mathbf {x}_{0},\theta )\) contains all \(\mathbf {x} \in \mathbb {R}^{d}\) with \(||\mathbf {x}-\mathbf {x}_{0} ||_{2} \le \theta \), where \(||\cdot ||_{2}\) is the Euclidean norm; for an illustration, see Fig. 1.
Consider, for example, analysts issuing mean-value queries Q1 over a 3-dim. space \((u,x_{1},x_{2}) \in \mathbb {R}^{3}\): such a query returns the mean value y of the feature u (seismic signal; P-wave speed) over those spatial points whose projections \((x_{1},x_{2}) \in \mathbb {D}(\mathbf {x}_{0},\theta ) \subset \mathbb {R}^{2}\) (surface longitude and latitude) lie within a disk of center \(\mathbf {x}_{0}\) and radius \(\theta \). The query Q1 is central to PMA because the average y is routinely used as a linear sufficient statistic for the data subspace \(\mathbb {D}\), and it is the best linear predictor of the seismic signal output u over the region identified around the center point \(\mathbf {x}_{0}\) [32].
The query Q2 calculates the coefficients of a linear regression function within a defined data subspace. For example, in Fig. 2, consider geophysicists issuing queries Q2 over a 3-dim. space \((u,x_{1},x_{2}) \in \mathbb {R}^{3}\): such a query returns the seismic primary-wave (P-wave) velocity u-intercept (\(b_{0}\)) and the coefficients \(b_{1}\) and \(b_{2}\) for \(x_{1}\) (longitude) and \(x_{2}\) (latitude), where the \(\mathbf {x} = [x_{1},x_{2}]\) points belong to a subspace \(\mathbb {D}(\mathbf {x}_{0}, \theta ) \subset \mathbb {R}^{2}\). By estimating the linear coefficients, i.e., the parameter row vector \(\mathbf {b} = [b_{0}, b_{1}, b_{2}]\), we can interpret the relationships among the features \(\mathbf {x}\) and u and assess the statistical significance of each feature of \(\mathbf {x}\) within \(\mathbb {D}(\mathbf {x}_{0},\theta )\). The output of the Q2 query captures the dependency of u on \(\mathbf {x}\), which in our example is approximated by a 2-dim. plane \(u \approx b_{0} + b_{1}x_{1} + b_{2}x_{2}\), and quantifies how well the local linear model fits the data. Q2 is important in PMA because it supports model fitting through, e.g., piecewise linear regression (PLR) [10], and indicates how confidently linear models fit the underlying data. To better illustrate Q1 and Q2 queries, consider their corresponding SQL syntax. The mean-value query Q1 over data subspace \(\mathbb {D}(\mathbf {x}_{0},\theta )\) for the example data space \((u,x_{1},x_{2})\) shown in Fig. 2 (relation \(R(u,x_{1},x_{2})\)) restricts the points to a disk of center \(\mathbf {x}_{0} = [x_{0(1)},x_{0(2)}]\) and radius \(\theta \):
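The listing below is a minimal SQL rendering of this query (a sketch: relation R(u, x1, x2) as above; the placeholders :x01, :x02, :theta stand for the query parameters \(x_{0(1)}, x_{0(2)}, \theta \)):

```sql
-- Q1: mean value of u over the disk with center (:x01, :x02) and radius :theta
SELECT AVG(u) AS y
FROM   R
WHERE  SQRT(POWER(x1 - :x01, 2) + POWER(x2 - :x02, 2)) <= :theta;
```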
Here, SQRT(x) is the square root of real number x. Consider now the linear regression query Q2 over subspace \(\mathbb {D}(\mathbf {x}_{0},\theta )\) for the example data space \((u,x_{1},x_{2})\) in Fig. 2. Based on the XLeratorDB/statistics LINEST function in SQL Server 2008 syntax, Q2 first defines the subspace \(\mathbb {D}(\mathbf {x}_{0},\theta )\), then temporarily stores the corresponding tuples \((u,x_{1},x_{2})\) with \((x_{1},x_{2}) \in \mathbb {D}(\mathbf {x}_{0},\theta )\) in a relation \(S(u,x_{1},x_{2})\), and, finally, invokes the multivariate linear regression function LINEST over relation S:
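A sketch of these two steps (the subspace materialization is plain SQL; the exact XLeratorDB LINEST invocation is product-specific, so it is only indicated in a comment):

```sql
-- Q2, step 1: materialize the queried subspace into S(u, x1, x2)
SELECT u, x1, x2
INTO   S
FROM   R
WHERE  SQRT(POWER(x1 - :x01, 2) + POWER(x2 - :x02, 2)) <= :theta;

-- Q2, step 2: invoke the XLeratorDB LINEST table-valued function over S to
-- obtain the coefficient vector b = [b0, b1, b2] (signature omitted here).
```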
To answer Q1 and Q2, the system must access the data to establish the data subspace \(\mathbb {D}(\mathbf {x}_{0},\theta )\), and then take the average value of u in that subspace for query Q1 (e.g., the average seismic signal speed in the San Andreas, CA, region) and invoke a multivariate linear regression algorithm [32] for Q2. Queries of types Q1 and Q2 are provided by all modern PMA systems, such as Spark analytics [47] and MATLAB, and by DMS systems, e.g., XLeratorDB for Microsoft SQL Server and Oracle UTL_NLA.

1.1 Desiderata
We focus on supporting queries Q1 and Q2. The aim is to meet the following desiderata, providing answers to the following questions:
- D1 Are there linear dependencies among dimensions in unexplored data subspaces, and which are such subspaces?
- D2 If there are data subspaces where linear approximations fit well with high confidence, can the system provide these yet-unknown linear regression models efficiently and scalably to the analysts?
- D3 If in some subspaces linear approximations do not fit well w.r.t. analysts' needs, can the system provide fitting models through piecewise local linear approximations?
- D4 A solution must meet scalability, efficiency, and accuracy desiderata as well.
Analysts issue queries Q1 and Q2 to the DMS. In Fig. 3(upper), a Q2 query issued over the data subspace \(\mathbb {D}(x_{0},\theta )\) will calculate the intercept \(b_{0}\) and slope \(b_{1}\) of the linear approximation \(u \approx \hat{g}(x) = b_{0} + b_{1}x\) (the green line l) over those \(x \in \mathbb {D}(x_{0},\theta )\). Evidently, such a line shows a very coarse and unrepresentative dependency between output u and input x, since u and x do not linearly depend on each other within the entire data subspace \(\mathbb {D}(x_{0},\theta )\). The point is that we should obtain a finer-grained and more accurate dependency between output u and input x. The principle of local linearity [26] states that linear approximations of the underlying data function over certain data subspaces fit the globally nonlinear function better than a single linear model over the entire data subspace of interest. In Fig. 3(upper), we observe four local linear approximations \(l_{1}, \ldots , l_{4}\) in the data subspace. Therefore, it would be preferable if, as a result of query Q2, the analysts were provided with a list of the local line segments \(\mathcal {S} = \{l_{1}, \ldots , l_{4}\}\), a.k.a. piecewise linear regression; these 'local' segments better approximate the behavior of output u. Moreover, in Fig. 3(lower), the underlying data function \(u = g(x_{1},x_{2})\) in the 3-dim. data space does not exhibit linearity over the entire \((x_{1},x_{2})\) plane. We can observe how the global linear relationship \(\hat{g}(x_{1},x_{2})\) cannot capture the very local statistical dependencies between \(\mathbf {x} = [x_{1},x_{2}]\) and u, which are better captured in certain data subspaces by certain local line segments \(\hat{g}_{k}(x_{1},x_{2})\). A toy illustration of this idea appears below.
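For intuition only, the following sketch fits one least-squares line per segment with *given* breakpoints (our methodology learns the segment boundaries; here they are fixed by assumption):

```python
import numpy as np

def fit_plr(x, u, breakpoints):
    """Fit one least-squares line per segment; return a list S of (b0, b1)."""
    segments = []
    edges = [x.min(), *breakpoints, x.max() + 1e-9]
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (x >= lo) & (x < hi)
        X = np.column_stack([np.ones(mask.sum()), x[mask]])  # [1, x] design
        (b0, b1), *_ = np.linalg.lstsq(X, u[mask], rcond=None)
        segments.append((b0, b1))
    return segments

rng = np.random.default_rng(0)
x = rng.uniform(0, 4, 2000)
u = np.abs(x - 2) + 0.05 * rng.standard_normal(x.size)  # nonlinear g(x)
# Two local segments recover slopes ~ -1 and ~ +1, unlike a single global line.
print(fit_plr(x, u, breakpoints=[2.0]))
```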
When a query Q2 is issued, it is not known whether the data function g behaves with the same linearity throughout the entire \(\mathbb {D}(\mathbf {x}_{0},\theta )\) or not, nor within which subspaces, if any, g changes its trend and u and \(\mathbf {x}\) exhibit linear dependencies. Thus, the desiderata D2 and D3 focus on learning the boundaries of these local subspaces within \(\mathbb {D}(\mathbf {x}_{0},\theta )\) and, within each local subspace, discovering the linear dependency (segment) between output u and input \(\mathbf {x}\). This would arm analysts and data scientists with much more accurate knowledge of how the data function \(g(\mathbf {x})\) behaves within a given data subspace \(\mathbb {D}(\mathbf {x}_{0},\theta )\). Hence, decisions on further data exploration w.r.t. complex model selection and/or validation can be taken by the analysts. We support queries Q1 and Q2 in a way that is highly accurate and insensitive to the sizes of the underlying data spaces in number of data points, and thus scalable. The essence of the novel idea we put forward rests on exploiting previously executed queries and their answers obtained from the DMS/PMA system to train a model and then use that model to (summarized schematically after this list):
- approximate the underlying data function g over the analysts' queried data subspaces by estimating the unknown K segments and their unknown local model coefficients (PLR segments); this has to be achieved based only on the issued queries and their answers, since no data access is provided to analysts by the DMS;
- predict the list \(\mathcal {S}\) of the linear models (segments) that model the PLR estimator data function \(\hat{g}\) minimizing the MSE in (3). Such models best explain (fit) the underlying data function g over a given data subspace \(\mathbb {D}(\mathbf {x}_{0},\theta )\);
- predict the answer y of any unseen mean-value query over a data subspace \(\mathbb {D}(\mathbf {x}_{0},\theta )\);
- predict the output data value \(\hat{u}\) given an unseen input \(\mathbf {x}\) based on the approximate data function \(\hat{g}\).
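Schematically, training over the observed query–answer pairs \((\mathbf {q}_{i}, y_{i})\) yields two estimators (formalized in Sect. 3.2):
$$\begin{aligned} \hat{f}: \mathbb {Q} \rightarrow \mathbb {R}, \; \hat{y} = \hat{f}(\mathbf {q}); \qquad \hat{g}: \mathbb {R}^{d} \rightarrow \mathbb {R}, \; \hat{u} = \hat{g}(\mathbf {x}), \end{aligned}$$
where \(\hat{f}\) is learned directly from the pairs and \(\hat{g}\) is subsequently derived from \(\hat{f}\), in both cases without data access.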
1.2 Challenges and organization
Over time, analysts will have issued a set of queries (e.g., \(\mathcal {Q} = \{\)Q\(_{1}\), Q\(_{2}, \ldots ,\) Q\(_{n}\}\)), and the system will have produced responses (e.g., \(y_{1}, y_{2}, \ldots , y_{n}\) for Q1 queries). Our key idea is to inject a novel statistical learning model and novel query processing algorithms in between users and the DMS, which monitor queries and responses and learn to associate a query with its response. After training, say after the first \(m < n\) queries \(\mathcal {T} = \{\)Q\(_{1}, \ldots ,\) Q\(_{m}\}\), then for any new/unseen query Q\(_{t}\) with \(m < t \le n\), i.e., Q\(_{t} \in \mathcal {V} = \mathcal {Q} \setminus \mathcal {T} = \{\)Q\(_{m+1}, \ldots ,\) Q\(_{n}\}\), our model approximates the data function g with an estimator function \(\hat{g}\) through a list \(\mathcal {S}\) of local linear regression coefficients (line segments) that best fits the actual and unknown function g given the query Q\(_{t}\)'s data subspace, and predicts its response \(\hat{y}_{t}\) without accessing the DMS. The efficiency and scalability benefits of our approach are evident. Computing the exact answers to queries Q1 and Q2 can be very time-consuming, especially for large data subspaces; if our model and algorithms can deliver highly accurate answers, query processing times will be dramatically reduced. Scalability is also ensured for two reasons. Firstly, in the data dimension, as query Q1 and Q2 executions (after training) do not involve data accesses, even dramatic increases in DB size do not impact query execution. Secondly, in the query-throughput dimension, avoiding DMS-internal resource utilization (that would be required if all Q1 and Q2 queries were executed over the DMS data) saves resources that can be devoted to supporting larger numbers of queries at any given point in time. Viewed from another angle, our contributions aim to exploit all the work performed by the DMS engine when answering previous queries, in order to facilitate accurate answers to future queries efficiently and scalably. The challenges are to:
- Identify the number and boundaries of the data subspaces with local linearities and deliver the local linear approximations for each subspace identified, i.e., predict the list \(\mathcal {S}\) for an unseen query Q2, thus obviating the need to execute Q2. This challenge is compounded by the fact that the boundaries of these data subspaces are unknown and cannot be determined even if we could scan all of the data, which would in any case be inefficient and unscalable.
- Predict the average value of the answer y for an unseen query Q1, thus obviating the need to execute Q1.
2 Related work and contribution
2.1 Related work
Statistical and analytics software packages typically execute outside the DMS and do not directly support Q1 and Q2 queries over DMS-resident data. So, if data are already in a DMS, they would need to be moved back and forth between external analytics environments and the DMS, resulting in considerable inconvenience and performance overheads (if at all possible for big datasets). At any rate, modern DMSs should provide analysts with rich support for PMA. Any system supporting Q1 and Q2 queries can serve as the DMS within Fig. 4. However, in the big data era, exact Q1, Q2 computations leave much to be desired in efficiency and scalability, as the system must first execute the selection, establishing the data subspace per query, and then access all tuples in that subspace to compute Q1, Q2. Existing piecewise regression methods could compute the answer to a Q2 query, going through a series of stages: partitioning the entire data space into clusters, assigning each data point to one of these clusters, and fitting a linear regression function to each of the clusters. However, data clustering cannot automatically guarantee that the within-cluster nonlinearity of the data function g is captured by a local linear fit. Hence, all these methods are iterative, repeating the above stages until convergence to minimize the residual estimation error of all approximated local linear regression functions. For instance, the method in [10] clusters and regresses the entire data space against K clusters with a complexity of \(O(K(n^{2}d + nd^{2}))\). Similarly, the incremental adaptive controller method [20] using self-organizing structures requires \(O(n^{2}dT)\) for training purposes. The same holds for the methods [12, 19, 20, 28] that combine iterative clustering and classification for piecewise regression, also requiring \(O(n^{2}dT)\). These complexities indicate the high cost of computing exact answers with linear regression methods. As all these methods derive regression models over the whole data space, e.g., over trillions of points, the scalability and efficiency desiderata are missed, as Sect. 8 will showcase. In contrast, our methodology supports predicting the answers of mean-value Q1 queries, approximating the underlying data function g based on (multiple) local linear models for regression Q2 queries, and predicting the output data values given unseen inputs by estimating the underlying data function. It does so while achieving high prediction accuracy and goodness of fit, after training, without executing Q1 and Q2, thus without accessing data. This ensures a highly efficient and scalable solution, which is independent of the data sizes, as Sect. 8 will show.

2.2 Contribution

Our contribution is a query-driven statistical learning methodology that, given an unseen data subspace \(\mathbb {D}(\mathbf {x}_{0},\theta )\), can:
- deliver a PLR-based approximation of the data function g over different unseen subspaces that best explains the underlying function g within \(\mathbb {D}(\mathbf {x}_{0},\theta )\),
- predict the data output value \(\hat{u} \approx g(\mathbf {x})\) for each unseen data input \(\mathbf {x} \in \mathbb {D}(\mathbf {x}_{0},\theta )\),
- predict the average value y of the data output \(u=g(\mathbf {x})\) with \(\mathbf {x} \in \mathbb {D}(\mathbf {x}_{0},\theta )\).
- A statistical learning methodology for query-driven PLR approximations of data functions over multi-dimensional data spaces. This methodology indirectly extracts information about the unknown data function g only by observing and learning the mapping between aggregation queries and their answers.
- A joint optimization algorithm for minimizing the PLR data approximation error and answer prediction error in light of quantizing the query space.
- Convergence analyses of the methodology including variants for supporting partial and global convergence.
- Mathematical analyses of the query-driven PLR approximation and prediction error bounds.
- Mean-value and data-value prediction algorithms for unseen Q1 and Q2 queries.
- A PLR data approximation algorithm over data subspaces defined by unseen Q2 queries.
- Sensitivity analysis and comparative performance assessment against PLR and multivariate linear regression algorithms from the literature in terms of scalability, prediction accuracy, data-value prediction error, and goodness of fit of the PLR data approximation.
3 Problem analysis
3.1 Definitions
A Q1 query over a dataset \(\mathcal {B}\) returns the average of the outputs \(u_{i} = g(\mathbf {x}_{i})\) whose corresponding input vectors \(\mathbf {x}_{i} \in \mathbb {D}(\mathbf {x},\theta )\), i.e.,
$$\begin{aligned} y = \frac{1}{n_{\theta }(\mathbf {x})}\sum _{i: \mathbf {x}_{i} \in \mathbb {D}(\mathbf {x},\theta )} u_{i}, \end{aligned}$$
where \(n_{\theta }(\mathbf {x})\) is the number of data points in \(\mathbb {D}(\mathbf {x},\theta )\).

3.2 Problem formulation
- CH1: predict the aggregate output outcome \(\hat{y}\) of a random query \(\mathbf {q} = [\mathbf {x}, \theta ]\). Given an unknown query function \(f: \mathbb {Q} \subset \mathbb {R}^{d+1} \rightarrow \mathbb {R}\), which maps a query \(\mathbf {q} = [\mathbf {x},\theta ] \mapsto y\), we seek a query-PLR estimate \(\hat{f} \in \mathcal {L}_{K}\) to predict the actual answer \(y = f(\mathbf {q}) = f(\mathbf {x},\theta )\) for an unseen query \(\mathbf {q}\), i.e., \(\hat{y} = \hat{f}(\mathbf {q}) = \hat{f}(\mathbf {x},\theta )\). The challenge is:$$\begin{aligned} \hat{f} = \arg \min _{f \in \mathcal {L}_{K}}\text{ MSE }(f). \end{aligned}$$
- CH2: identify the local linear approximations of the unknown data function \(u=g(\mathbf {x})\) over the data subspaces \(\mathbb {D}(\mathbf {x},\theta )\) defined by unseen queries \(\mathbf {q} = [\mathbf {x},\theta ]\). Based on the query-PLR estimate \(\hat{f}\) we seek a statistical learning methodology \(\mathcal {F}\) to extract a data-PLR estimate \(\hat{g} \in \mathcal {L}_{K}\) from the query-PLR estimate \(\hat{f}\), notated by \(\hat{g} = \mathcal {F}(\hat{f})\) to fit the data function g. The challenge is:$$\begin{aligned} \hat{g} = \arg \min _{g' \in \mathcal {L}_{K}}\{ \text{ MSE }(g')|g' = \mathcal {F}(\hat{f})\}. \end{aligned}$$
- CH3: predict the data output \(\hat{u}\) of a random input data vector \(\mathbf {x}\) based on the data-PLR estimate \(\hat{g}\), i.e., \(\hat{u} = \hat{g}(\mathbf {x})\).
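In both challenges, MSE is understood as the expected squared prediction error over the corresponding pairs; e.g., for CH1 (cf. the EPE in (5)):
$$\begin{aligned} \text{ MSE }(f') = \mathbb {E}\big [(y - f'(\mathbf {q}))^{2}\big ], \end{aligned}$$
and analogously \(\text{ MSE }(g') = \mathbb {E}[(u - g'(\mathbf {x}))^{2}]\) for CH2.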
3.3 Preliminaries
3.3.1 Incremental learning and stochastic gradient descent
3.3.2 Adaptive vector quantization
4 Solution fundamentals
4.1 Methodology overview
4.2 Query local linear mapping
- The local intercept, with two components: the local expectation of answer y, i.e., \(\mathbb {E}[y] = f_{k}(\mathbb {E}[\mathbf {x}],\mathbb {E}[\theta ])\), notated by the scalar coefficient \(y_{k}\); and the local expectation query \(\mathbb {E}[\mathbf {q}] = [\mathbb {E}[\mathbf {x}],\mathbb {E}[\theta ]]\) notated by the vectorial coefficient \(\mathbf {w}_{k} = [\mathbf {x}_{k},\theta _{k}] \in \mathbb {Q}_{k}\), with \(\mathbf {x}_{k} = \mathbb {E}[\mathbf {x}]\) and \(\theta _{k} = \mathbb {E}[\theta ]\) such that \([\mathbf {x},\theta ] \in \mathbb {Q}_{k}\). Hereinafter, \(\mathbf {w}_{k}\) is referred to as the prototype of the query subspace \(\mathbb {Q}_{k}\).
- The local slope \(\mathbf {b}_{k} = [\mathbf {b}_{X,k},b_{\varTheta ,k}]\) of \(f_{k}\) over \(\mathbb {Q}_{k}\), which denotes the gradient \(\nabla f_{k}(\mathbb {E}[\mathbf {x}],\mathbb {E}[\theta ])\) of \(f_{k}\) at the local expectation query \(\mathbf {w}_{k}\).
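Combining the two coefficient sets, each query-LLM can be written, as a sketch consistent with the definitions above, as a first-order expansion around the prototype:
$$\begin{aligned} f_{k}(\mathbf {q}) \approx y_{k} + \mathbf {b}_{k}(\mathbf {q} - \mathbf {w}_{k})^{\top } = y_{k} + \mathbf {b}_{X,k}(\mathbf {x} - \mathbf {x}_{k})^{\top } + b_{\varTheta ,k}(\theta - \theta _{k}). \end{aligned}$$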
4.3 Our statistical learning methodology
4.3.1 Joint quantization–regression optimization for query-LLMs
4.3.2 Data-LLM function derivation from query-LLM function
5 Query-driven statistical learning methodology
6 Data and query functions approximation and prediction
We now describe how we predict the answers of unseen queries Q1 and Q2 from the test set \(\mathcal {V}\); see also Fig. 4. We adopt the principle of nearest-neighbors regression for prediction [32]. The notion of neighborhood here is materialized by the overlapping of an unseen query with the query prototypes in the quantized space \(\mathbb {Q}\) (see Example 4, Fig. 6). By Definition 7, the queries \(\mathbf {q} = [\mathbf {x}, \theta ]\) and \(\mathbf {q}' = [\mathbf {x}', \theta ']\) overlap if the condition \(\mathcal {A}(\mathbf {q},\mathbf {q}') = \texttt {TRUE}\) holds. To quantify a degree of overlapping between those queries, represented as hyperspheres in the \((d+1)\)-dim. space, we require that the two spheres partially intersect. Consider the ratio between the \(L_{2}\) distance of the centers of the data subspaces \(\mathbb {D}(\mathbf {x},\theta )\) and \(\mathbb {D}(\mathbf {x}', \theta ')\) and the sum of their radii, i.e., \(\frac{||\mathbf {x}-\mathbf {x}' ||_{2}}{\theta +\theta '}\). This ratio takes values in [0, 1] in the case of overlapping, reaching unity when the two spheres just touch each other. In the concentric case, the degree of overlapping should also take into consideration the remaining area from this perfect inclusion. We define the degree of overlapping for two queries as the normalized ratio \(\delta (\mathbf {q},\mathbf {q}') \in [0,1]\). Predictions for the mean-value query Q1 and the linear regression query Q2 are based on the neighborhood set \(\mathcal {W}(\mathbf {q})\) for an unseen query \(\mathbf {q}\): we predict the answer for query Q1 (see Algorithm 2) and retrieve the data regression plane coefficients \(\mathcal {S}\) of the data-LLM functions \(g_{i},g_{k},g_{l}\) from the query-LLM functions \(f_{i}, f_{k}, f_{l}\), respectively, for query Q2 (see Algorithm 3). A minimal sketch of the overlap degree computation follows.
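A minimal sketch, assuming the normalization \(\delta = 1 - \frac{||\mathbf {x}-\mathbf {x}' ||_{2}}{\theta + \theta '}\) clipped to [0, 1]; the paper's exact normalized ratio (including the concentric-inclusion refinement) may differ:

```python
import numpy as np

def overlap_degree(x, theta, x2, theta2):
    """Degree of overlap of hyperspherical queries q = (x, theta), q' = (x2, theta2).

    Assumption: delta = 1 - ||x - x'|| / (theta + theta'), clipped to [0, 1];
    1 for concentric queries, 0 when the spheres at most touch externally.
    """
    ratio = np.linalg.norm(np.asarray(x) - np.asarray(x2)) / (theta + theta2)
    return float(np.clip(1.0 - ratio, 0.0, 1.0))
```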
6.1 Query Q1: mean-value aggregate prediction
6.2 Query Q2: PLR-based data function approximation
- (Case 1) either partially overlap with several identified convex data subspaces \(\mathbb {D}_{k}\) (corresponding to query subspaces \(\mathbb {Q}_{k}\)), or
- (Case 2) be contained or contain a data subspace \(\mathbb {D}_{k}\), or
- (Case 3) be outside of any data subspace \(\mathbb {D}_{k}\).
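Across these cases, prediction reduces to combining the query-LLMs of the overlapping prototypes. A sketch for the mean-value case, reusing overlap_degree from above (the exact weighting of Algorithm 2 may differ):

```python
import numpy as np  # uses overlap_degree from the sketch above

def predict_mean_value(q, prototypes):
    """Overlap-weighted prediction of a Q1 answer for an unseen query q.

    q: (x, theta); prototypes: list of ((x_k, theta_k), y_k, b_k), i.e., the
    query prototype w_k, local intercept y_k, and local slope b_k.
    A sketch of nearest-neighbors regression over W(q).
    """
    x, theta = q
    num = den = 0.0
    for (x_k, theta_k), y_k, b_k in prototypes:
        delta = overlap_degree(x, theta, x_k, theta_k)
        if delta > 0.0:  # prototype k belongs to the neighborhood W(q)
            dq = np.append(np.asarray(x) - np.asarray(x_k), theta - theta_k)
            num += delta * (y_k + float(np.dot(b_k, dq)))  # local query-LLM
            den += delta
    return num / den if den > 0.0 else None  # None: q overlaps no prototype
```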
7 Convergence analysis and complexity
7.1 Global convergence analysis
7.2 Partial convergence analysis
- Case A If the consensual ratio \(\frac{\ell }{\ell + \kappa } \ge r\), i.e., more than a fraction r of the query prototypes in \(\mathcal {W}(\mathbf {q}_{t})\) have locally converged, with \(r \in (0.5, 1)\), then two options are available (sketched in code after this list):
- Case A.I The model predicts and delivers to the analysts the answer based only on those \(\ell \) query prototypes which have converged and, then, executes the query to update the \(\kappa \) not-yet-converged query prototypes, aligning them with the model's convergence mode. In this case, the analysts are delivered a predicted answer whose degree of confidence is regulated through the consensual ratio r. The mean-value prediction and PLR data approximation are achieved as described in Algorithms 2 and 3 by replacing \(\mathcal {W}(\mathbf {q})\) with the locally converged query prototypes \(\mathcal {C}(\mathbf {q})\) in (32). After the query execution, the query-LLM prototypes from the unconverged set \(\mathcal {U}(\mathbf {q})\) in (32) are updated as described in Algorithm 1. Obviously, if the consensual ratio \(\frac{\ell }{\ell + \kappa } = 1\), then there is no such intermediate phase.
- Case A.II The model predicts and delivers to the analysts the answer based only on those \(\ell \) query prototypes which have converged, and does not execute the query; thus, no update is performed for the \(\kappa \) remaining query prototypes. The mean-value prediction and PLR data approximation are achieved as described in Algorithms 2 and 3 by replacing \(\mathcal {W}(\mathbf {q})\) with the locally converged query prototypes \(\mathcal {C}(\mathbf {q})\) in (32). This obviously delays global convergence but reduces the number of queries executed before convergence. This option is only preferable when most of the incoming queries focus on specific data subspaces and not on the entire data space. In other words, there is no point in the entire model converging globally, in order to transit from the training phase to the prediction phase, if most of the queries are issued over very specific data subspaces. In the extreme case, the model could delay its convergence substantially if more than 50% of the query prototypes are involved in the overlapping sets of all the incoming queries. To alleviate this, our model creates new prototypes (incrementally) only when there is sufficient interest in a specific data subspace, as discussed in Sect. 5, adopting the principles of adaptive resonance theory [30].
- Case B Otherwise, i.e., if the consensual ratio \(\frac{\ell }{\ell + \kappa } < r\), the model acts as usual in the training phase: it first executes the query and delivers the actual answer to the analysts, and then, based on this actual answer, it updates the prototypes as discussed in Sect. 5.
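The following sketch captures this decision logic (hypothetical helper callables; not the paper's code):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Prototype:
    converged: bool  # local convergence flag of a query prototype in W(q)

def handle_query(q, neighborhood: List[Prototype],
                 predict: Callable,   # Algorithms 2/3 over the converged set C(q)
                 execute: Callable,   # exact query execution over the DMS
                 update: Callable,    # prototype update (Algorithm 1)
                 r: float = 0.7, execute_on_partial: bool = True):
    converged = [p for p in neighborhood if p.converged]
    ratio = len(converged) / len(neighborhood)
    if ratio >= r:                                   # Case A
        answer = predict(q, converged)
        if execute_on_partial and ratio < 1.0:       # Case A.I: refresh U(q)
            update([p for p in neighborhood if not p.converged], q, execute(q))
        return answer                                # Case A.II: skip execution
    y = execute(q)                                   # Case B: training mode
    update(list(neighborhood), q, y)
    return y
```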
7.3 Computational complexity
For a mean-value query Q1 and a linear regression query Q2, we require O(dK) time to calculate the neighborhood set \(\mathcal {W}\) and deliver the query-LLM functions, respectively; this is independent of the data size, thus achieving scalability. We also require O(dK) space to store the query prototypes and the query-LLM coefficients. The derivation of the data-LLMs is then O(1), given that we have identified the query-LLMs for a given linear regression query.

8 Performance evaluation
8.1 Performance metrics
8.1.1 Predictability
Consider an unseen Q1 query \(\mathbf {q} = [\mathbf {x},\theta ]\). Based on the EPE in (5), the A1 metric is the root-mean-square error (RMSE):
$$\begin{aligned} e = \Big (\frac{1}{|\mathcal {V}|}\sum _{\mathbf {q} \in \mathcal {V}}(y - \hat{y})^{2}\Big )^{1/2}, \end{aligned}$$
computed over the unseen Q1 queries in \(\mathcal {V}\).

8.1.2 Goodness of fit
Given an unseen Q2 query \(\mathbf {q} = [\mathbf {x},\theta ]\) defined over the data subspace \(\mathbb {D}(\mathbf {x},\theta )\), we evaluate how well our methodology approximates the data function g through data-LLM functions, compared with the REG model and the optimal PLR data approximation model over the same data subspace \(\mathbb {D}\). For goodness of fit we adopt two metrics: the Fraction of Variance Unexplained (FVU), denoted by s, and the Coefficient of Determination (CoD) \(R^{2}\) [26]. FVU indicates the fraction of the variance of the dependent data output variable u which cannot be explained, i.e., which is not correctly predicted from the explanatory data input variable \(\mathbf {x}\). Given a data subspace \(\mathbb {D}(\mathbf {x},\theta )\), consider the data pairs \((\mathbf {x}_{i},u_{i}): \mathbf {x}_{i} \in \mathbb {D}\), \(i \in [n_{\theta }(\mathbf {x})]\), with outputs \(u_{i} = g(\mathbf {x}_{i})\) and approximations \(\hat{u}_{i}\) for each input \(\mathbf {x}_{i}\). The sum of squared residuals (SSR) and the total sum of squares (TSS) over \(\mathbb {D}(\mathbf {x},\theta )\) are then defined as:
$$\begin{aligned} \text{ SSR } = \sum _{i=1}^{n_{\theta }(\mathbf {x})}(u_{i} - \hat{u}_{i})^{2}, \quad \text{ TSS } = \sum _{i=1}^{n_{\theta }(\mathbf {x})}(u_{i} - \bar{u})^{2}, \end{aligned}$$
where \(\bar{u}\) is the mean of the outputs over \(\mathbb {D}(\mathbf {x},\theta )\); then \(s = \text{ SSR }/\text{ TSS }\) and \(R^{2} = 1 - s\).

8.2 Experimental setup
8.2.1 Real and synthetic datasets
To evaluate the A1 metric for Q1 queries, the A2 metric for data output predictions, and the FVU and CoD metrics for Q2 queries, we used two real datasets, R1 from [45] and R3 from [16], and a synthetic dataset referred to as R2.
Q1 and Q2 analytics queries (prediction of the mean value and model fitting).
We evaluate the Q1 prediction accuracy of the aggregate answer y (A1 metric), corresponding to the average of the dimensions of the first three and four PCs in R1 and in R3, respectively, and to the average data output u of the Rosenbrock function in R2. The PLR approximation of the data function g regarding Q2 queries over the R1, R2, and R3 datasets is conducted over data dimensions \(d \in \{2,3,5,6\}\) for the metrics FVU, CoD, and the data output prediction accuracy metric A2. For scalability and efficiency, our method is compared against PostgreSQL with a B-tree index on input vector \(\mathbf {x}\) (\(d \in \{2,5,6\}\) over Q1 queries, \(d = 2\) over Q2 queries) and MATLAB (\(d = 5\) and \(d = 6\)) over Q2 queries, using the regress function on the server and the ARESLab tool for PLR data approximation.
8.2.2 Query workloads, training and testing sets
8.3 Model training and convergence
Parameters | Range/value |
---|---|
Data dimensionality d | \(\{2, 3, 5, 6\}\) |
Real dataset R1 [45] | \(15\cdot 10^{6}\) vectors in \([0,1]^{d}\) |
Synthetic dataset R2 | \(10^{10}\) Rosenbrock in \([-10, 10]^{d}\) |
Real dataset R3 [16] | \(10\cdot 10^{6}\) vectors in \([0,1]^{d}\) |
Vigilance coefficient a | [0.05, 1] |
Consensual threshold r | 0.7 |
Convergence threshold \(\gamma \) | 0.01 |
Training dataset size \(|\mathcal {T}|\) | \([10^{3}, \ldots , 10^{4}]\) |
Testing dataset size \(|\mathcal {V}|\) | \([10^{3}, \ldots , 2 \cdot 10^{4}]\) |
Initial learning rate \(\eta _{0}\) | 0.5 [14] |
Query center/point | Uniform vectors in \([0,1]^{d}\) |
Query radius \(\theta \) | Gaussian values \(\mathcal {N}(\mu _{\theta },\sigma _{\theta }^{2})\) |
Query mean radius \(\mu _{\theta }\) | [0.01, 0.99] |
Query radius dev. \(\sigma _{\theta }\) | 0.01 |
The total training time, including Q1 execution time and model update time, is (0.41, 0.36, 2.38) h for the R1, R3, and R2 datasets, respectively. This should not be perceived as an overhead of our approach, as 99.62% of the training time is devoted to executing the queries over the DMS/statistical system, which cannot be avoided anyway, even in the typical case shown in Fig. 4. Any traditional approach would thus also pay 99.62% of this cost. This only affects how early our approach switches to using the trained model rather than executing the queries against the system. Our experiments show that excellent-quality results can be produced using a reasonable number of past executed queries for training purposes. Obviously, this can be tuned by setting different model convergence threshold \(\gamma \) values. We set \(\gamma = 0.01\), where \(\varGamma \) is (stochastically) trapped around 0.0046, with deviation 0.0023 in R1 and 0.0012 in R3. In R2, \(\varGamma \) is strictly less than \(\gamma \) for \(|\mathcal {T}| > 5300\).
8.4 Evaluation of Q1 query: predictability and scalability
We evaluate the prediction accuracy for unseen Q1 queries using the generated testing set \(\mathcal {V}\) (see Sect. 8.2.2). For different quantization coefficient a values, our model identifies subspaces where the function \(f(\mathbf {x},\theta )\) behaves almost linearly; thus, the query-LLM functions approximate such regions with high accuracy. Interestingly, when the query space is quantized with a small coefficient a value, i.e., at a fine-grained resolution, high accuracy (low RMSE e values) is achieved even in the presence of clearly nonlinear dependencies among dimensions. This is due to the fact that, with small coefficient a values, we focus on very specific query subspaces where linear approximations suffice to approximate the query function f. This is expected to result in a high number of prototypes K, in order to build many query-LLM functions that capture all the possible nonlinearities of the query function f, as discussed below.
Regarding the scalability and efficiency of Q1, Fig. 24(upper) shows, in log scale, the average Q1 execution time over the dataset R2 for LLM (with quantization coefficient \(a=0.25\)), corresponding to \(K=92\) and \(K=450\) query prototypes for dimensions \(d \in \{2,5\}\), respectively. Our method requires just 0.18 ms per query over massive data spaces in its prediction phase, offering up to five orders of magnitude speedup (0.18 ms vs up to \(10^{5}\) ms/query). This is expected, since the LLM-based model predicts the Q1 outputs and does not execute the query over the data during the prediction phase, while achieving high prediction accuracy.
8.5 Evaluation of Q2 query: PLR data approximation and scalability
We evaluate the goodness of fit for unseen Q2 queries using our query-/data-LLMs model against the REG and PLR models, and show the statistical significance of the derived accuracy metrics. The linear/nonlinear behavior of the data function g is interpreted via the variance-explanation and model-fitting metrics, i.e., the fraction of variance unexplained (FVU) and the coefficient of determination (CoD), against the quantization resolution coefficient a and the number of model prototypes K. Figures 20 and 21(upper) show the sum of squared residuals (SSR) between the actual and the predicted answers for the data-LLMs and the REG model with \(d \in \{2,5,6\}\) over the datasets R1, R2, and R3, with \(p < 0.05\). The corresponding scalability figures show the average Q2 execution time over the datasets R2 and R3, respectively, for the data-LLM model (\(a=0.25\), i.e., \(K=(92,450)\) for \(d=(2,5)\)) through Algorithm 3, the REG model from PostgreSQL (\(d=2\)) (REG-DBMS), the REG model from MATLAB (\(d=5\)) (REG-MATLAB), and the optimal PLR, against dataset size. The derived results are statistically significant with \(p<0.05\). Our model is highly scalable (note the flat curves) in both datasets and highly efficient, achieving 0.56 ms/query and 0.78 ms/query (even for massive datasets), up to six orders of magnitude better than the REG and PLR models for the R2 and R3 datasets, respectively. The full picture is that our model provides ultimate scalability, by being independent of the size of the dataset, and many orders of magnitude higher efficiency, while ensuring goodness of fit (CoD, FVU) similar to that of PLR.