Both RGSA and RLSA sample randomly from the original space, and we can use statistical analysis method to calculate errors. Confidence interval is an important component of approximation algorithms and indicates the proximity of each approximate result to extra result. Here we give the formulae related to the estimator and confidence interval of SUM.
Consider a window
W containing
m tuples, denoted by
\(t_1,t_2,\ldots ,t_m\). Suppose
v(
i) is an auxiliary function whose result is equal to
m times the value of
\(t_i\) when applied to
\(t_i\). Then we define a new function
\(F \left( f \right)\), and the extra result
\(\alpha\) of sum function is equal to
\(F \left( v \right)\).
$$\begin{aligned} F \left( f \right) = \frac{1}{m}\sum _{i=1}^{m}f(i) \end{aligned}$$
(1)
Now we sample tuples from window
W and get a sequence of tuples, denoted as
\(T_1,T_2,\ldots ,T_n\), after
n tuples are selected. Then estimated result
\(\overline{Y_n}\) can be calculated by using Eq. (
2).
\(\overline{Y_n}\) is the average value of the samples and is equal to
\(T_n \left( v \right)\).
$$\begin{aligned} T_n \left( f \right) = \frac{1}{n}\sum _{i=1}^{n}f(T_i) \end{aligned}$$
(2)
The formula of sample variance is shown as Eq. (
3).
\(T_n(f)\) is the average value of the samples. So the sample variance
\(\widetilde{\sigma ^{2}}\) is equal to
\(T_{n,2}(v)\).
$$\begin{aligned} T_{n,q} \left( f \right) = \frac{1}{n-1}\sum _{i=1}^{n}(f(T_i) - T_n(f))^{q} \end{aligned}$$
(3)
According to central limit theorem, we can conclude that the left-hand side follows normal distribution with mean 0 and variance 1.
$$\begin{aligned} \frac{\sqrt{n}(\overline{Y_n}-\alpha )}{\sqrt{T_{n,2}(v)}}\Rightarrow N(0,1) \end{aligned}$$
(4)
After we get
\(\overline{Y_n}\) and
\(\widetilde{\sigma ^{2}}\), we have
$$\begin{aligned} P\left\{ \left| \overline{Y_n} - \alpha \right| \le \varepsilon \right\} \approx 2\Psi \left( \frac{\varepsilon \sqrt{n}}{\sqrt{T_{n,2}(v)}}\right) -1 \end{aligned}$$
(5)
where
\(\Psi\) is the cumulative distribution function of the
\(\mathcal {N}(0,1)\) random variable. Finally, we define
\(z_p\) as the
\(\frac{p+1}{2}\)-quantile of the normal distribution function and set Eq. (
5) equal to
p. Then a confidence interval is computed as
$$\begin{aligned} \varepsilon _n = \frac{z_p \widetilde{\sigma }}{\sqrt{n}} \end{aligned}$$
(6)
Other aggregation functions, such as COUNT, AVG, VAR, have the similar proof process. And [
13,
15] give more details about the proofs of a few common aggregation functions. In addition, Haas and Hellerstein [
28] proved that we could estimate the aggregate by using a non-independent and uniform sample method.