In this section, we propose two algorithms to retrieve the relative kernel dataset for continuous queries in IoT systems: the relative kernel dataset retrieving (RKDR) algorithm and the piecewise linear fitting-based relative kernel dataset retrieving (PLF-RKDR) algorithm. The RKDR algorithm is described in detail in Algorithm 1, and the PLF-RKDR algorithm in Algorithm 3. The RKDR algorithm suits IoT systems where a simple linear correlation can reveal the relationship between the sensory data and the continuous queries, while the PLF-RKDR algorithm is better suited to IoT systems where the correlation between an attribute and a continuous query is not merely linear.
4.1 Relative kernel dataset retrieving algorithm
The relative kernel dataset retrieving (RKDR) algorithm first reduces redundancy among attributes by measuring the linear correlations between them. The Pearson correlation coefficient is then applied to measure the correlation between each attribute and the given continuous query. The RKDR algorithm consists of the following two steps.
Step 1. Calculate the candidate relative kernel dataset through reducing the redundancies in attributes.
The candidate relative kernel dataset is denoted as \(\mathcal {X}\). At first, the candidate relative kernel dataset is initialized as the attribute set, i.e., \(\mathcal {X}=\mathcal {A}=\{x_{1},\cdots,x_{n}\}\). Then, \(\mathcal {X}\) is updated by the following two sub-steps.
Step 1.1. For each attribute \(x_{i}\) (\(1\leq i\leq n\)) in \(\mathcal {X}\), the Pearson correlation coefficient \(R^{q}_{i}\) between \(x_{i}\) and the target value \(y_{q}\) of query Q is calculated by formula (2).
Step 1.2. For any two attributes \(x_{i}\) and \(x_{j}\) in \(\mathcal {X}\), the Pearson correlation coefficient \(r_{ij}\) of them is calculated by formula (1). If \(x_{i}\) and \(x_{j}\) are not β-compatible (i.e., \(|r_{ij}|>\beta \)), one of them is redundant in \(\mathcal {X}\) and should be removed. If \(x_{i}\) has a stronger linear correlation with query Q than \(x_{j}\), i.e., \(|R^{q}_{i}|>|R^{q}_{j}|\), remove \(x_{j}\) from \(\mathcal {X}\); otherwise, remove \(x_{i}\) from \(\mathcal {X}\).
After the above two sub-steps, the candidate relative kernel dataset \(\mathcal {X}\) has been obtained by removing redundant attributes.
Step 2. Retrieve the (k,β)-relative kernel dataset from the candidate relative kernel dataset.
Any two attributes in the candidate relative kernel dataset \(\mathcal {X}\) are β-compatible according to Step 1.2, and the Pearson correlation coefficient between each attribute in \(\mathcal {X}\) and query Q has been calculated in Step 1.1. We therefore select the k attributes whose Pearson correlation coefficients have the top-k absolute values. These top-k absolute values are denoted as \(|R^{q}_{a_{1}}| \geq |R^{q}_{a_{2}}| \geq \cdots \geq |R^{q}_{a_{k}}|\) and the corresponding k attributes as \(x_{a_{1}},\cdots,x_{a_{k}}\). These k attributes form the (k,β)-relative kernel dataset of query Q, i.e., \(\mathcal {K}^{q}=\{x_{a_{1}},\cdots,x_{a_{k}}\}\).
After the above two steps, the (k,β)-relative kernel dataset for query Q is derived. The detailed operations are formally described in Algorithm 1, which retrieves the (k,β)-relative kernel dataset for continuous query Q in IoT systems by linear correlation analysis.
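As an illustration, the two steps above can be sketched in Python. The data layout (a mapping from attribute names to sampled value lists, plus a list of target values) and the function names are assumptions made for this sketch rather than the paper's notation; columns are assumed non-constant so the Pearson denominator is nonzero.

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)  # assumes non-constant columns

def rkdr(columns, target, k, beta):
    """columns: attribute name -> list of sampled values; target: the query's
    target values y_q over the same samples.  Returns k attribute names."""
    # Step 1.1: correlation R^q_i of each attribute with the query target.
    r_q = {a: pearson(vals, target) for a, vals in columns.items()}
    # Step 1.2: for every pair that is not beta-compatible (|r_ij| > beta),
    # drop the attribute with the weaker absolute correlation to the query.
    attrs, removed = list(columns), set()
    for i, a in enumerate(attrs):
        if a in removed:
            continue
        for b in attrs[i + 1:]:
            if b in removed:
                continue
            if abs(pearson(columns[a], columns[b])) > beta:
                loser = b if abs(r_q[a]) > abs(r_q[b]) else a
                removed.add(loser)
                if loser == a:
                    break
    kept = [a for a in attrs if a not in removed]
    # Step 2: the k beta-compatible attributes with the top-k |R^q| values.
    return sorted(kept, key=lambda a: abs(r_q[a]), reverse=True)[:k]
```

Note that the pairwise pruning examines attributes in a fixed order; as in Step 1.2, the surviving attribute of each incompatible pair is the one more strongly correlated with the query.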
4.2 Piecewise linear fitting-based relative kernel dataset retrieving algorithm
In some scenarios, linear correlation is not enough to describe the relationship between an attribute and a continuous query. For example, the correlation between an attribute and a continuous query could be exponential, quadratic, or logarithmic. In this situation, the RKDR algorithm proposed in the previous subsection may no longer be applicable. As a consequence, the piecewise linear fitting-based relative kernel dataset retrieving (PLF-RKDR) algorithm is proposed in this subsection to improve the RKDR algorithm.
Since it is almost impossible to know the actual correlation function between an attribute and a continuous query, it is difficult to measure the correlation coefficient between them directly. The PLF-RKDR algorithm therefore applies the piecewise linear fitting method to approximate the actual correlation function. Since each segment of the piecewise linear function is a linear function, the PLF-RKDR algorithm combines the Pearson correlation coefficients of all segments to measure the correlation between each attribute and the continuous query. The PLF-RKDR algorithm contains the following three steps.
Step 1. Calculate the Pearson correlation coefficient of each attribute \(x_{i}\in \mathcal {A}\) and the continuous query Q by piecewise linear fitting.
We apply the training set \(\mathcal {S}\) to calculate the piecewise linear function between each attribute \(x_{i}\in \mathcal {A}\) and the target value of the continuous query Q.
Firstly, for each training example \(S_{l}\) (\(1\leq l\leq m\)), we extract the tuple \((x_{i}^{l},y_{q}^{l})\), where \(x_{i}^{l}\) is the value of attribute \(x_{i}\) and \(y_{q}^{l}\) is the target value in \(S_{l}\). Then, we sort the extracted tuples in non-descending order of the values of attribute \(x_{i}\) and denote the ordered tuples as \(\{(x^{1}_{i},y^{1}_{q}),\cdots,(x^{m}_{i},y^{m}_{q})\}\), where \(x^{1}_{i}\leq \cdots \leq x^{m}_{i}\).
Secondly, the optimal segment points of the sorted tuples \(\{(x^{1}_{i},y^{1}_{q}),\cdots,(x^{m}_{i},y^{m}_{q})\}\) are calculated recursively by the method proposed in [26]. For one segment \(\{(x^{s}_{i},y^{s}_{q}), (x^{s+1}_{i},y^{s+1}_{q}), \cdots,(x^{e}_{i},y^{e}_{q})\}\), its linear fitting function is derived by the least-squares method and is denoted as \(F_{i}^{(s,e)}(x)\); the error sum of squares of \(F_{i}^{(s,e)}(x)\) is E(s,e). The optimal segment point \((x^{j}_{i},y^{j}_{q})\) of this segment should satisfy \(j=\mathop {\arg \min }_{s< l< e} (E(s,l)+E(l,e))\). This process is conducted recursively until \(E(s,e)-(E(s,j)+E(j,e))\leq \gamma \). The formal description is presented in Algorithm 2, which returns the set of segment points \(\mathcal {P}_{i}=\{s_{1},s_{2},\cdots,s_{l_{i}}\}\) for segment \(\{(x^{s}_{i},y^{s}_{q}), \cdots,(x^{e}_{i},y^{e}_{q})\}\).
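This recursive segmentation can be sketched in Python as follows, assuming Algorithm 2 follows exactly the arg-min and threshold rule stated above; the function names `sse` (playing the role of E(s,e)) and `segment_points` are illustrative, not the paper's.

```python
def sse(pts, s, e):
    """Error sum of squares of the least-squares line fit to pts[s..e] (inclusive)."""
    xs = [p[0] for p in pts[s:e + 1]]
    ys = [p[1] for p in pts[s:e + 1]]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx if sxx else 0.0   # least-squares slope (0 for constant x)
    b = my - a * mx                 # least-squares intercept
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))

def segment_points(pts, s, e, gamma):
    """Return the sorted segment-point indices of pts[s..e], endpoints included."""
    # Find the candidate split j minimizing E(s, j) + E(j, e), with s < j < e.
    best_j, best_cost = None, float("inf")
    for j in range(s + 1, e):
        cost = sse(pts, s, j) + sse(pts, j, e)
        if cost < best_cost:
            best_j, best_cost = j, cost
    # Stop when splitting reduces the error by no more than gamma.
    if best_j is None or sse(pts, s, e) - best_cost <= gamma:
        return [s, e]
    left = segment_points(pts, s, best_j, gamma)
    right = segment_points(pts, best_j, e, gamma)
    return left[:-1] + right        # merge, dropping the duplicated split index
```

For example, on a V-shaped sequence of tuples the recursion places a single segment point at the vertex, yielding two linear pieces.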
Finally, the sorted tuples \(\{(x^{1}_{i},y^{1}_{q}),\cdots,(x^{m}_{i},y^{m}_{q})\}\) are divided into \(l_{i}-1\) subsets according to the calculated segment point set \(\mathcal {P}_{i}\). Each subset is denoted as \(D_{j}=\{(x^{s_{j}}_{i},y^{s_{j}}_{q}), \cdots,(x^{s_{j+1}}_{i},y^{s_{j+1}}_{q})\}\), where \(s_{j},s_{j+1}\in \mathcal {P}_{i}\) and \(1\leq j< l_{i}\). The Pearson correlation coefficient \(R^{q}_{ij}\) for subset \(D_{j}\) is calculated by the following formula,
$$ R^{q}_{ij}=\frac{\sum\nolimits_{l=s_{j}}^{s_{j+1}}(x_{i}^{l}-\overline{x_{i}}^{j})(y_{q}^{l}-\overline{y_{q}}^{j})}{\sqrt{\sum\nolimits_{l=s_{j}}^{s_{j+1}}(x_{i}^{l}-\overline{x_{i}}^{j})^{2}}\sqrt{\sum\nolimits_{l=s_{j}}^{s_{j+1}}(y_{q}^{l}-\overline{y_{q}}^{j})^{2}}} $$
(3)
where \(\overline {x_{i}}^{j}=\frac {1}{s_{j+1}-s_{j}+1}\sum \nolimits _{l=s_{j}}^{s_{j+1}}x_{i}^{l}\) and \(\overline {y_{q}}^{j}=\frac {1}{s_{j+1}-s_{j}+1}\sum \nolimits _{l=s_{j}}^{s_{j+1}}y_{q}^{l}\). As a consequence, the correlation coefficient of attribute \(x_{i}\) and the continuous query Q is the weighted average of the absolute values of all calculated Pearson correlation coefficients \(\{R^{q}_{ij}\mid 1\leq j< l_{i}\}\), given by the following formula.
$$ \begin{aligned} R^{q}_{i}=\sum\limits_{j=1}^{l_{i}-1}{|R^{q}_{ij}|\times \frac{s_{j+1}-s_{j}}{x^{m}_{i}-x^{1}_{i}}} \end{aligned} $$
(4)
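Formulas (3) and (4) can be sketched together in Python. The helper names are illustrative, and we read the weight in formula (4) as the fraction of the attribute's value range \(x^{m}_{i}-x^{1}_{i}\) covered by each segment, which keeps the numerator and denominator of the weight in the same units; this is an interpretation on our part.

```python
def pearson(xs, ys):
    """Sample Pearson correlation coefficient, as in formula (3)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs)
           * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den if den else 0.0

def piecewise_correlation(pts, seg):
    """pts: (x_i, y_q) tuples sorted by x; seg: segment-point indices s_1 < ... < s_l.
    Returns R^q_i: the weighted average of per-segment |Pearson| values (formula (4)),
    each segment weighted by the fraction of the x-range it covers."""
    x_range = pts[-1][0] - pts[0][0]
    r = 0.0
    for s, e in zip(seg, seg[1:]):          # consecutive segment points define D_j
        xs = [p[0] for p in pts[s:e + 1]]
        ys = [p[1] for p in pts[s:e + 1]]
        r += abs(pearson(xs, ys)) * (xs[-1] - xs[0]) / x_range
    return r
```

On the V-shaped example from the segmentation sketch, both pieces are perfectly linear (per-segment coefficients +1 and −1), so the combined coefficient \(R^{q}_{i}\) is 1 even though the overall relationship is not linear.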
Step 2. Calculate the candidate relative kernel dataset \(\mathcal {X}\) through reducing the redundancies in attributes.
Initially, all attributes are added to \(\mathcal {X}\), i.e., \(\mathcal {X}=\mathcal {A}=\{x_{1},\cdots,x_{n}\}\). Then, for any two attributes \(x_{i},x_{j}\in \mathcal {X}\), their Pearson correlation coefficient \(r_{ij}\) is computed by formula (1). When \(|r_{ij}|>\beta \), meaning that \(x_{i}\) and \(x_{j}\) are not β-compatible, they are redundant in \(\mathcal {X}\) and one of them should be removed. To decide which, \(|R^{q}_{i}|\) and \(|R^{q}_{j}|\) are compared, where \(R^{q}_{i}\) and \(R^{q}_{j}\) are calculated by formula (4). If \(|R^{q}_{i}|>|R^{q}_{j}|\), \(x_{j}\) is removed from \(\mathcal {X}\); otherwise, \(x_{i}\) is removed.
Step 3. Retrieve the relative kernel dataset from the candidate relative kernel dataset for the continuous query Q.
After Step 2, all attributes in candidate relative kernel dataset \(\mathcal {X}\) are β-compatible. Then, we can obtain the top-k correlation coefficients of all attributes in \(\mathcal {X}\) calculated in Step 1 and the corresponding k attributes. The top-k correlation coefficients are denoted as \(R^{q}_{a_{1}}\geq R^{q}_{a_{2}}\geq \cdots \geq R^{q}_{a_{k}}\), and the corresponding k attributes are denoted as \(x_{a_{1}}, x_{a_{2}}, \cdots, x_{a_{k}}\). Finally, the (k,β)-relative kernel dataset of query Q is \(\mathcal {K}^{q}=\{x_{a_{1}}, x_{a_{2}}, \cdots, x_{a_{k}}\}\).
After the above three steps, the (k,β)-relative kernel dataset of query Q is calculated. The detailed operations are formally described in Algorithm 3, which retrieves the (k,β)-relative kernel dataset of continuous query Q in IoT systems by the piecewise linear fitting method. Algorithm 2 is called by Algorithm 3 to return the set of segment points for the piecewise linear fitting.