Introduction
Research methodology
Research questions
- RQ1: What are the different types of errors in sensor data?
- RQ2: How to quantify or detect errors in sensor data?
- RQ3: How to correct the errors in sensor data?
- RQ4: What domains are the different types of methods proposed in?
Search process
Improving search strategy by topic modelling
The topics are modelled using LatentDirichletAllocation. For the purpose of this analysis, the titles and abstracts of the publications identified by search query (1) are used to model the underlying topics. The visualization of the LDA model with 12 topics obtained from the 13,057 documents (titles and abstracts) of search string (1) is shown in Fig. 2a, with the intertopic distances showing the marginal topic distribution. Figure 2b–d lists the top 30 most relevant terms for Topic 1, Topic 2 and Topic 8, respectively. Topic 1 and Topic 2 both have top terms related to sensors and data; however, Topic 1 appears to focus more on systems and applications, whereas Topic 2 is more related to methods and algorithms. Looking at the top 30 keywords of Topic 8, one might label that topic as “Imaging” or “Satellite Imaging”, since words such as “image”, “video”, “resolution”, “camera”, “satellite” and “pixel” occur in that cluster. This topic modelling step shows that a handful of papers in the initial search results are related to imaging. Because imaging is not a topic we want to focus on, we use the terms of Topic 8 to refine the search string and include them as one of the exclusion criteria in this paper.
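A minimal sketch of this topic-modelling step, assuming the scikit-learn implementation of LatentDirichletAllocation and a small placeholder corpus standing in for the 13,057 title-and-abstract documents, could look as follows (the intertopic-distance map in Fig. 2a is typically produced with a separate visualisation tool such as pyLDAvis):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Placeholder corpus: in the review, `documents` would hold the concatenated
# title and abstract of each publication returned by search query (1).
documents = [
    "wireless sensor network fault detection for streaming data",
    "satellite image resolution enhancement using camera pixel models",
    "outlier detection algorithm for temperature sensor measurements",
]

# Bag-of-words representation with English stop words removed
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(documents)

# LDA with 12 topics, mirroring the model visualised in Fig. 2a
lda = LatentDirichletAllocation(n_components=12, random_state=0)
lda.fit(dtm)

# Top terms per topic (cf. the top-30 term lists in Fig. 2b-d)
terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[::-1][:30]
    print(f"Topic {k + 1}:", ", ".join(terms[i] for i in top))
```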
Inclusion and exclusion criteria
Study quality assessment
Study selection
Data extraction
- Title and abstract of the publication,
- Authors’ names,
- Database,
- Publication year,
- Types of sensor data errors addressed (RQ1),
- Types of methods for detecting or mitigating errors (RQ2 and RQ3),
- The domain in which the methods have been developed (RQ4).
Data synthesis
Risk of bias
Results
Symbol | Description |
---|---|
\(x_i(t_j)\) | Measured data value \(x_i\) of sensor i at a specific point in time \(t_j\) |
\({\hat{x}}\) | Estimated sensor data value |
\(\vec {x}\) | Sensor data vector, where \(\vec {x} = \left( x_1,\ldots ,x_i,\ldots ,x_V\right)\) is a row vector obtained at the same point in time |
t | Time in sensor data stream, e.g. \(x_t\) is the observed sensor data value at time t |
i | Column index \(i=1,\ldots ,V\) |
j | Row index \(j=1,\ldots ,N\) |
f | Feature |
q | Size of moving window |
N | Number of samples |
V | Number of variables, e.g. temperature, humidity, voltage |
M | Number of sensor units |
F | Number of features |
\(\mathbf{Z }\) | Sensor data stream in the form of a time series, \(\mathbf{Z } = \left( \dots ,\vec {x}_{t-1},\vec {x}_{t},\vec {x}_{t+1},\dots \right)\) |
\(\mathbf{X }\) | Sensor data matrix where \(\mathbf{X } \in {\mathbb {R}}^{N\times V}\), \(\mathbf{X } = \left( \vec {x}_{1}, \ldots ,\vec {x}_{j},\dots ,\vec {x}_{N}\right)\) |
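To make this notation concrete, the following short NumPy sketch (dimensions and values are purely illustrative) builds a sensor data matrix \(\mathbf{X}\) and slices out the quantities defined above:

```python
import numpy as np

N, V, q = 100, 3, 5          # N samples, V variables, moving-window size q
rng = np.random.default_rng(0)

# Sensor data matrix X in R^{N x V}; row j is the vector x_j observed at time t_j
X = rng.normal(size=(N, V))

x_j = X[9]                   # row vector (x_1, ..., x_V) measured at one time point
x_i = X[:, 0]                # all measurements of variable i = 1 over time
window = X[10 - q:10]        # the q most recent rows before time index 10
```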
Types of errors in sensor data
Type of error | Papers | Total
---|---|---
Outliers |  | 32
Missing data |  | 16
Bias |  | 12
Drift |  | 12
Noise |  | 8
Constant value |  | 7
Uncertainty |  | 6
Stuck-at-zero |  | 6
Methods for detecting and quantifying errors in sensor data
Method | Errors addressed | Papers | Total
---|---|---|---
Principal component analysis | Outliers, bias, drift, stuck-at-zero |  | 7
Artificial neural network | Outliers, bias, drift, constant values, noise, stuck-at-zero, uncertainty |  | 6
Ensemble classifiers | Outliers, drift, constant values, noise, uncertainty |  | 4
Support vector machine | Outliers |  | 2
Clustering | Outliers |  | 2
Ontology/knowledge-based systems | Uncertainty (inaccurate data), missing data (incomplete data) |  | 2
Univariate autoregressive models | Outliers | [40] | 1
Statistical generative models | Outliers | [49] | 1
Grey prediction model | Outliers, noise, constant values | [52] | 1
Particle filtering | Bias, scaling | [71] | 1
Association rule mining | Outliers | [56] | 1
Bayesian network | Outliers, noise | [44] | 1
Euclidean distance | Outliers | [42] | 1
Hybrid methods | | |
Polynomial predictive filter and fuzzy rules | Outliers | [53] | 1
Dempster–Shafer theory and mathematical modelling | Drift, noise | [75] | 1
Anomaly/fault detection
Principal component analysis (PCA)
Artificial neural network
Ensemble classifiers
Support vector machine
tsfresh [90] can also be used to obtain time series features. The normal behaviour in each time window is learned using a One-Class Centered Quarter-Sphere SVM to find the minimum radius (hyperplane), which helps detect temporal anomalies. The radius is then broadcast to all spatially neighbouring nodes, i.e. sensor nodes within communication range, and the median radius is calculated. The online characteristic allows the data to be checked against neighbouring nodes to determine whether a temporal anomaly is also spatially anomalous, thus confirming the detection of an actual anomaly.
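As an illustration of this detection scheme, the sketch below extracts per-window features with tsfresh and fits a one-class SVM to flag anomalous windows. It is only a sketch: the data, window size and parameters are hypothetical, scikit-learn's standard OneClassSVM stands in for the centered quarter-sphere variant, and the spatial comparison of radii across neighbouring nodes is omitted.

```python
import numpy as np
import pandas as pd
from tsfresh import extract_features
from tsfresh.feature_extraction import MinimalFCParameters
from sklearn.svm import OneClassSVM

# Hypothetical univariate sensor stream; each node would process its own data online
rng = np.random.default_rng(0)
values = rng.normal(loc=20.0, scale=0.5, size=500)
values[300:305] += 8.0                      # injected temporal anomaly

q = 25                                      # size of moving window
n_windows = len(values) // q
windows = pd.DataFrame({
    "id": np.repeat(np.arange(n_windows), q),
    "time": np.tile(np.arange(q), n_windows),
    "value": values[: n_windows * q],
})

# Time-series features per window via tsfresh (MinimalFCParameters keeps it small)
features = extract_features(windows, column_id="id", column_sort="time",
                            default_fc_parameters=MinimalFCParameters())

# One-class SVM learns the normal behaviour; -1 marks anomalous windows
detector = OneClassSVM(nu=0.05, kernel="rbf", gamma="scale")
labels = detector.fit_predict(features.fillna(0.0))

print("Anomalous windows:", np.where(labels == -1)[0])
```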
Clustering
Univariate autoregressive models
Statistical generative models
Grey prediction model
Particle filtering
Association rule mining
Bayesian network
Euclidean distance
Hybrid methods
Uncertainty quantification
Artificial neural network
Ensemble classifiers
Ontology/knowledge-based systems
Methods for correcting errors in sensor data
Method | Errors addressed | Papers | Total
---|---|---|---
Association rule mining | Missing data |  | 4
Clustering | Missing data | [65] | 1
k-Nearest Neighbour | Missing data | [9] | 1
Singular value decomposition | Missing data | [67] | 1
Empirical mode decomposition | Noise | [76] | 1
Savitzky–Golay filter and multivariate thresholding | Noise | [77] | 1
Hybrid methods | | |
Clustering and probabilistic matrix factorization | Missing data | [63] | 1
Missing data imputation
Association rule mining
Clustering
k-Nearest Neighbour
Singular value decomposition
Hybrid methods
De-noising
Signal processing
Savitzky–Golay filter and multivariate thresholding
Methods for detecting and correcting errors in sensor data
Method | Errors addressed | Papers | Total
---|---|---|---
Principal component analysis | Outliers, bias, drift, constant values, noise, stuck-at-zero |  | 2
Artificial neural network | Outliers, bias |  | 2
Bayesian network | Outliers, missing data |  | 2
Grey prediction model | Outliers, bias, constant values, stuck-at-zero | [30] | 1
Dempster–Shafer theory | Uncertainty | [80] | 1
Calibration-based method | Bias, drift, noise, stuck-at-zero | [73] | 1
Hybrid methods | | |
Principal component analysis-based methods | Outliers, bias, drift, noise, constant values, stuck-at-zero |  | 3
Kalman filter-based methods | Outliers, bias, drift, missing data |  | 2
Dempster–Shafer theory & Ontology | Uncertainty (inaccurate data), missing data (incomplete data) | [68] | 1
Fault detection, isolation and recovery
Principal component analysis
Artificial neural network
Bayesian network
Grey prediction model
Dempster–Shafer theory
Calibration-based method
Principal component analysis-based hybrid methods
Kalman filter-based hybrid methods
Dempster–Shafer theory-based hybrid method
Types of domains
Domain | Papers | Total
---|---|---
General, e.g. WSNs, IoT, streaming data |  | 29
Industrial processes, e.g. chemical gas process monitoring, power plants, part injection molding |  | 7
Environmental sensing, e.g. air quality monitoring, marine environment, soil moisture |  | 6
Smart city, e.g. smart spaces, smart grid, wastewater treatment, traffic flow |  | 6
Healthcare, e.g. body sensor networks, artificial pancreas, continuous glucose monitor |  | 4
HVAC systems |  | 3
Context-based application/activity recognition |  | 2
Dataset | Domain | Papers | Total
---|---|---|---
SensorScope (GSB, LUCE, FishNet) | Environmental sensing | [38] (GSB and FishNet), [48] (GSB and LUCE) | 7
Intel Berkeley | Environmental sensing |  | 6
UCI machine learning repository water treatment plant dataset | Smart city (wastewater treatment) |  | 2
Numenta anomaly benchmark | General (streaming data) | [34] | 1
Networked aquatic microbial observing system (NAMOS) | Environmental sensing (marine environment) | [48] | 1
TasMAN Sullivans Cove Marine | Environmental sensing (marine environment) | [79] | 1
MERLSense | Environmental sensing | [66] | 1
Caltrans PeMS traffic monitoring | Smart city (traffic flow monitoring) | [9] | 1
PhysioNet | Healthcare | [7] | 1
Discussion
Datasets and error imputation and labelling
Dataset type | Availability | No. datasets | Papers | No. papers
---|---|---|---|---
Real-world datasets | Published and currently available | 21 |  | 16
 | Unpublished or currently not available | 33 |  | 32
Simulated datasets | Published or reproducible | 2 |  | 2
 | Not reproducible | 16 |  | 15
The Bayesian estimation is carried out with PyMC [105], which is designed to implement Bayesian statistical models. Bayesian estimation is used instead of a classical t-test because it provides the complete distributional information, i.e. the probability of every possible difference of means and every possible difference of standard deviations, which allows the difference between the two groups to be estimated rather than simply testing whether the two groups differ based on the observed data [104]. Figure 6a, b show the posterior distributions of the mean citation rates for both groups, i.e. the available group and the non-available group. The mean of the available group, available_mean, is approximately 6.87, whereas the mean of the non-available group, non_available_mean, is 2.16. To compare the two groups, Fig. 6c shows the posterior distribution of the difference of means. There is a 99.9% probability that the mean citation rate of publications using public datasets is larger than that of publications not using public datasets. This suggests that publicly available datasets are easier to access, which leads to a higher citation rate for papers that use them for method evaluation. Moreover, the ease of access allows researchers to directly test and compare their methods against existing solutions to sensor data quality problems evaluated on the same dataset.
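A minimal sketch of such a Bayesian comparison of two group means, written against the current PyMC API with synthetic placeholder data and a simple normal likelihood, is shown below; it is an illustration, not the authors' analysis code.

```python
import numpy as np
import pymc as pm  # the paper cites PyMC [105]; API here is the modern PyMC interface

rng = np.random.default_rng(0)
# Hypothetical citation-rate data standing in for the two groups of papers
available = rng.exponential(scale=6.9, size=50)       # papers using public datasets
non_available = rng.exponential(scale=2.2, size=60)   # papers using unavailable datasets

with pm.Model():
    # Weakly informative priors on each group's mean and standard deviation
    mu_a = pm.Normal("available_mean", mu=available.mean(), sigma=10)
    mu_n = pm.Normal("non_available_mean", mu=non_available.mean(), sigma=10)
    sd_a = pm.HalfNormal("available_std", sigma=10)
    sd_n = pm.HalfNormal("non_available_std", sigma=10)

    # Likelihoods for the observed citation rates
    pm.Normal("obs_available", mu=mu_a, sigma=sd_a, observed=available)
    pm.Normal("obs_non_available", mu=mu_n, sigma=sd_n, observed=non_available)

    # Posterior of the difference of means (cf. Fig. 6c)
    pm.Deterministic("difference_of_means", mu_a - mu_n)

    idata = pm.sample(2000, tune=1000, chains=2, random_seed=0)

# Probability that the "available" group has the larger mean citation rate
diff = idata.posterior["difference_of_means"].values.ravel()
print("P(difference > 0) =", (diff > 0).mean())
```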
Evaluation metrics
Classification metrics
 | Actual positive | Actual negative
---|---|---
Predicted positive | True positive (TP) | False positive (FP)
Predicted negative | False negative (FN) | True negative (TN)
Evaluation metric | Formula | Papers | Total
---|---|---|---
Recall | \(\frac{TP}{TP+FN}\) |  | 13
False positive rate (FPR) | \(\frac{FP}{TN+FP}\) |  | 12
False negative rate (FNR) | \(\frac{FN}{TP+FN}\) |  | 6
Precision | \(\frac{TP}{TP+FP}\) |  | 5
Accuracy | \(\frac{TP+TN}{TP+TN+FP+FN}\) |  | 4
F-score | \(2 \times \frac{precision \times recall}{precision + recall}\) |  | 2
Matthews correlation coefficient (MCC) | \(\frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}\) | [65] | 1
Regression metrics | | |
Root mean squared error (RMSE) | \(\sqrt{MSE}\) |  | 4
Mean squared error (MSE) | \(\frac{1}{n}\sum_{i=1}^{n}(x_i - {\hat{x}}_i)^2\) |  | 2
Mean absolute error (MAE) | \(\frac{1}{n}\sum_{i=1}^{n}\lvert x_i - {\hat{x}}_i\rvert\) |  | 2
Mean relative error (MRE) | \(\frac{1}{n}\sum_{i=1}^{n}\frac{\lvert x_i - {\hat{x}}_i\rvert}{x_i}\) |  | 2
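For reference, the sketch below computes the classification metrics from confusion-matrix counts and the regression metrics from paired observed and estimated values; the example inputs are placeholders.

```python
import numpy as np

def classification_metrics(tp, fp, fn, tn):
    """Metrics derived from the confusion-matrix counts defined above."""
    return {
        "recall": tp / (tp + fn),
        "false_positive_rate": fp / (tn + fp),
        "false_negative_rate": fn / (tp + fn),
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "f_score": 2 * tp / (2 * tp + fp + fn),  # equivalent to 2*P*R/(P+R)
        "mcc": (tp * tn - fp * fn)
               / np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }

def regression_metrics(x, x_hat):
    """Metrics comparing observed values x with estimated values x_hat."""
    x, x_hat = np.asarray(x, float), np.asarray(x_hat, float)
    mse = np.mean((x - x_hat) ** 2)
    return {
        "mse": mse,
        "rmse": np.sqrt(mse),
        "mae": np.mean(np.abs(x - x_hat)),
        "mre": np.mean(np.abs(x - x_hat) / x),
    }

# Placeholder example values
print(classification_metrics(tp=40, fp=5, fn=10, tn=45))
print(regression_metrics([20.1, 19.8, 20.5], [20.0, 20.0, 20.0]))
```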