Introduction
Missing data patterns and mechanisms
Missing data patterns
Missing data mechanisms
Missing completely at random (MCAR)
Missing at random (MAR)
Missing not at random (MNAR)
Missing values approaches
Deletion
List-wise or case deletion
Pairwise deletion
Imputation
Simple imputation
Regression imputation
Hot-deck imputation
Expectation–maximization
Multiple imputation
- Each missing value is imputed M times, resulting in M complete data sets.
- The M complete data sets are then analysed separately.
- The results from the M analysed data sets are combined (pooled) to produce the final imputation result.
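The three steps can be sketched with a toy example that pools an estimated mean over M imputed data sets using Rubin's rules (illustrative code; the data and the simple stochastic imputation model are assumptions, not taken from the cited works):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a variable with roughly 20% of its values missing completely at random.
x = rng.normal(loc=5.0, scale=2.0, size=200)
mask = rng.random(200) < 0.2
x_obs = x[~mask]

M = 10                                  # number of imputed data sets
means, variances = [], []
for m in range(M):
    # Step 1: impute each missing value by drawing at random from the
    # observed values (a simple stochastic imputation model).
    completed = x.copy()
    completed[mask] = rng.choice(x_obs, size=mask.sum(), replace=True)
    # Step 2: analyse each completed data set (here: estimate the mean).
    means.append(completed.mean())
    variances.append(completed.var(ddof=1) / len(completed))

# Step 3: pool the M results with Rubin's rules.
q_bar = np.mean(means)                  # pooled point estimate
w_bar = np.mean(variances)              # within-imputation variance
b = np.var(means, ddof=1)               # between-imputation variance
total_var = w_bar + (1 + 1 / M) * b     # total variance of the estimate

print(f"pooled mean = {q_bar:.3f}, total variance = {total_var:.4f}")
```

The between-imputation term b is what distinguishes multiple imputation from single imputation: it carries the extra uncertainty caused by the missing values into the final variance estimate.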
Imputation methods inspired by machine learning
K nearest neighbour classification
Support vector machine (SVM)
Decision tree
- CART: Classification and Regression Trees (CART) handles both continuous and categorical values to generate a decision tree and to handle missing values. The algorithm identifies a binary splitting rule based on one predictor variable that segments the data into two nodes by minimising the variance of the outcome within each node. The tree is then grown by applying this splitting recursively until a stopping point determined by the tuning parameters is reached. Imputation is then made from a regression tree by identifying the terminal node to which a new subject belongs and sampling from the outcomes in that node [90]. CART uses the Gini index as its attribute selection measure to build the decision tree and, unlike ID3 and C4.5, does not rely on probabilistic assumptions. CART also generates binary splits, producing binary trees, which other decision tree methods do not. Furthermore, the method uses cost-complexity pruning to remove unreliable branches from the decision tree to improve accuracy, and it makes no distributional assumptions about the data [91].
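The terminal-node sampling described for CART can be sketched as follows (an illustrative example using scikit-learn's DecisionTreeRegressor; the synthetic data and tuning parameters are assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)

# Toy data: the outcome y depends linearly on x; about 10% of y is missing.
x = rng.uniform(0, 10, size=(300, 1))
y = 2.0 * x[:, 0] + rng.normal(0, 0.5, size=300)
missing = rng.random(300) < 0.1

# Grow a regression tree on the complete cases only.
tree = DecisionTreeRegressor(min_samples_leaf=20, random_state=0)
tree.fit(x[~missing], y[~missing])

# Impute by locating each incomplete row's terminal node and sampling
# one observed outcome from that node (rather than taking the leaf mean).
train_leaves = tree.apply(x[~missing])
y_imp = y.copy()
for i in np.where(missing)[0]:
    leaf = tree.apply(x[i:i + 1])[0]
    pool = y[~missing][train_leaves == leaf]
    y_imp[i] = rng.choice(pool)
```

Sampling from the leaf rather than using its mean preserves the spread of the observed outcomes, which matters when the imputed data are analysed further.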
- ID3: This decision tree technique is built in two stages: tree building and pruning. A top-down, greedy search is applied through the given set to test each attribute at every tree node, and the information gain measure is used to select the splitting attribute. ID3 only accepts categorical attributes when building a tree model and does not give precise results when the data are noisy. Continuous attributes with missing values can be handled by discretising them, or by considering each candidate value as a split point and taking a threshold on the attribute values. The method does not support pruning by default; however, pruning can be performed after the model is built [91].
- C4.5: This algorithm was developed after ID3 and handles both continuous and categorical values when constructing a decision tree. C4.5 addresses continuous attributes by separating the attribute values into two portions based on a selected threshold, such that all values above the threshold form one child node and the remaining values form the other. The Gain Ratio is used as the attribute selection measure to construct the decision tree. The algorithm handles missing values by selecting an attribute using only the instances with a known value for the information gain calculation. Instances with non-missing attribute values are then split according to their actual values, while instances with a missing value are split proportionally to the known-value splits. A test instance with a missing value is likewise distributed across all child nodes in proportion to the training examples in each [92]. By normalising the information gain, the algorithm corrects the bias towards attributes with many output values.
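C4.5's gain ratio criterion, and the way it penalises attributes with many values, can be illustrated with a short sketch (illustrative code, not from the cited works):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def gain_ratio(feature, labels):
    """C4.5's criterion: information gain divided by the split information."""
    values, counts = np.unique(feature, return_counts=True)
    weights = counts / counts.sum()
    conditional = sum(w * entropy(labels[feature == v])
                      for v, w in zip(values, weights))
    gain = entropy(labels) - conditional
    split_info = float(-(weights * np.log2(weights)).sum())
    return gain / split_info if split_info > 0 else 0.0

labels = np.array([0, 0, 1, 1])
# A two-valued attribute that separates the classes perfectly ...
print(gain_ratio(np.array(["a", "a", "b", "b"]), labels))   # 1.0
# ... is preferred over a unique-id attribute with the same raw gain,
# because the larger split information halves its ratio.
print(gain_ratio(np.array(["a", "b", "c", "d"]), labels))   # 0.5
```

Both attributes achieve the same raw information gain (1 bit), but the many-valued attribute's split information is larger, so its gain ratio is lower; this is exactly the bias correction described above.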
Clustering imputation
- Every feature vector has to be assigned to a cluster: $$\bigcup _{k=1}^{K} {C}_k = {T}$$ (15)
- Every cluster has at least one feature vector assigned to it: $$C_k\ne \phi , \quad k=1,\ldots,K$$ (16)
- Each feature vector is assigned to one and only one cluster: $$C_k\bigcap C_{kk} =\phi$$ (17) where \(k\ne kk\).
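A minimal clustering-based imputation sketch consistent with constraints (15)-(17): fit K clusters on the complete rows, then fill each missing entry from the centroid of the nearest cluster (illustrative code using scikit-learn's KMeans; the synthetic data are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)

# Two well-separated Gaussian clusters in two dimensions.
X = np.vstack([rng.normal(0.0, 0.3, (100, 2)),
               rng.normal(5.0, 0.3, (100, 2))])
X_miss = X.copy()
holes = rng.random(200) < 0.15
X_miss[holes, 1] = np.nan               # second feature missing in some rows

# Fit K = 2 clusters on the complete rows; the resulting partition
# satisfies (15)-(17): the clusters cover the data, are non-empty,
# and are pairwise disjoint.
complete = ~np.isnan(X_miss).any(axis=1)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_miss[complete])

# Assign each incomplete row to the nearest centroid using only its
# observed feature, then fill the gap with that centroid's coordinate.
X_imp = X_miss.copy()
for i in np.where(~complete)[0]:
    d = np.abs(km.cluster_centers_[:, 0] - X_miss[i, 0])
    X_imp[i, 1] = km.cluster_centers_[d.argmin(), 1]
```

When the clusters are well separated, as here, each incomplete row can be assigned reliably from its observed features alone, and the centroid provides a plausible fill value.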
Ensemble methods
Refs. | Dataset | Performance objective | Mechanism | Summary | Limitations |
---|---|---|---|---|---|
[124] | Balance, Breast, Glass, Bupa, Cmc, Iris, Housing, Ionosphere, Wine | To study the influence of noise on missing value handling methods when noise and missing values are distributed throughout the dataset | MCAR, MAR, MNAR | The study showed that noise had a negative effect on imputation methods, particularly when the noise level is high | Division of qualitative values may have been a problem |
[85] | German, Glass(g2), heart-statlo, ionosphere, kr-vs-kp, labor, Pima-indians, sonar, balance-scale, iris, waveform, lymphography, vehicle, anneal, glass, satimage, image, zoo, LED, vowel, letter | To experiment with methods for handling incomplete training and test data under different missingness proportions and mechanisms | MCAR, MAR | The study discussed the relative strengths and weaknesses of decision trees for missing value imputation | The approach did not consider correlations between features |
[125] | Los Angeles ozone pollution and simulated data | To compare approaches to classification and regression problems on high-dimensional data under a variety of missing data mechanisms | MCAR, MAR | The authors tested the dependence of the imputation techniques on the correlation structure of the data | Random choice of missing values may have weakened the consistency of the experiment |
[38] | Breast Cancer | To evaluate the performance of statistical and machine learning imputation techniques used to predict recurrence in breast cancer patient data | – | The machine learning techniques proved to be the best suited for imputation and led to a significant enhancement of prognosis accuracy compared to statistical techniques | One type of data was used for the imputation model; therefore, the presented results may not generalise to different datasets |
[126] | Iris, Wine, Voting, Tic-Tac-Toe, Hepatitis | To propose a novel technique to impute missing values based on feature relevance | MCAR, MAR | The approach employed mutual information to measure feature relevance and proved to reduce classification bias | Random choice of missing values may have weakened the consistency of the experiment |
[127] | Liver, Diabetis, Breast Cancer, Heart, WDSC, Sonar | Experimented on missing data handling using Random Forests and specifically analysed the impact of correlation of features on the imputation results | MCAR, MAR, MNAR | The imputation approach was reported to be generally robust with performance improving when increasing correlation | Random choice of missing values in MNAR could have weakened the consistency of the experiment |
[128] | Wine, Simulated | To create an improved imputation algorithm for handling missing values | MCAR, MAR, MNAR | Demonstrated the superiority of the new algorithm over existing imputation methods on accuracy of imputing missing data | Features may have had different percentages of missing data; the MAR and MNAR settings may also have been weakened |
[129] | De novo simulation, Health surveys S1, S2 and S3 | To compare various techniques for combining internal validation with multiple imputation | MCAR, MAR | The approach was regarded as comprehensive with regard to the use of simulated and real data with different data characteristics, validation strategies and performance measures | Potential bias was influenced by the relationship between effect strengths and missingness in covariates |
[130] | Pima Indian Diabetes dataset | To experiment with a missing values approach that takes feature relevance into account | – | The results proved that the hybrid algorithm was better than existing methods in terms of accuracy, RMSE and MAE | The missing values mechanism was not considered |
[13] | Iris, Voting, Hepatitis | Proposed an iterative KNN that took the class labels into account | MCAR, MAR | The technique considered class labels and performed well against other imputation methods | The approach has not been theoretically proven to converge, though convergence was shown empirically |
[74] | Camel, Ant, Ivy, Arc, Pcs, Mwl, KC3, Mc2 | To develop a novel incomplete-instance based imputation approach that utilised cross-validation to improve the parameters for each missing value | MCAR, MAR | The study demonstrated that the approach was superior to other missing values approaches | – |
[131] | Blood, breast-cancer, ecoli, glass, ionosphere, iris, Magic, optdigits, pendigits, pima, segment, sonar, waveform, wine, yeast, balance-scale, Car, chess-c, chess-m, CNAE-9, lymphography, mushroom, nursery, promoters, SPECT, tic-tac-toe, abalone, acute, card, contraceptive, German, heart, liver, zoo | To introduce a missing data handling approach with effective imputation results | MCAR | The method calculated the class centre of every class and used the distances between it and the observed data to define a threshold for imputation; it performed better and required less imputation time | Only one missing mechanism was implemented |
[132] | Groundwater | Developed a multiple imputation method that can handle the missingness in a groundwater dataset with a high rate of missing values | MAR | The technique was chosen for its ability to consider the relationships between the variables of interest | There was no prior knowledge of the label of missing data, which may have made imputation difficult |
[133] | Dukes' B colon cancer, Mice Protein Expression and Yeast | Developed a novel hybrid Fuzzy C-means rough-parameter missing value imputation method | – | The technique handled the vagueness and coarseness in the dataset and proved to produce better imputation results | No missing values mechanism was reported for the experiment |
[134] | Forest fire, Glass, Housing, Iris, MPG, MV, Stocks, Wine | Proposed a variant of the forward stage-wise regression (FSR) algorithm for data imputation by modelling the missing values as random variables following a Gaussian mixture distribution | – | The method proved to be effective compared to other approaches that combined standard missing data approaches with the original FSR algorithm | No missing values mechanism was reported for the experiment |
[135] | Weather dataset | Applied four missing data handling methods (listwise deletion, multiple imputation, KNN, MICE) to the training data before classification | – | Of the imputation methods applied, the authors concluded that KNN was the most effective missing data imputation method for photovoltaic forecasting | No missing values mechanism was reported for the experiment |
[136] | Air quality data | To make time series predictions for missing values using three machine learning algorithms and identify the best method | – | The study concluded that deep learning performed better when the data was large, while machine learning models produced better results on smaller data | Heavy time and computational costs for training their most effective method (deep learning) |
[137] | Traumatic Brain Injury and Diabetes | To demonstrate how performance varies with different missing value mechanisms and the imputation method used and further demonstrate how MNAR is an important tool to give confidence that valid results are obtained using multiple imputation and complete case analysis | MCAR, MAR, MNAR | The study showed that both complete case analysis and multiple imputation can produce unbiased results under more conditions | The method was limited by the absence of nonlinear terms in the substantive models |
[138] | Grades Dataset | To develop a new decision tree approach for missing data handling | MCAR, MAR, MNAR | The method produced higher accuracy compared to other missing values handling techniques and a more interpretable classifier | The algorithm suffered from a weakness when the gating variable had no predictive power |
[139] | Air Pressure System data | Proposed a sorted missing percentages approach for filtering attributes when building a machine learning classification model using sensor readings with missing data | – | The technique proved to be effective for scenarios dealing with missing data in industrial sensor data analysis | The proposed approach could not meet the needs of automation |
[139] | Abalone and Boston Housing | To test the reliability of missing value handling under not-missing-at-random conditions | MAR | The results indicated that the approach achieved satisfactory performance in solving the lower incomplete problem compared to six other methods | The approach did not consider any missingness rate, which may have affected the analysis |
[140] | Cleveland Heart disease | Proposed a systematic methodology for imputing missing values using KNN, MICE, mean and mode with four classifiers: Naive Bayes, SVM, logistic regression and random forest | – | The study demonstrated that MICE imputation performed better than the other imputation methods used in the study | The approach compared state-of-the-art methods with simple imputation methods (mean and mode), which are biased and give unrealistic results |
[141] | Iris, Wine, Ecoli and Sonar datasets | To retrieve missing data by considering the attribute correlation in the imputation process using a class center-based adaptive approach using the firefly algorithm | MCAR | The result of the experiment demonstrated that the class center-based firefly algorithm was an efficient method for handling missing values | Imputation was tested on only one missing value mechanism |
[15] | Abalone, Iris, Lymphography and Parkinsons | Proposed a novel tuple-based region splitting imputation approach that used a new metric, mean integrity rate, to measure the missing degree of a dataset and impute various types of missing data | – | The region splitting imputation model outperformed the competing imputation models | A random generator was used to impute missing values, and other missing values mechanisms were not considered |
[142] | Artificial and real metabolomics data | To develop a new kernel weight function-based imputation approach that handles missing values and outliers | MAR | The proposed kernel weight-based approach proved to be superior compared to other data imputation techniques | The method was experimented on one type of dataset and may not perform as reported on other types of data |
Performance metrics for missing data imputation
Mean Absolute Error (MAE)
Mean Squared Error (MSE)
Root Mean Square Error (RMSE)
Area under the curve (AUC)
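The four metrics above can be computed directly; a minimal NumPy sketch (the AUC uses the rank-based Mann-Whitney formulation and does not handle tied scores):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return float(np.mean(np.abs(y_true - y_pred)))

def mse(y_true, y_pred):
    """Mean Squared Error."""
    return float(np.mean((y_true - y_pred) ** 2))

def rmse(y_true, y_pred):
    """Root Mean Square Error: the square root of the MSE."""
    return float(np.sqrt(mse(y_true, y_pred)))

def auc(y_true, scores):
    """Area under the ROC curve via the rank (Mann-Whitney U) formulation.

    Tied scores are not handled; y_true holds 0/1 class labels.
    """
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = y_true == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return float((ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg))
```

MAE, MSE and RMSE score the numerical accuracy of the imputed values, while AUC scores a downstream classifier trained on the imputed data, which is why studies in the comparison table below report different subsets of these metrics.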
Comparisons
Evaluation metrics
Citation | Year | Publisher | Article | Journal/conference/book chapter |
---|---|---|---|---|
[153] | 2020 | Multidisciplinary Digital Publishing Institute | Missing value imputation in stature estimation by learning algorithms using anthropometric data: a comparative study | Applied Sciences |
[139] | 2020 | Multidisciplinary Digital Publishing Institute | Evaluating machine learning classification using sorted missing percentage technique based on missing data | Applied Sciences |
[154] | 2020 | Wiley Online Library | Multiple imputation methods for handling missing values in longitudinal studies with sampling weights: comparison of methods implemented in Stata | Biometrical Journal |
[155] | 2019 | Taylor and Francis | Comparison of performance of data imputation methods for numeric dataset | Applied Artificial Intelligence |
[8] | 2006 | Elsevier | A gentle introduction to imputation of missing values | Journal of clinical epidemiology |
[127] | 2017 | Elsevier | Adjusted weight voting algorithm for random forests in handling missing values | Pattern Recognition |
[60] | 2017 | Elsevier | kNN-IS: an Iterative Spark-based design of the k-Nearest Neighbors classifier for big data | Knowledge-Based Systems |
[156] | 2021 | Elsevier | Ground PM2.5 prediction using imputed MAIAC AOD with uncertainty quantification | Environmental Pollution |
[157] | 2021 | Elsevier | A neural network approach for traffic prediction and routing with missing data imputation for intelligent transportation system | Expert Systems with Applications |
[158] | 2021 | Elsevier | Handling complex missing data using random forest approach for an air quality monitoring dataset: a case study of Kuwait environmental data (2012 to 2018) | Multidisciplinary Digital Publishing Institute |
[159] | 2021 | Elsevier | A new method of missing data estimation with FNN-based tensor heterogeneous ensemble learning for internet of vehicles | Neurocomputing |
[111] | 2006 | IEEE | Ensemble based systems in decision making | IEEE Circuits and systems magazine |
[160] | 2010 | IEEE | Missing value estimation for mixed-attribute data sets | IEEE Transactions on Knowledge and Data Engineering |
[161] | 2014 | IEEE | Modeling and optimization for big data analytics:(statistical) learning tools for our era of data deluge | IEEE Signal Processing Magazine |
[2] | 2014 | IEEE | Handling missing data problems with sampling methods | 2014 International Conference on Advanced Networking Distributed Systems and Applications |
[123] | 2018 | IEEE | An imputation method for missing data based on an extreme learning machine auto-encoder | IEEE ACCESS |
[162] | 2018 | IEEE | A data imputation model in phasor measurement units based on bagged averaging of multiple linear regression | IEEE ACCESS |
[163] | 2018 | IEEE | Missing network data a comparison of different imputation methods | 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM) |
[164] | 2018 | IEEE | MIAEC: missing data imputation based on the evidence chain | IEEE ACCESS |
[165] | 2018 | IEEE | A survey on data imputation techniques: water distribution system as a use case | IEEE ACCESS |
[166] | 2019 | IEEE | Missing values estimation on multivariate dataset: comparison of three type methods approach | International Conference on Information and Communications Technology (ICOIACT) |
[122] | 2019 | IEEE | A novel algorithm for missing data imputation on machine learning | 2019 International Conference on Smart Systems and Inventive Technology (ICSSIT) |
[167] | 2020 | IEEE | Approaches to dealing with missing data in railway asset management | IEEE ACCESS |
[168] | 2020 | IEEE | Traffic data imputation and prediction: an efficient realization of deep learning | IEEE ACCESS |
[169] | 2020 | IEEE | Iterative robust semi-supervised missing data imputation | IEEE ACCESS |
[170] | 2021 | IEEE | Neighborhood-aware autoencoder for missing value imputation | 2020 28th European Signal Processing Conference (EUSIPCO) |
[171] | 2021 | IEEE | Hybrid missing value imputation algorithms using fuzzy C-means and vaguely quantified rough set | IEEE Transactions on Fuzzy Systems |
[56] | 2016 | SAGE Publications | Multiple imputation in the presence of high-dimensional data | Statistical Methods in Medical Research |
[172] | 2020 | Multidisciplinary Digital Publishing Institute | A method for sensor-based activity recognition in missing data scenario | Sensors |
[31] | 2012 | Springer | Analysis of missing data | Missing data |
[65] | 2015 | Springer | CKNNI: an improved knn-based missing value handling technique | International Conference on Intelligent Computing |
[126] | 2015 | Springer | Missing data imputation by K nearest neighbours based on grey relational structure and mutual information | Applied Intelligence |
[63] | 2016 | Springer | Nearest neighbor imputation algorithms: a critical evaluation | BMC medical informatics and decision making |
[105] | 2017 | Springer | Multiple imputation and ensemble learning for classification with incomplete data | Intelligent and Evolutionary Systems |
[68] | 2018 | Springer | NS-kNN: a modified k-nearest neighbors approach for imputing metabolomics data | Metabolomics |
[136] | 2019 | Springer | Analysis of interpolation algorithms for the missing values in IoT time series: a case of air quality in Taiwan | The Journal of Supercomputing |
[39] | 2020 | Springer Open | SICE: an improved missing data imputation technique | Journal of Big Data |
[138] | 2020 | Springer | BEST: a decision tree algorithm that handles missing values | Computational Statistics |
[173] | 2020 | Springer | A new multi-view learning machine with incomplete data | Pattern Analysis and Applications |
[140] | 2021 | Springer | Multistage model for accurate prediction of missing values using imputation methods in heart disease dataset | Innovative Data Communication Technologies and Application |
[14] | 2021 | Springer | A new imputation method based on genetic programming and weighted KNN for symbolic regression with incomplete data | Soft Computing |
[174] | 2021 | Springer | An exploration of online missing value imputation in non-stationary data stream | SN Computer Science |
[175] | 2021 | Springer | Data imputation in wireless sensor network using deep learning techniques | Data Analytics and Management |
[176] | 2020 | Taylor and Francis | Handling incomplete and missing data in water network database using imputation methods | Sustainable and Resilient Infrastructure |
Publication | RMSE | MAE | MSE | AUC |
---|---|---|---|---|
[125] | \({\times }\) | \({\times }\) | \(\checkmark\) | \({\times }\) |
[129] | \({\times }\) | \({\times }\) | \(\checkmark\) | \(\checkmark\) |
[74] | \(\checkmark\) | \({\times }\) | \({\times }\) | \({\times }\) |
[127] | \({\times }\) | \({\times }\) | \({\times }\) | \(\checkmark\) |
[131] | \(\checkmark\) | \(\checkmark\) | \({\times }\) | \({\times }\) |
[133] | \(\checkmark\) | \({\times }\) | \({\times }\) | \({\times }\) |
[135] | \(\checkmark\) | \({\times }\) | \({\times }\) | \({\times }\) |
[136] | \({\times }\) | \(\checkmark\) | \(\checkmark\) | \({\times }\) |
[126] | \(\checkmark\) | \({\times }\) | \({\times }\) | \({\times }\) |
[130] | \(\checkmark\) | \(\checkmark\) | \({\times }\) | \({\times }\) |
[128] | \(\checkmark\) | \({\times }\) | \({\times }\) | \({\times }\) |
[139] | \({\times }\) | \({\times }\) | \({\times }\) | \(\checkmark\) |
[138] | \({\times }\) | \({\times }\) | \({\times }\) | \(\checkmark\) |
[140] | \(\checkmark\) | \({\times }\) | \({\times }\) | \({\times }\) |
[174] | \(\checkmark\) | \({\times }\) | \({\times }\) | \({\times }\) |
[156] | \(\checkmark\) | \({\times }\) | \({\times }\) | \({\times }\) |
[158] | \(\checkmark\) | \(\checkmark\) | \({\times }\) | \({\times }\) |
[170] | \(\checkmark\) | \({\times }\) | \({\times }\) | \({\times }\) |
[15] | \(\checkmark\) | \({\times }\) | \({\times }\) | \({\times }\) |
[142] | \({\times }\) | \({\times }\) | \(\checkmark\) | \(\checkmark\) |
[38] | \({\times }\) | \({\times }\) | \(\checkmark\) | \(\checkmark\) |
[13] | \(\checkmark\) | \({\times }\) | \({\times }\) | \({\times }\) |
[141] | \(\checkmark\) | \({\times }\) | \({\times }\) | \({\times }\) |
Experimental evaluation on machine learning methods
Missing ratio% | KNN | RF |
---|---|---|
5 | 0.6693 | 0.6486 |
10 | 0.2382 | 0.2860 |
15 | 0.1932 | 0.2578 |
Missing ratio% | KNN | RF |
---|---|---|
5 | 0.2099 | 0.0549 |
10 | 0.1581 | 0.0416 |
15 | 0.1487 | 0.0654 |
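An evaluation of this kind can be reproduced in outline: inject MCAR holes at a given ratio, impute with KNN and with a random forest, and score the imputed values by RMSE against the ground truth (an illustrative sketch with synthetic data and scikit-learn, not the exact protocol behind the tables above):

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)

# Synthetic stand-in for the evaluation data: column 0 is correlated
# with column 1 and independent of column 2.
n = 400
x1 = rng.normal(size=n)
X = np.column_stack([x1,
                     x1 + rng.normal(scale=0.3, size=n),
                     rng.normal(size=n)])

def evaluate(X, ratio):
    """RMSE of imputing MCAR holes punched into column 0 at the given ratio."""
    Xm = X.copy()
    holes = rng.random(len(X)) < ratio
    Xm[holes, 0] = np.nan

    # KNN imputation over the whole matrix.
    knn_filled = KNNImputer(n_neighbors=5).fit_transform(Xm)

    # Random-forest imputation: predict column 0 from the complete columns.
    rf = RandomForestRegressor(n_estimators=50, random_state=0)
    rf.fit(Xm[~holes][:, 1:], Xm[~holes][:, 0])
    rf_filled = Xm.copy()
    rf_filled[holes, 0] = rf.predict(Xm[holes][:, 1:])

    rmse = lambda Z: float(np.sqrt(np.mean((Z[holes, 0] - X[holes, 0]) ** 2)))
    return rmse(knn_filled), rmse(rf_filled)

for ratio in (0.05, 0.10, 0.15):
    print(ratio, evaluate(X, ratio))
```

Because the holes are drawn at random, the ranking of the two imputers can vary between runs and missing ratios, which mirrors the mixed KNN/RF results in the tables above.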