Published in: Medical & Biological Engineering & Computing 11/2020

24.09.2020 | Original Article

Missing data techniques in classification for cardiovascular dysautonomias diagnosis

Authors: Ali Idri, Ilham Kadi, Ibtissam Abnane, José Luis Fernandez-Aleman



Abstract

Missing data (MD) is a common and largely unavoidable problem for data mining (DM)-based decision systems in e-health, since many historical medical datasets contain a large number of missing values. A pre-processing stage is therefore usually required to handle missing values before building any DM-based decision system. The purpose of this paper is to evaluate the impact of MD techniques on classification systems for cardiovascular dysautonomias diagnosis. We analyzed and compared the accuracy rates of four classification techniques, namely random forest (RF), support vector machines (SVM), C4.5 decision tree, and Naive Bayes (NB), under two MD techniques: deletion and imputation with k-nearest neighbors (KNN). A total of 216 experiments were carried out, combining the four classifiers with three missingness mechanisms (MCAR: missing completely at random, MAR: missing at random, and NMAR: not missing at random), two MD techniques (deletion and KNN imputation), and nine MD percentages from 10 to 90% (4 × 3 × 2 × 9 = 216), on a dataset collected from the autonomic nervous system (ANS) unit of the University Hospital Avicenne in Morocco. The results suggest that using KNN imputation rather than deletion improves the accuracy rates of the four classifiers. Moreover, the MD percentage has a negative impact on classification performance regardless of the MD mechanism and MD technique used: the accuracy rates of the four classifiers decrease as the MD percentage increases.
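The experimental design described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the 10-point step between MD percentages, the choice of k = 3, the Euclidean distance, and the rule that MAR and NMAR blank the rows with the lowest values of a driver column are all assumptions made for illustration. The sketch enumerates the experiment grid, injects missing values into one column under each mechanism, and applies the two MD techniques (listwise deletion and a simple KNN imputation).

```python
import itertools
import math
import random

# Factors named in the abstract; the 10-point step is an assumption.
MECHANISMS = ("MCAR", "MAR", "NMAR")
TECHNIQUES = ("deletion", "knn_imputation")
PERCENTAGES = tuple(range(10, 100, 10))  # 10, 20, ..., 90
CLASSIFIERS = ("RF", "SVM", "C4.5", "NB")


def experiment_grid():
    """Every (mechanism, technique, percentage, classifier) combination."""
    return list(itertools.product(MECHANISMS, TECHNIQUES, PERCENTAGES, CLASSIFIERS))


def inject_missing(rows, col, pct, mechanism, rng, driver=0):
    """Return a copy of `rows` with `col` set to None in pct% of the rows.

    Hypothetical simulation rules (not taken from the paper):
    MCAR picks rows uniformly at random, MAR picks the rows with the lowest
    values of another observed column (`driver`), and NMAR picks the rows
    with the lowest values of `col` itself.
    """
    n_missing = round(len(rows) * pct / 100)
    indices = list(range(len(rows)))
    if mechanism == "MCAR":
        chosen = rng.sample(indices, n_missing)
    elif mechanism == "MAR":
        chosen = sorted(indices, key=lambda i: rows[i][driver])[:n_missing]
    elif mechanism == "NMAR":
        chosen = sorted(indices, key=lambda i: rows[i][col])[:n_missing]
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    chosen = set(chosen)
    return [
        [None if (i in chosen and j == col) else value for j, value in enumerate(row)]
        for i, row in enumerate(rows)
    ]


def listwise_delete(rows):
    """Deletion: keep only the rows that have no missing value."""
    return [row for row in rows if None not in row]


def knn_impute(rows, k=3):
    """Fill each missing value with the mean of its k nearest complete rows."""
    donors = listwise_delete(rows)
    if not donors:
        raise ValueError("KNN imputation needs at least one complete row")
    imputed = []
    for row in rows:
        if None not in row:
            imputed.append(list(row))
            continue
        observed = [j for j, value in enumerate(row) if value is not None]

        def distance(donor):
            # Euclidean distance over the columns observed in `row`.
            return math.sqrt(sum((row[j] - donor[j]) ** 2 for j in observed))

        neighbours = sorted(donors, key=distance)[:k]
        imputed.append([
            value if value is not None
            else sum(d[j] for d in neighbours) / len(neighbours)
            for j, value in enumerate(row)
        ])
    return imputed


if __name__ == "__main__":
    print(len(experiment_grid()))  # size of the full experiment grid
    data = [[float(i), float(2 * i)] for i in range(20)]  # toy numeric data
    damaged = inject_missing(data, col=1, pct=30, mechanism="MCAR", rng=random.Random(0))
    print(len(listwise_delete(damaged)), len(knn_impute(damaged)))
```

In this sketch, deletion shrinks the dataset as the MD percentage grows, whereas KNN imputation keeps every row and estimates the blanked values from similar complete rows. A real study would train and score each classifier on the resulting datasets, which is omitted here.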


Metadata
Title
Missing data techniques in classification for cardiovascular dysautonomias diagnosis
Authors
Ali Idri
Ilham Kadi
Ibtissam Abnane
José Luis Fernandez-Aleman
Publication date
24.09.2020
Publisher
Springer Berlin Heidelberg
Published in
Medical & Biological Engineering & Computing / Issue 11/2020
Print ISSN: 0140-0118
Electronic ISSN: 1741-0444
DOI
https://doi.org/10.1007/s11517-020-02266-x
