Published in: Soft Computing 8/2021

07-02-2021 | Methodologies and Application

A new imputation method based on genetic programming and weighted KNN for symbolic regression with incomplete data

Authors: Baligh Al-Helali, Qi Chen, Bing Xue, Mengjie Zhang

Abstract

Incompleteness is a problematic data quality challenge in real-world machine learning tasks. Many studies have addressed this challenge. However, most existing studies focus on classification, and only a few address symbolic regression with missing values. In this work, a new imputation method for symbolic regression with incomplete data is proposed. The method aims to improve both the effectiveness and the efficiency of imputing missing values for symbolic regression. It is based on genetic programming (GP) and weighted K-nearest neighbors (KNN): it constructs GP-based models that use the other available features to predict the missing values of incomplete features, and it selects the instances used to construct these models with weighted KNN. Experimental results on real-world data sets show that the proposed method outperforms a number of state-of-the-art methods with respect to imputation accuracy, symbolic regression performance, and imputation time.
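
The two-stage idea in the abstract (select instances with weighted KNN, then build a model from the other features to predict the missing value) can be sketched roughly as follows. This is an illustrative Python sketch, not the authors' implementation. The abstract does not specify the distance weights, the value of K, or the GP configuration, so the feature-weighted Euclidean distance, the caller-supplied `weights`, and all function names here are assumptions. A least-squares line on a single predictor feature stands in for the GP-evolved model.

```python
import math
from typing import Callable, List, Optional, Sequence

Row = List[Optional[float]]  # None marks a missing value (assumed encoding)


def weighted_distance(a: Row, b: Row, weights: Sequence[float], skip: int) -> float:
    """Feature-weighted Euclidean distance over features observed in both
    rows, ignoring column `skip` (the feature being imputed)."""
    total, used = 0.0, 0.0
    for j, (x, y) in enumerate(zip(a, b)):
        if j == skip or x is None or y is None:
            continue
        total += weights[j] * (x - y) ** 2
        used += weights[j]
    return math.sqrt(total / used) if used > 0 else math.inf


def select_neighbours(data: List[Row], target: Row, col: int,
                      weights: Sequence[float], k: int) -> List[Row]:
    """Return the k complete rows closest to `target`."""
    complete = [r for r in data if all(v is not None for v in r)]
    complete.sort(key=lambda r: weighted_distance(r, target, weights, col))
    return complete[:k]


def fit_line(xs: List[float], ys: List[float]) -> Callable[[float], float]:
    """Stand-in for the GP model: least-squares line y = a + b * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    a = my - b * mx
    return lambda x: a + b * x


def impute_column(data: List[Row], col: int, pred_col: int,
                  weights: Sequence[float], k: int) -> List[Row]:
    """Fill missing values in column `col`, fitting one local model per
    incomplete row on its k selected neighbours. Returns a new data set."""
    filled = [list(r) for r in data]
    for i, row in enumerate(data):
        if row[col] is not None or row[pred_col] is None:
            continue  # nothing to impute, or the predictor is also missing
        neighbours = select_neighbours(data, row, col, weights, k)
        if not neighbours:
            continue  # no complete rows to learn from
        model = fit_line([r[pred_col] for r in neighbours],
                         [r[col] for r in neighbours])
        filled[i][col] = model(row[pred_col])
    return filled


# Toy data where the second feature equals 2 * first + 1.
data = [[1.0, 3.0], [2.0, 5.0], [3.0, None], [4.0, 9.0], [5.0, 11.0], [10.0, 21.0]]
imputed = impute_column(data, col=1, pred_col=0, weights=[1.0, 1.0], k=3)
print(imputed[2])
```

In this sketch a model is fitted only on the selected neighbours rather than on all complete instances, which is one plausible reading of how instance selection could reduce imputation time. Replacing `fit_line` with a GP-based regressor would bring the sketch closer to the method the abstract describes.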


Metadata
Title
A new imputation method based on genetic programming and weighted KNN for symbolic regression with incomplete data
Authors
Baligh Al-Helali
Qi Chen
Bing Xue
Mengjie Zhang
Publication date
07-02-2021
Publisher
Springer Berlin Heidelberg
Published in
Soft Computing / Issue 8/2021
Print ISSN: 1432-7643
Electronic ISSN: 1433-7479
DOI
https://doi.org/10.1007/s00500-021-05590-y
