Published in: Earth Science Informatics 2/2024

15-02-2024 | RESEARCH

K-Means Featurizer: A booster for intricate datasets

Authors: Kouao Laurent Kouadio, Jianxin Liu, Rong Liu, Yongfei Wang, Wenxiang Liu

Abstract

Machine Learning (ML) has become pivotal across various fields, offering innovative solutions to complex data challenges. Professionals typically seek models that excel in both performance and reliability, aiming to achieve optimal generalization on future data. To this end, a variety of methods such as dummy coding, up/down-sampling, and bin-counting have been explored. However, finding a solution that effectively navigates the intricacies of limited and complex datasets remains a challenge. This study introduces the K-Means Featurizer (KMF), an algorithm designed to enhance model performance and reliability, especially in scenarios involving complex and limited datasets. KMF employs K-Means clustering to generate enriched features that provide a nuanced understanding of the data, effectively balancing the similarity between the target variable and the feature space. This results in a more efficient predictive task by minimizing Euclidean distances and enhancing model generalizability. Our research validates KMF's effectiveness through an experiment in geoscience engineering, focusing on the prediction of hydraulic conductivity (K), a vital parameter in well monitoring and infrastructure planning. Traditionally, K extraction is laborious and costly, requiring extensive pumping tests. KMF's application in this context demonstrates its potential to substantially reduce data losses during such operations. Applying KMF to Extreme Gradient Boosting, Random Forest, K-Neighbors, Support Vector Machine, and Multi-Layer Neural Network models resulted in a significant improvement in prediction accuracy, with K-scores reaching up to 90%. While our experiment centers on geoscience engineering, KMF's utility extends to various domains facing similar data intricacies. Its adaptability to different types of complex datasets positions it as a valuable tool for diverse data-driven applications.
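
The idea of using K-Means clustering to generate an extra feature can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `target_scale` parameter, and the choice to cluster on the features joined with a scaled target are assumptions made here for illustration, and the toy data are invented. The sketch uses plain Lloyd's algorithm with Euclidean distance and appends the nearest-cluster index to each row.

```python
import math
import random


def nearest(point, centroids):
    """Index of the centroid with the smallest Euclidean distance to point."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns a list of k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Move each centroid to the mean of its group (unchanged if the group is empty).
        updated = [
            [sum(col) / len(g) for col in zip(*g)] if g else c
            for g, c in zip(groups, centroids)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def kmeans_featurize(X, y, k, target_scale=1.0, seed=0):
    """Append a cluster-id feature to every row of X (hypothetical helper).

    Assumption: clustering runs on [features, target_scale * y] so that the
    target influences the cluster boundaries. Cluster ids are then assigned
    from the feature part of each centroid only, so unseen rows without a
    target could be encoded the same way.
    """
    joint = [list(row) + [target_scale * t] for row, t in zip(X, y)]
    centroids = kmeans(joint, k, seed=seed)
    feature_centroids = [c[:-1] for c in centroids]
    rows = [list(row) + [nearest(row, feature_centroids)] for row in X]
    return rows, feature_centroids


# Invented toy data: two well-separated groups of samples.
X = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]
y = [1.0, 1.2, 9.8, 10.1]
X_aug, centers = kmeans_featurize(X, y, k=2)
for row in X_aug:
    print(row)  # original features followed by the cluster id
```

In this sketch, a larger `target_scale` lets the target dominate the clustering, while a value of zero reduces it to ordinary K-Means on the features alone. The augmented rows can then be passed to any downstream estimator.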

Metadata
Title
K-Means Featurizer: A booster for intricate datasets
Authors
Kouao Laurent Kouadio
Jianxin Liu
Rong Liu
Yongfei Wang
Wenxiang Liu
Publication date
15-02-2024
Publisher
Springer Berlin Heidelberg
Published in
Earth Science Informatics / Issue 2/2024
Print ISSN: 1865-0473
Electronic ISSN: 1865-0481
DOI
https://doi.org/10.1007/s12145-024-01236-3
