Published in: Earth Science Informatics 2/2024

15.02.2024 | RESEARCH

K-Means Featurizer: A booster for intricate datasets

Authors: Kouao Laurent Kouadio, Jianxin Liu, Rong Liu, Yongfei Wang, Wenxiang Liu

Abstract

Machine Learning (ML) has become pivotal across various fields, offering innovative solutions to complex data challenges. Professionals typically seek models that excel in both performance and reliability, aiming to achieve optimal generalization on future data. To this end, a variety of methods, such as dummy coding, up/down-sampling, and bin-counting, have been explored. However, finding a solution that effectively navigates the intricacies of limited and complex datasets remains a challenge. This study introduces the K-Means Featurizer (KMF), an algorithm designed to enhance model performance and reliability, especially in scenarios involving complex and limited datasets. KMF employs K-Means clustering to generate enriched features that provide a nuanced understanding of the data, effectively balancing the similarity between the target variable and the feature space. This results in a more efficient predictive task by minimizing Euclidean distances and enhancing model generalizability. Our research validates KMF's effectiveness through an experiment in geoscience engineering, focusing on the prediction of hydraulic conductivity (K), a vital parameter in well monitoring and infrastructure planning. Traditionally, K extraction is laborious and costly, requiring extensive pumping tests. KMF's application in this context demonstrates its potential to substantially reduce data losses during such operations. Applying KMF to Extreme Gradient Boosting, Random Forest, K-Nearest Neighbors, Support Vector Machine, and multilayer neural network models resulted in a significant improvement in prediction accuracy, with K-scores reaching up to 90%. While our experiment centers on geoscience engineering, KMF's utility extends to various domains facing similar data intricacies. Its adaptability to different types of complex datasets positions it as a valuable tool for diverse data-driven applications.
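
The abstract describes the mechanism only briefly, so the following is a minimal sketch of the general idea of K-Means featurization, not the authors' implementation. It assumes that the target is appended as one extra, scaled dimension while the clusters are fitted, and that each sample is then enriched with a one-hot encoding of its nearest centroid in feature space. The class name `KMeansFeaturizerSketch`, the `target_scale` parameter, and the plain Lloyd's-algorithm helper are illustrative assumptions rather than names taken from the paper.

```python
import random


def _sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda i: _sq_dist(point, centroids[i]))


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns a list of k centroids."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[_nearest(p, centroids)].append(p)
        # Move each centroid to the mean of its members (keep it if empty).
        new_centroids = [
            [sum(col) / len(members) for col in zip(*members)] if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


class KMeansFeaturizerSketch:
    """Hypothetical featurizer: appends one-hot cluster membership to each row."""

    def __init__(self, n_clusters=2, target_scale=1.0, seed=0):
        self.n_clusters = n_clusters
        self.target_scale = target_scale  # assumed knob: weight of the target when clustering
        self.seed = seed

    def fit(self, X, y=None):
        n_features = len(X[0])
        if y is None:
            data = [list(row) for row in X]
        else:
            # Assumption: the scaled target acts as an extra clustering dimension.
            data = [list(row) + [self.target_scale * t] for row, t in zip(X, y)]
        centroids = kmeans(data, self.n_clusters, seed=self.seed)
        # Keep only the feature-space part so unseen data can be assigned without y.
        self.centroids_ = [c[:n_features] for c in centroids]
        return self

    def transform(self, X):
        enriched = []
        for row in X:
            cid = _nearest(row, self.centroids_)
            one_hot = [1.0 if i == cid else 0.0 for i in range(self.n_clusters)]
            enriched.append(list(row) + one_hot)
        return enriched


# Toy data: two well-separated groups with different target levels.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
y = [0.1, 0.2, 0.15, 2.0, 2.1, 1.9]
features = KMeansFeaturizerSketch(n_clusters=2, target_scale=1.0).fit(X, y).transform(X)
```

The enriched rows in `features` can then be passed to any downstream regressor or classifier. In this sketch, a larger `target_scale` lets the target pull the cluster boundaries more strongly, while `target_scale=0` or `y=None` reduces it to ordinary K-Means on the features alone.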

Metadata
Title
K-Means Featurizer: A booster for intricate datasets
Authors
Kouao Laurent Kouadio
Jianxin Liu
Rong Liu
Yongfei Wang
Wenxiang Liu
Publication date
15.02.2024
Publisher
Springer Berlin Heidelberg
Published in
Earth Science Informatics / Issue 2/2024
Print ISSN: 1865-0473
Electronic ISSN: 1865-0481
DOI
https://doi.org/10.1007/s12145-024-01236-3
