2024 | OriginalPaper | Book chapter

The Impact of Data Valuation on Feature Importance in Classification Models

Written by: Malick Ebiele, Malika Bendechache, Marie Ward, Una Geary, Declan Byrne, Donnacha Creagh, Rob Brennan

Published in: Proceedings of Third International Conference on Computing and Communication Networks

Publisher: Springer Nature Singapore

Abstract

This paper investigates the impact of data valuation metrics (variability and coefficient of variation) on feature importance in classification models. Data valuation is an emerging topic in data science, accounting, data quality, and information economics, concerned with methods for calculating the value of data. Feature importance (or ranking) matters both for explaining how black-box machine learning models make predictions and for selecting the most predictive features when training these models. Existing feature importance algorithms are either computationally expensive (e.g. SHAP values) or biased (e.g. Gini importance in tree-based models). No previous investigation of the impact of data valuation metrics on feature importance has been conducted. Five popular machine learning models (eXtreme Gradient Boosting (XGB), Random Forest (RF), Logistic Regression (LR), Multi-Layer Perceptron (MLP), and Naive Bayes (NB)) and six widely implemented feature ranking techniques (Information Gain, Gini importance, Frequency Importance, Cover Importance, Permutation Importance, and SHAP values) were used to investigate the relationship between feature importance and data valuation metrics for a clinical use case. XGB outperforms the other models with a weighted F1-score of 79.72%. The findings suggest that features with variability greater than 0.4 or a coefficient of variation greater than 23.4 have little to no value; therefore, these features can be filtered out during feature selection. This result, if generalisable, could simplify feature selection and data preparation.
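The filtering rule suggested in the abstract can be illustrated with a minimal sketch. It assumes the coefficient of variation is the population standard deviation divided by the mean, expressed as a percentage; the abstract does not give the paper's exact definition, and it does not define variability at all, so only the coefficient-of-variation cut-off is shown. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev

COV_THRESHOLD = 23.4  # cut-off quoted in the abstract


def coefficient_of_variation(values):
    """Return std / mean * 100 (assumed definition, not confirmed by the source)."""
    mu = mean(values)
    if mu == 0:
        return float("inf")  # CoV is undefined for a zero mean; treat it as unbounded
    return pstdev(values) / mu * 100


def filter_by_cov(features, threshold=COV_THRESHOLD):
    """Keep only the features whose CoV does not exceed the threshold."""
    return {
        name: vals
        for name, vals in features.items()
        if coefficient_of_variation(vals) <= threshold
    }


# Hypothetical toy data, not from the paper.
features = {
    "stable": [10.0, 11.0, 9.0, 10.0],
    "noisy": [1.0, 9.0, 2.0, 12.0],
}
kept = filter_by_cov(features)
print(sorted(kept))
```

On this toy input only the low-dispersion feature survives the filter. Whether the same cut-off transfers to other datasets is exactly the generalisability question the abstract leaves open.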

Footnotes
3. \(11.85 = \text{min CoV} + 0.23 \times (\text{max CoV} - \text{min CoV})\).
4. \(31.3 = \text{min CoV} + 0.55 \times (\text{max CoV} - \text{min CoV})\).
5. \(23.4 = \text{min CoV} + 0.42 \times (\text{max CoV} - \text{min CoV})\).
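The three footnotes share one pattern: each threshold sits at a fixed fraction of the way from the minimum to the maximum coefficient of variation. The sketch below expresses that pattern as a function and uses footnotes 3 and 4 to back out the implied minimum and range, then checks that footnote 5 follows. The implied minimum and maximum are an inference from the rounded figures in the footnotes, not values stated in this excerpt, and the function and variable names are hypothetical.

```python
def range_threshold(min_cov, max_cov, fraction):
    """Threshold placed at `fraction` of the way from min CoV to max CoV."""
    return min_cov + fraction * (max_cov - min_cov)


# Footnotes 3 and 4 give two equations in two unknowns (min CoV and the range).
span = (31.3 - 11.85) / (0.55 - 0.23)  # implied max CoV - min CoV
min_cov = 11.85 - 0.23 * span          # implied min CoV
max_cov = min_cov + span               # implied max CoV

# Footnote 5 should be reproduced up to rounding of the published fractions.
predicted = range_threshold(min_cov, max_cov, 0.42)
print(round(min_cov, 2), round(max_cov, 2), round(predicted, 2))
```

Because the published fractions are rounded to two decimals, the recovered minimum and maximum are approximate; the point is only that the three footnotes are mutually consistent under the stated formula.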
 
Metadata
Title
The Impact of Data Valuation on Feature Importance in Classification Models
Written by
Malick Ebiele
Malika Bendechache
Marie Ward
Una Geary
Declan Byrne
Donnacha Creagh
Rob Brennan
Copyright year
2024
Publisher
Springer Nature Singapore
DOI
https://doi.org/10.1007/978-981-97-0892-5_47