2014 | Original Paper | Book Chapter

Interpreting Random Forest Classification Models Using a Feature Contribution Method

Written by: Anna Palczewska, Jan Palczewski, Richard Marchese Robinson, Daniel Neagu

Published in: Integration of Reusable Systems

Publisher: Springer International Publishing

Abstract

Model interpretation is one of the key aspects of the model evaluation process. The explanation of the relationship between model variables and outputs is relatively easy for statistical models, such as linear regressions, thanks to the availability of model parameters and their statistical significance. For “black box” models, such as random forests, this information is hidden inside the model structure. This work presents an approach for computing feature contributions for random forest classification models. It allows for the determination of the influence of each variable on the model prediction for an individual instance. By analysing feature contributions for a training dataset, the most significant variables can be determined and their typical contributions towards predictions made for individual classes, i.e., class-specific feature contribution “patterns”, can be discovered. These patterns represent a standard behaviour of the model and allow for an additional assessment of the model reliability for new data. Interpretation of feature contributions for two UCI benchmark datasets shows the potential of the proposed methodology. The robustness of results is demonstrated through an extensive analysis of feature contributions calculated for a large number of generated random forest models.
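The abstract does not give the formula for a feature contribution, so the following Python sketch rests on an assumption: at every split on the path of an instance, the split feature is credited with the change in the class fraction between the parent node and the child the instance moves to. The tree layout, the names class_fraction and feature_contributions, and all counts are hypothetical and serve only as an illustration for a single tree.

```python
# Hypothetical toy tree; the node layout and all counts are illustrative
# assumptions. Each node stores the training class counts that reached it.

def class_fraction(counts, cls):
    """Fraction of training instances at a node that belong to class `cls`."""
    return counts[cls] / sum(counts)

def feature_contributions(tree, x, cls, n_features):
    """Follow the path of instance x and credit each split feature with the
    change in the class fraction between the parent and the chosen child."""
    contrib = [0.0] * n_features
    node = tree
    while "feature" in node:  # internal nodes carry a split; leaves do not
        go_left = x[node["feature"]] <= node["threshold"]
        child = node["left"] if go_left else node["right"]
        contrib[node["feature"]] += (
            class_fraction(child["counts"], cls)
            - class_fraction(node["counts"], cls)
        )
        node = child
    return contrib

toy_tree = {
    "counts": [50, 50], "feature": 0, "threshold": 2.5,
    "left": {"counts": [40, 10]},
    "right": {
        "counts": [10, 40], "feature": 1, "threshold": 1.0,
        "left": {"counts": [8, 2]},
        "right": {"counts": [2, 38]},
    },
}

x = [3.0, 1.7]  # hypothetical instance with two features
contrib = feature_contributions(toy_tree, x, cls=1, n_features=2)
root_fraction = class_fraction(toy_tree["counts"], 1)
leaf_prediction = root_fraction + sum(contrib)
```

Under this assumption the contributions are additive: the class fraction at the root plus the sum of the contributions equals the class fraction in the leaf reached by the instance (38/40 in this toy tree, up to floating-point rounding).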

The distribution \(\hat{Y}_i\) is calculated by the function predict in the R package randomForest [11] when the type of prediction is set to prob.
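In R, the call in question has the form predict(model, newdata, type = "prob"). As a language-neutral illustration, the Python sketch below assumes that such a class distribution is the fraction of trees voting for each class. That reading, the helper name class_distribution, and the vote data are assumptions for illustration only and do not reproduce the randomForest implementation.

```python
from collections import Counter

def class_distribution(tree_votes, classes):
    """Return the fraction of trees voting for each class, in `classes` order."""
    counts = Counter(tree_votes)
    n_trees = len(tree_votes)
    return [counts[c] / n_trees for c in classes]

# Hypothetical votes of a five-tree forest for a single instance.
votes = ["setosa", "versicolor", "versicolor", "virginica", "versicolor"]
classes = ["setosa", "versicolor", "virginica"]

dist = class_distribution(votes, classes)
```

The entries of dist are non-negative and sum to one, so they can be read as a distribution over the classes for that instance.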

A covariance matrix of feature contributions has \(F(F+1)/2\) distinct entries, where \(F\) is the number of features. This value is usually larger than the size of a cluster, making it impossible to retrieve useful information about the dependence structure of feature contributions. Application of more advanced methods, such as principal component analysis, is left for future research.
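The count follows from symmetry: a covariance matrix has \(F\) diagonal entries and \(F(F-1)/2\) distinct off-diagonal entries, which add up to \(F(F+1)/2\). The short Python sketch below checks the formula against a brute-force count; the function names and the feature counts 4 and 30 are chosen purely for illustration.

```python
def distinct_covariance_entries(n_features):
    """Number of distinct entries in a symmetric n_features x n_features matrix."""
    return n_features * (n_features + 1) // 2

def count_upper_triangle(n_features):
    """Brute-force check: count index pairs (i, j) with i <= j."""
    return sum(1 for i in range(n_features) for j in range(i, n_features))

for f in (4, 30):
    print(f, distinct_covariance_entries(f))
```

With 30 features the matrix already has 465 distinct entries, so a cluster would need well over that many instances before the covariance structure could be estimated with any reliability.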

The likelihood is obtained by applying the exponential function to the log-likelihood.
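As a small illustration, assuming a log-likelihood formed as a sum of log-probabilities, exponentiating it recovers the product of the probabilities. The values below are hypothetical and only demonstrate the identity.

```python
import math

# Hypothetical per-observation probabilities (illustrative values only).
probabilities = [0.9, 0.8, 0.95]

# Log-likelihood as a sum of logs, then mapped back with the exponential.
log_likelihood = sum(math.log(p) for p in probabilities)
likelihood = math.exp(log_likelihood)

# Direct product for comparison; both routes agree up to rounding.
direct_product = math.prod(probabilities)
```

Working on the log scale and exponentiating only at the end is a common choice because products of many small probabilities can underflow.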

1. Tropsha, A.: Best practices for QSAR model development, validation, and exploitation. Mol. Inform. 29(6–7), 476–488 (2010)

2. Rosenbaum, L., Hinselmann, G., Jahn, A., Zell, A.: Interpreting linear support vector machine models with heat map molecule coloring. J. Cheminf. 3(1), 11 (2011)

3. Carlsson, L., Helgee, E.A., Boyer, S.: Interpretation of nonlinear QSAR models applied to ames mutagenicity data. J. Chem. Inf. Model. 49(11), 2551–2558 (2009)

4. Baehrens, D., Schroeter, T., Harmeling, S., Kawanabe, M., Hansen, K., Muller, K.R.: How to explain individual classification decisions. J. Mach. Learn. Res. 11, 1803–1831 (2010)

5. Hansen, K., Baehrens, D., Schroeter, T., Rupp, M., Muller, K.R.: Visual interpretation of kernel-based prediction models. Mol. Inform. 30(9), 817–826 (2011)

6. Kuz’min, V.E., Polishchuk, P.G., Artemenko, A.G., Andronati, S.A.: Interpretation of QSAR models based on random forest methods. Mol. Inform. 30(6–7), 593–603 (2011)

7. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)

8. Breiman, L., Cutler, A.: Random forests. http://www.stat.berkeley.edu/~breiman/RandomForests (2008)

9. Strobl, C., Boulesteix, A.-L., Zeileis, A., Hothorn, T.: Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinf. 8(1), 25 (2007)

10. Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regression Trees. Wadsworth & Brooks/Cole Advanced Books & Software, Monterey (1984)

11. Liaw, A., Wiener, M.: Classification and regression by randomforest. R News 2(3), 18–22 (2002)

12. Iris dataset. http://archive.ics.uci.edu/ml/datasets/Iris

13. Cormen, T.H., Stein, C., Rivest, R.L., Leiserson, C.E.: Introduction to Algorithms. 2nd edn. McGraw-Hill Higher Education, New York (2001)

14. Hand, D.J., Smyth, P., Mannila, H.: Principles of Data Mining. MIT Press, Cambridge (2001)

15. Rajaraman, A., Ullman, J.D.: Mining of Massive Datasets. Cambridge University Press, Cambridge (2012)

16. Breast Cancer Wisconsin Diagnostic dataset. http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

17. CRAN - The Comprehensive R Archive Network. http://cran.r-project.org/

Title: Interpreting Random Forest Classification Models Using a Feature Contribution Method
Written by: Anna Palczewska, Jan Palczewski, Richard Marchese Robinson, Daniel Neagu
Publisher: Springer International Publishing
Book: Integration of Reusable Systems
Print ISBN: 978-3-319-04716-4
Electronic ISBN: 978-3-319-04717-1
Copyright year: 2014
DOI: https://doi.org/10.1007/978-3-319-04717-1_9
