A peek into the black box: exploring classifiers by randomization

Written by: Andreas Henelius, Kai Puolamäki, Henrik Boström, Lars Asker, Panagiotis Papapetrou

Published in: Data Mining and Knowledge Discovery | Issue 5-6/2014

Publication date: 1 September 2014

Abstract

Classifiers are often opaque and cannot easily be inspected to gain understanding of which factors are of importance. We propose an efficient iterative algorithm to find the attributes and dependencies used by any classifier when making predictions. The performance and utility of the algorithm are demonstrated on two synthetic and 26 real-world datasets, using 15 commonly used learning algorithms to generate the classifiers. The empirical investigation shows that the novel algorithm is indeed able to find groupings of interacting attributes exploited by the different classifiers. These groupings allow for finding similarities among classifiers for a single dataset as well as for determining the extent to which different classifiers exploit such interactions in general.
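
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of probing a black-box classifier by randomization, and not the authors' method. It assumes a hypothetical toy classifier `predict` over three binary attributes, shuffles one attribute column at a time, and reports the fraction of predictions that stay unchanged. The names `predict`, `fidelity_after_shuffling` and `scores` are illustrative and do not come from the paper.

```python
import random


def predict(row):
    # Hypothetical black-box classifier (an assumption for illustration):
    # the class is the XOR of attributes 0 and 1; attribute 2 is ignored.
    return row[0] ^ row[1]


def fidelity_after_shuffling(rows, column, rng):
    """Shuffle one attribute column; return the fraction of unchanged predictions."""
    original = [predict(r) for r in rows]
    values = [r[column] for r in rows]
    rng.shuffle(values)
    # Rebuild each row with the shuffled value in the chosen column.
    shuffled_rows = [r[:column] + (v,) + r[column + 1:] for r, v in zip(rows, values)]
    randomized = [predict(r) for r in shuffled_rows]
    same = sum(a == b for a, b in zip(original, randomized))
    return same / len(rows)


rng = random.Random(0)  # fixed seed so the run is repeatable
rows = [tuple(rng.randint(0, 1) for _ in range(3)) for _ in range(200)]
scores = {col: fidelity_after_shuffling(rows, col, rng) for col in range(3)}
for col, score in scores.items():
    print(f"attribute {col}: fidelity {score:.2f}")
```

Shuffling an attribute the classifier ignores leaves every prediction unchanged, while shuffling an attribute it relies on changes many of them. This single-attribute probe does not by itself reveal groupings of interacting attributes, which the abstract names as the goal of the proposed algorithm.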

The algorithm can be downloaded from https://bitbucket.org/aheneliu/goldeneye/ (accessed 7 July 2014) or easily installed using the command install_bitbucket from the devtools package (Wickham and Chang 2014) as follows: install_bitbucket(repo = "goldeneye", username = "aheneliu").

References

Andrews R, Diederich J, Tickle AB (1995) Survey and critique of techniques for extracting rules from trained artificial neural networks. Knowl Based Syst 8(6):373–389

Bache K, Lichman M (2014) UCI machine learning repository. http://archive.ics.uci.edu/ml

Breiman L (2001) Random forests. Mach Learn 45(1):5–32

Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Wadsworth, Belmont

Chanda P, Cho YR, Zhang A, Ramanathan M (2009) Mining of attribute interactions using information theoretic metrics. In: IEEE International Conference on Data Mining Workshops, pp 350–355

Clark P, Niblett T (1989) The CN2 induction algorithm. Mach Learn 3(4):261–283

Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20:273–293

De Bie T (2011a) An information theoretic framework for data mining. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, KDD ’11, pp 564–572

De Bie T (2011b) Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Min Knowl Discov 23(3):407–446

Domingos P, Pazzani MJ (1997) On the optimality of the simple Bayesian classifier under zero-one loss. Mach Learn 29(2–3):103–130

Freitas AA (2001) Understanding the crucial role of attribute interaction in data mining. Artif Intell Rev 16(3):177–199

Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA data mining software: an update. ACM SIGKDD Explor Newsl 11(1):10–18

Hanhijärvi S, Ojala M, Vuokko N, Puolamäki K, Tatti N, Mannila H (2009) Tell me something I don’t know: Randomization strategies for iterative data mining. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, KDD ’09, pp 379–388

Henelius A, Korpela J, Puolamäki K (2013) Explaining interval sequences by randomization. In: Blockeel H, Kersting K, Nijssen S, Železný F (eds) Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer Science, vol 8188, pp 337–352

Hornik K, Buchta C, Zeileis A (2009) Open-source machine learning: R meets Weka. Comput Stat 24(2):225–232. doi:10.1007/s00180-008-0119-7

Ishibuchi H, Nojima Y (2007) Analysis of interpretability-accuracy tradeoff of fuzzy systems by multiobjective fuzzy genetics-based machine learning. Int J Approx Reason 44(1):4–31

Jakulin A, Bratko I, Smrke D, Demsar J, Zupan B (2003) Attribute interactions in medical data analysis. In: 9th Conference on Artificial Intelligence in Medicine in Europe, pp 229–238

Janitza S, Strobl C, Boulesteix AL (2013) An AUC-based permutation variable importance measure for random forests. BMC Bioinform 14:119

Johansson U, König R, Niklasson L (2003) Rule extraction from trained neural networks using genetic programming. In: 13th International Conference on Artificial Neural Networks, pp 13–16

Liaw A, Wiener M (2002) Classification and regression by randomForest. R News 2(3):18–22. http://CRAN.R-project.org/doc/Rnews/

Lijffijt J, Papapetrou P, Puolamäki K (2014) A statistical significance testing approach to mining the most informative set of patterns. Data Min Knowl Discov 28:238–263. doi:10.1007/s10618-012-0298-2

Misra G, Golshan B, Terzi E (2012) A Framework for Evaluating the Smoothness of Data-Mining Results. In: Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, vol II, pp 660–675

Ojala M, Garriga GC (2010) Permutation tests for studying classifier performance. J Mach Learn Res 11:1833–1863

Plate T (1999) Accuracy versus interpretability in flexible modeling: implementing a tradeoff using Gaussian process models. Behaviormetrika 26:29–50

Pulkkinen P, Koivisto H (2008) Fuzzy classifier identification using decision tree and multiobjective evolutionary algorithms. Int J Approx Reason 48(2):526–543

Quinlan JR (1986) Induction of decision trees. Mach Learn 1:81–106

R Core Team (2014) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, http://www.R-project.org/

Segal MR, Cummings MP, Hubbard AE (2001) Relating amino acid sequence to phenotype: analysis of peptide-binding data. Biometrics 57(2):632–643

Strobl C, Boulesteix AL, Zeileis A, Hothorn T (2007) Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics 8(25):

Strobl C, Boulesteix AL, Kneib T, Augustin T, Zeileis A (2008) Conditional variable importance for random forests. BMC Bioinform 9:307

Wickham H, Chang W (2014) devtools: Tools to make developing R code easier. http://CRAN.R-project.org/package=devtools, R package version 1.5

Zacarias OP, Boström H (2013) Comparing support vector regression and random forests for predicting malaria incidence in Mozambique. In: International conference on advances in ICT for Emerging regions, IEEE, pp 217–221

Zhao Z, Liu H (2007) Searching for interacting features. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence, pp 1156–1161

Zhao Z, Liu H (2009) Searching for interacting features in subset selection. Intell Data Anal 13(2):207–228

Title: A peek into the black box: exploring classifiers by randomization
Written by: Andreas Henelius, Kai Puolamäki, Henrik Boström, Lars Asker, Panagiotis Papapetrou
Publication date: 1 September 2014
Publisher: Springer US
Published in: Data Mining and Knowledge Discovery / Issue 5-6/2014
Print ISSN: 1384-5810
Electronic ISSN: 1573-756X
DOI: https://doi.org/10.1007/s10618-014-0368-8
