Skip to main content
Erschienen in: Data Mining and Knowledge Discovery 5-6/2014

01.09.2014

A peek into the black box: exploring classifiers by randomization

verfasst von: Andreas Henelius, Kai Puolamäki, Henrik Boström, Lars Asker, Panagiotis Papapetrou

Erschienen in: Data Mining and Knowledge Discovery | Ausgabe 5-6/2014

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Classifiers are often opaque and cannot easily be inspected to gain understanding of which factors are of importance. We propose an efficient iterative algorithm to find the attributes and dependencies used by any classifier when making predictions. The performance and utility of the algorithm is demonstrated on two synthetic and 26 real-world datasets, using 15 commonly used learning algorithms to generate the classifiers. The empirical investigation shows that the novel algorithm is indeed able to find groupings of interacting attributes exploited by the different classifiers. These groupings allow for finding similarities among classifiers for a single dataset as well as for determining the extent to which different classifiers exploit such interactions in general.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Fußnoten
1
The algorithm can be downloaded from https://​bitbucket.​org/​aheneliu/​goldeneye/​ (accessed 7 July 2014) or easily installed using the command install_bitbucket from the devtools package (Wickham and Chang 2014) as follows: install_bitbucket(repo = “goldeneye”, username = “aheneliu”).
 
Literatur
Zurück zum Zitat Andrews R, Diederich J, Tickle AB (1995) Survey and critique of techniques for extracting rules from trained artificial neural networks. Knowl Based Syst 8(6):373–389CrossRef Andrews R, Diederich J, Tickle AB (1995) Survey and critique of techniques for extracting rules from trained artificial neural networks. Knowl Based Syst 8(6):373–389CrossRef
Zurück zum Zitat Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Wadsworth, BelmontMATH Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Wadsworth, BelmontMATH
Zurück zum Zitat Chanda P, Cho YR, Zhang A, Ramanathan M (2009) Mining of attribute interactions using information theoretic metrics. In: IEEE International Conference on Data Mining Workshops, pp 350–355 Chanda P, Cho YR, Zhang A, Ramanathan M (2009) Mining of attribute interactions using information theoretic metrics. In: IEEE International Conference on Data Mining Workshops, pp 350–355
Zurück zum Zitat Clark P, Niblett T (1989) The CN2 induction algorithm. Mach Learn 3(4):261–283 Clark P, Niblett T (1989) The CN2 induction algorithm. Mach Learn 3(4):261–283
Zurück zum Zitat Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20:273–293MATH Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20:273–293MATH
Zurück zum Zitat De Bie T (2011a) An information theoretic framework for data mining. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, KDD ’11, pp 564–572 De Bie T (2011a) An information theoretic framework for data mining. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, KDD ’11, pp 564–572
Zurück zum Zitat De Bie T (2011b) Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Min Knowl Discov 23(3):407–446CrossRefMATHMathSciNet De Bie T (2011b) Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Min Knowl Discov 23(3):407–446CrossRefMATHMathSciNet
Zurück zum Zitat Domingos P, Pazzani MJ (1997) On the optimality of the simple bayesian classifier under zero-one loss. Machine Learning 29(2–3):103–130CrossRefMATH Domingos P, Pazzani MJ (1997) On the optimality of the simple bayesian classifier under zero-one loss. Machine Learning 29(2–3):103–130CrossRefMATH
Zurück zum Zitat Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The weka data mining software: an update. ACM SIGKDD Explor Newsl 11(1):10–18CrossRef Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The weka data mining software: an update. ACM SIGKDD Explor Newsl 11(1):10–18CrossRef
Zurück zum Zitat Hanhijärvi S, Ojala M, Vuokko N, Puolamäki K, Tatti N, Mannila H (2009) Tell me something i don’t know: Randomization strategies for iterative data mining. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, KDD ’09, pp 379–388 Hanhijärvi S, Ojala M, Vuokko N, Puolamäki K, Tatti N, Mannila H (2009) Tell me something i don’t know: Randomization strategies for iterative data mining. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, KDD ’09, pp 379–388
Zurück zum Zitat Henelius A, Korpela J, Puolamäki K (2013) Explaining interval sequences by randomization. In: Blockeel H, Kersting K, Nijssen S, Z̆elezný Filip (eds) Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer Science, vol 8188, pp 337–352 Henelius A, Korpela J, Puolamäki K (2013) Explaining interval sequences by randomization. In: Blockeel H, Kersting K, Nijssen S, Z̆elezný Filip (eds) Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer Science, vol 8188, pp 337–352
Zurück zum Zitat Ishibuchi H, Nojima Y (2007) Analysis of interpretability-accuracy tradeoff of fuzzy systems by multiobjective fuzzy genetics-based machine learning. Int J Approx Reason 44(1):4–31CrossRefMATHMathSciNet Ishibuchi H, Nojima Y (2007) Analysis of interpretability-accuracy tradeoff of fuzzy systems by multiobjective fuzzy genetics-based machine learning. Int J Approx Reason 44(1):4–31CrossRefMATHMathSciNet
Zurück zum Zitat Jakulin A, Bratko I, Smrke D, Demsar J, Zupan B (2003) Attribute interactions in medical data analysis. In: 9th Conference on Artificial Intelligence in Medicine in Europe, pp 229–238 Jakulin A, Bratko I, Smrke D, Demsar J, Zupan B (2003) Attribute interactions in medical data analysis. In: 9th Conference on Artificial Intelligence in Medicine in Europe, pp 229–238
Zurück zum Zitat Janitza S, Strobl C, Boulesteix AL (2013) An auc-based permutation variable importance measure for random forests. BMC Bioinform 14:119CrossRef Janitza S, Strobl C, Boulesteix AL (2013) An auc-based permutation variable importance measure for random forests. BMC Bioinform 14:119CrossRef
Zurück zum Zitat Johansson U, König R, Niklasson L (2003) Rule extraction from trained neural networks using genetic programming. In: 13th International Conference on Artificial Neural Networks, pp 13–16 Johansson U, König R, Niklasson L (2003) Rule extraction from trained neural networks using genetic programming. In: 13th International Conference on Artificial Neural Networks, pp 13–16
Zurück zum Zitat Misra G, Golshan B, Terzi E (2012) A Framework for Evaluating the Smoothness of Data-Mining Results. In: Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, vol II, pp 660–675 Misra G, Golshan B, Terzi E (2012) A Framework for Evaluating the Smoothness of Data-Mining Results. In: Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, vol II, pp 660–675
Zurück zum Zitat Ojala M, Garriga GC (2010) Permutation tests for studying classier performance. J Mach Learn Res 11:1833–1863MATHMathSciNet Ojala M, Garriga GC (2010) Permutation tests for studying classier performance. J Mach Learn Res 11:1833–1863MATHMathSciNet
Zurück zum Zitat Plate T (1999) Accuracy versus interpretability in flexible modeling: implementing a tradeoff using gaussian process models. Behaviormetrika 26:29–50CrossRef Plate T (1999) Accuracy versus interpretability in flexible modeling: implementing a tradeoff using gaussian process models. Behaviormetrika 26:29–50CrossRef
Zurück zum Zitat Pulkkinen P, Koivisto H (2008) Fuzzy classifier identification using decision tree and multiobjective evolutionary algorithms. Int J Approx Reason 48(2):526–543CrossRef Pulkkinen P, Koivisto H (2008) Fuzzy classifier identification using decision tree and multiobjective evolutionary algorithms. Int J Approx Reason 48(2):526–543CrossRef
Zurück zum Zitat Quinlan JR (1986) Induction of decision trees. Mach Learn 1:81–106 Quinlan JR (1986) Induction of decision trees. Mach Learn 1:81–106
Zurück zum Zitat Segal MR, Cummings MP, Hubbard AE (2001) Relating amino acid sequence to phenotype: analysis of peptide-binding data. Biometrics 57(2):632–643CrossRefMATHMathSciNet Segal MR, Cummings MP, Hubbard AE (2001) Relating amino acid sequence to phenotype: analysis of peptide-binding data. Biometrics 57(2):632–643CrossRefMATHMathSciNet
Zurück zum Zitat Strobl C, Boulesteix AL, Zeileis A, Hothorn T (2007) Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics 8(25): Strobl C, Boulesteix AL, Zeileis A, Hothorn T (2007) Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics 8(25):
Zurück zum Zitat Strobl C, Boulesteix AL, Kneib T, Augustin T, Zeileis A (2008) Conditional variable importance for random forests. BMC Bioinform 9:307CrossRef Strobl C, Boulesteix AL, Kneib T, Augustin T, Zeileis A (2008) Conditional variable importance for random forests. BMC Bioinform 9:307CrossRef
Zurück zum Zitat Zacarias OP, Boström H (2013) Comparing support vector regression and random forests for predicting malaria incidence in Mozambique. In: International conference on advances in ICT for Emerging regions, IEEE, pp 217–221 Zacarias OP, Boström H (2013) Comparing support vector regression and random forests for predicting malaria incidence in Mozambique. In: International conference on advances in ICT for Emerging regions, IEEE, pp 217–221
Zurück zum Zitat Zhao Z, Liu H (2007) Searching for interacting features. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence, pp 1156–1161 Zhao Z, Liu H (2007) Searching for interacting features. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence, pp 1156–1161
Zurück zum Zitat Zhao Z, Liu H (2009) Searching for interacting features in subset selection. Intell Data Anal 13(2):207–228 Zhao Z, Liu H (2009) Searching for interacting features in subset selection. Intell Data Anal 13(2):207–228
Metadaten
Titel
A peek into the black box: exploring classifiers by randomization
verfasst von
Andreas Henelius
Kai Puolamäki
Henrik Boström
Lars Asker
Panagiotis Papapetrou
Publikationsdatum
01.09.2014
Verlag
Springer US
Erschienen in
Data Mining and Knowledge Discovery / Ausgabe 5-6/2014
Print ISSN: 1384-5810
Elektronische ISSN: 1573-756X
DOI
https://doi.org/10.1007/s10618-014-0368-8

Weitere Artikel der Ausgabe 5-6/2014

Data Mining and Knowledge Discovery 5-6/2014 Zur Ausgabe