Published in: Advances in Data Analysis and Classification 4/2018

21.03.2018 | Regular Article

An efficient random forests algorithm for high dimensional data classification

Authors: Qiang Wang, Thanh-Tung Nguyen, Joshua Z. Huang, Thuy Thi Nguyen

Abstract

In this paper, we propose a new random forest (RF) algorithm for classifying high-dimensional data, based on a new subspace feature sampling method and feature value searching. The new subspace sampling method maintains the diversity and randomness of the forest and enables trees with lower prediction error to be generated. A greedy technique is used to handle high-cardinality categorical features for efficient node splitting when building decision trees in the forest. This allows trees to handle very high cardinality while reducing the computational time of building the RF model. Extensive experiments have been conducted on high-dimensional real data sets, including standard machine learning data sets and image data sets. The results demonstrate that the proposed approach for learning RFs significantly reduces prediction errors and outperforms most existing RFs when dealing with high-dimensional data.
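The greedy handling of high-cardinality categorical splits mentioned above lends itself to a short illustration. The abstract does not spell out the authors' exact procedure, so the following is a minimal Python sketch of a standard greedy shortcut of the same flavor for binary classification: order the k categories of a feature by their positive-class proportion, then evaluate only the k - 1 cut points along that ordering instead of the 2^(k-1) - 1 possible category subsets. The function names and the use of Gini impurity are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def _weighted_gini(y_left, y_right):
    """Size-weighted Gini impurity of a binary split (labels are 0/1 ints)."""
    def gini(y):
        if len(y) == 0:
            return 0.0
        p = np.bincount(y, minlength=2) / len(y)
        return 1.0 - np.sum(p ** 2)
    n = len(y_left) + len(y_right)
    return (len(y_left) * gini(y_left) + len(y_right) * gini(y_right)) / n

def greedy_categorical_split(x_cat, y):
    """Greedily find a binary split of a high-cardinality categorical feature.

    Orders the categories by their class-1 proportion, then scans the
    k - 1 cut points along that ordering (a hypothetical stand-in for
    the paper's greedy technique), so the cost grows linearly in k
    rather than exponentially.
    """
    cats = np.unique(x_cat)
    order = sorted(cats, key=lambda c: y[x_cat == c].mean())

    best_score, best_left = np.inf, None
    left = set()
    for c in order[:-1]:                      # only k - 1 candidate splits
        left.add(c)
        mask = np.isin(x_cat, list(left))
        score = _weighted_gini(y[mask], y[~mask])
        if score < best_score:
            best_score, best_left = score, set(left)
    return best_left, best_score

# Toy usage: categories "a".."d" with binary labels.
x = np.array(["a", "b", "c", "a", "b", "c", "d", "d"])
y = np.array([0, 1, 1, 0, 1, 0, 1, 1])
left_categories, score = greedy_categorical_split(x, y)
print(left_categories, score)   # categories routed to the left child
```

For a feature with, say, 1000 distinct values this scans 999 candidate cut points instead of roughly 2^999 subsets, which is where the claimed reduction in tree-building time would come from.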

Metadata
Title
An efficient random forests algorithm for high dimensional data classification
Authors
Qiang Wang
Thanh-Tung Nguyen
Joshua Z. Huang
Thuy Thi Nguyen
Publication date
21.03.2018
Publisher
Springer Berlin Heidelberg
Published in
Advances in Data Analysis and Classification / Issue 4/2018
Print ISSN: 1862-5347
Electronic ISSN: 1862-5355
DOI
https://doi.org/10.1007/s11634-018-0318-1
