Skip to main content
Top
Published in: Annals of Data Science 3/2021

06-06-2019

Analytical Split Value Calculation for Numerical Attributes in Hoeffding Trees with Misclassification-Based Impurity

Authors: Mehran Mirkhan, Maryam Amir Haeri, Mohammad Reza Meybodi

Published in: Annals of Data Science | Issue 3/2021

Log in

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

Hoeffding tree is a method to incrementally build decision trees. A common approach to handle numerical attributes in Hoeffding trees is to represent their sufficient statistics as Gaussian distributions. Our contribution in this paper is to prove that by using Gaussian distribution as sufficient statistics and misclassification error as impurity measure, there is an analytical method to exactly calculate the best splitting values. Three different approaches for using this theorem are proposed and all three are tested on both synthetic and real datasets. The experiments suggest that this approach can create smaller trees and learn faster and achieve higher accuracy in most problems.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literature
1.
go back to reference Tidke B, Mehta R (2018) A comprehensive review and open challenges of stream big data. In: Pant M, Ray K, Sharma TK, Rawat S, Bandyopadhyay A (eds) Soft computing: theories and applications. Springer, Berlin, pp 89–99CrossRef Tidke B, Mehta R (2018) A comprehensive review and open challenges of stream big data. In: Pant M, Ray K, Sharma TK, Rawat S, Bandyopadhyay A (eds) Soft computing: theories and applications. Springer, Berlin, pp 89–99CrossRef
2.
go back to reference Crawford SL (1989) Extensions to the cart algorithm. Int J Man-Mach Stud 31(2):197–217CrossRef Crawford SL (1989) Extensions to the cart algorithm. Int J Man-Mach Stud 31(2):197–217CrossRef
3.
go back to reference Utgoff PE, Berkman NC, Clouse JA (1997) Decision tree induction based on efficient tree restructuring. Mach Learn 29(1):5–44CrossRef Utgoff PE, Berkman NC, Clouse JA (1997) Decision tree induction based on efficient tree restructuring. Mach Learn 29(1):5–44CrossRef
4.
go back to reference Domingos P, Hulten G (2000) Mining high-speed data streams. In: Proceedings of the 6th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 71–80 Domingos P, Hulten G (2000) Mining high-speed data streams. In: Proceedings of the 6th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 71–80
5.
go back to reference Hulten G, Spencer L, Domingos P (2001) Mining time-changing data streams. In: Proceedings of the 7th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 97–106 Hulten G, Spencer L, Domingos P (2001) Mining time-changing data streams. In: Proceedings of the 7th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 97–106
6.
go back to reference Zhang P, Zhou C, Wang P, Gao BJ, Zhu X, Guo L (2015) E-tree: an efficient indexing structure for ensemble models on data streams. IEEE Trans Knowl Data Eng 27(2):461–474CrossRef Zhang P, Zhou C, Wang P, Gao BJ, Zhu X, Guo L (2015) E-tree: an efficient indexing structure for ensemble models on data streams. IEEE Trans Knowl Data Eng 27(2):461–474CrossRef
7.
go back to reference Gomes HM, Bifet A, Read J, Barddal JP, Enembreck F, Pfharinger B, Holmes G, Abdessalem T (2017) Adaptive random forests for evolving data stream classification. Mach Learn 106(9–10):1469–1495CrossRef Gomes HM, Bifet A, Read J, Barddal JP, Enembreck F, Pfharinger B, Holmes G, Abdessalem T (2017) Adaptive random forests for evolving data stream classification. Mach Learn 106(9–10):1469–1495CrossRef
8.
go back to reference Bifet A, Zhang J, Fan W, He C, Zhang J, Qian J, Holmes G, Pfahringer B (2017) Extremely fast decision tree mining for evolving data streams. In: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 1733–1742 Bifet A, Zhang J, Fan W, He C, Zhang J, Qian J, Holmes G, Pfahringer B (2017) Extremely fast decision tree mining for evolving data streams. In: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 1733–1742
9.
go back to reference Segatori A, Marcelloni F, Pedrycz W (2018) On distributed fuzzy decision trees for big data. IEEE Trans Fuzzy Syst 26(1):174–192CrossRef Segatori A, Marcelloni F, Pedrycz W (2018) On distributed fuzzy decision trees for big data. IEEE Trans Fuzzy Syst 26(1):174–192CrossRef
10.
go back to reference Lo W-T, Chang Y-S, Sheu R-K, Chiu C-C, Yuan S-M (2014) Cudt: a cuda based decision tree algorithm. Sci World J 2014 Article ID 745640 Lo W-T, Chang Y-S, Sheu R-K, Chiu C-C, Yuan S-M (2014) Cudt: a cuda based decision tree algorithm. Sci World J 2014 Article ID 745640
11.
go back to reference De Rosa R, Cesa-Bianchi N (2015) Splitting with confidence in decision trees with application to stream mining. In: 2015 international joint conference on neural networks (IJCNN). IEEE, pp 1–8 De Rosa R, Cesa-Bianchi N (2015) Splitting with confidence in decision trees with application to stream mining. In: 2015 international joint conference on neural networks (IJCNN). IEEE, pp 1–8
12.
go back to reference Rutkowski L, Jaworski M, Pietruczuk L, Duda P (2015) A new method for data stream mining based on the misclassification error. IEEE Trans Neural Netw Learn Syst 26(5):1048–1059CrossRef Rutkowski L, Jaworski M, Pietruczuk L, Duda P (2015) A new method for data stream mining based on the misclassification error. IEEE Trans Neural Netw Learn Syst 26(5):1048–1059CrossRef
13.
go back to reference Jaworski M, Duda P, Rutkowski L (2018) New splitting criteria for decision trees in stationary data streams. IEEE Trans Neural Netw Learn Syst 29(6):2516–2529CrossRef Jaworski M, Duda P, Rutkowski L (2018) New splitting criteria for decision trees in stationary data streams. IEEE Trans Neural Netw Learn Syst 29(6):2516–2529CrossRef
14.
go back to reference Dougherty J, Kohavi R, Sahami M (1995) Supervised and unsupervised discretization of continuous features. In: Machine learning proceedings. Elsevier, pp 194–202 Dougherty J, Kohavi R, Sahami M (1995) Supervised and unsupervised discretization of continuous features. In: Machine learning proceedings. Elsevier, pp 194–202
15.
go back to reference Quinlan JR (1993) C4.5: Programming for machine learning. Morgan Kauffmann 38:48 Quinlan JR (1993) C4.5: Programming for machine learning. Morgan Kauffmann 38:48
16.
go back to reference Ramírez-Gallego S, García S, Mouriño-Talín H, Martínez-Rego D, Bolón-Canedo V, Alonso-Betanzos A, Benítez JM, Herrera F (2016) Data discretization: taxonomy and big data challenge. Wiley Interdiscip Rev Data Min Knowl Discov 6(1):5–21CrossRef Ramírez-Gallego S, García S, Mouriño-Talín H, Martínez-Rego D, Bolón-Canedo V, Alonso-Betanzos A, Benítez JM, Herrera F (2016) Data discretization: taxonomy and big data challenge. Wiley Interdiscip Rev Data Min Knowl Discov 6(1):5–21CrossRef
17.
go back to reference Gama J, Pinto C (2006) Discretization from data streams: applications to histograms and data mining. In: Proceedings of the 2006 ACM symposium on applied computing. ACM, pp 662–667 Gama J, Pinto C (2006) Discretization from data streams: applications to histograms and data mining. In: Proceedings of the 2006 ACM symposium on applied computing. ACM, pp 662–667
18.
go back to reference Kirkby RB (2007) Improving hoeffding trees. Ph.D. dissertation, The University of Waikato Kirkby RB (2007) Improving hoeffding trees. Ph.D. dissertation, The University of Waikato
19.
go back to reference Salperwyck C, Lemaire V (2013) Incremental decision tree based on order statistics. In: The 2013 international joint conference on neural networks (IJCNN). IEEE, pp 1–8 Salperwyck C, Lemaire V (2013) Incremental decision tree based on order statistics. In: The 2013 international joint conference on neural networks (IJCNN). IEEE, pp 1–8
20.
go back to reference Zhang Y, Cheung Y-M (2014) Discretizing numerical attributes in decision tree for big data analysis. In: 2014 IEEE international conference on data mining workshop. IEEE, pp 1150–1157 Zhang Y, Cheung Y-M (2014) Discretizing numerical attributes in decision tree for big data analysis. In: 2014 IEEE international conference on data mining workshop. IEEE, pp 1150–1157
21.
go back to reference Jadhav SA, Kosbatwar S (2016) Concept-adapting very fast decision tree with misclassification error. Int J Adv Res Comput Eng Technol (IJARCET) 5(6):1763–1767 Jadhav SA, Kosbatwar S (2016) Concept-adapting very fast decision tree with misclassification error. Int J Adv Res Comput Eng Technol (IJARCET) 5(6):1763–1767
22.
go back to reference Krawczyk B, Wozniak M (2015) Weighted naive bayes classifier with forgetting for drifting data streams. In: 2015 IEEE international conference on systems, man, and cybernetics. IEEE, pp 2147–2152 Krawczyk B, Wozniak M (2015) Weighted naive bayes classifier with forgetting for drifting data streams. In: 2015 IEEE international conference on systems, man, and cybernetics. IEEE, pp 2147–2152
23.
go back to reference Rutkowski L, Pietruczuk L, Duda P, Jaworski M (2013) Decision trees for mining data streams based on the Mcdiarmid’s bound. IEEE Trans Knowl Data Eng 25(6):1272–1279CrossRef Rutkowski L, Pietruczuk L, Duda P, Jaworski M (2013) Decision trees for mining data streams based on the Mcdiarmid’s bound. IEEE Trans Knowl Data Eng 25(6):1272–1279CrossRef
Metadata
Title
Analytical Split Value Calculation for Numerical Attributes in Hoeffding Trees with Misclassification-Based Impurity
Authors
Mehran Mirkhan
Maryam Amir Haeri
Mohammad Reza Meybodi
Publication date
06-06-2019
Publisher
Springer Berlin Heidelberg
Published in
Annals of Data Science / Issue 3/2021
Print ISSN: 2198-5804
Electronic ISSN: 2198-5812
DOI
https://doi.org/10.1007/s40745-019-00225-4

Other articles of this Issue 3/2021

Annals of Data Science 3/2021 Go to the issue