Published in: International Journal of Machine Learning and Cybernetics 4/2017

27-02-2016 | Original Article

Critical parameter analysis of Vertical Hoeffding Tree for optimized performance using SAMOA

Authors: Bakshi Rohit Prasad, Sonali Agarwal

Abstract

Streaming classification of big data is a stream data mining method that learns from continuous, ordered sequences of data arriving from diverse sources, using limited computing and storage resources. SAMOA (Scalable Advanced Massive Online Analysis) is a machine learning framework for distributed data mining over streaming data. The Vertical Hoeffding Tree (VHT) in SAMOA is a variant of the very fast decision tree used for distributed classification of data streams. The performance of VHT depends on several critical parameters, such as tie-threshold, grace value, confidence and split criterion. Although VHT is widely accepted as an efficient streaming classifier, the distribution of incoming data instances across the underlying classes varies from one dataset to another, and the performance of VHT varies accordingly. Achieving optimal performance from a stream classifier such as VHT on different datasets is therefore challenging, and no fixed set of critical parameter values can be preconfigured for all types of datasets. This research work explores the capabilities of the VHT streaming classifier of SAMOA in the light of benchmarking performance statistics such as classification accuracy, kappa and kappa temporal. It experimentally identifies suitable values of the critical parameters of VHT that yield optimized performance on different datasets. This analytical study thus supports the development of streaming classifiers that achieve optimum performance via parameter tuning at run time.
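
To make the role of confidence, tie-threshold and grace value concrete, the following minimal Python sketch shows the generic split test used by Hoeffding tree learners. It is an illustration under stated assumptions, not the VHT implementation in SAMOA: the function names, argument names and default values are hypothetical, and the value range of log2(number of classes) assumes information gain as the split criterion.

```python
import math


def hoeffding_bound(value_range, confidence, n):
    """Hoeffding bound: epsilon = sqrt(R^2 * ln(1/delta) / (2n)).

    value_range (R) is the range of the split measure, confidence (delta)
    is the allowed error probability, n is the number of instances seen.
    """
    return math.sqrt(value_range ** 2 * math.log(1.0 / confidence) / (2.0 * n))


def should_split(best_gain, second_gain, n_seen, n_since_last_check,
                 num_classes=2, confidence=1e-7,
                 tie_threshold=0.05, grace_period=200):
    """Decide whether a leaf should be split (hypothetical helper).

    All default values are illustrative assumptions, not values
    recommended by the article.
    """
    # Grace value: re-evaluate candidate splits only after enough new instances.
    if n_since_last_check < grace_period:
        return False
    # Assumption: information gain, whose range is log2(number of classes).
    value_range = math.log2(num_classes)
    eps = hoeffding_bound(value_range, confidence, n_seen)
    # Split when the best attribute is reliably better than the runner-up,
    # or when the bound is so tight that the two are effectively tied.
    return (best_gain - second_gain > eps) or (eps < tie_threshold)


if __name__ == "__main__":
    print(should_split(0.30, 0.05, n_seen=200, n_since_last_check=200))
    print(should_split(0.30, 0.25, n_seen=200, n_since_last_check=200))
    print(should_split(0.30, 0.25, n_seen=4000, n_since_last_check=200))
```

In this sketch a smaller confidence value widens the bound and delays splits, a larger tie-threshold forces earlier splits between near-equal attributes, and a larger grace value reduces how often the test is evaluated, which is why these settings interact with the class distribution of each dataset.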

Metadata
Title
Critical parameter analysis of Vertical Hoeffding Tree for optimized performance using SAMOA
Authors
Bakshi Rohit Prasad
Sonali Agarwal
Publication date
27-02-2016
Publisher
Springer Berlin Heidelberg
Published in
International Journal of Machine Learning and Cybernetics / Issue 4/2017
Print ISSN: 1868-8071
Electronic ISSN: 1868-808X
DOI
https://doi.org/10.1007/s13042-016-0513-3
