Skip to main content
Erschienen in:
Buchtitelbild

2015 | OriginalPaper | Buchkapitel

A Parallel Approach for Decision Trees Learning from Big Data Streams

verfasst von : Ionel Tudor Calistru, Paul Cotofrei, Kilian Stoffel

Erschienen in: Business Information Systems

Verlag: Springer International Publishing

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

In this paper we introduce PdsCART, a parallel decision tree learning algorithm. There are three characteristics that are important to emphasize and make this algorithm particularly interesting. Firstly, the algorithm we present here can work with streaming data, i.e. one pass over data is sufficient to construct the tree. Secondly, the algorithm is able to process in parallel a larger amount of data stream records and can therefor handle efficiently very large data sets. And thirdly, the algorithm can be implemented in the MapReduce framework. Details about the algorithm and some basic performance results are presented.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Fußnoten
1
We are following very closely the description of dsCART algorithm done by Leszek Rutkowski, Maciej Jaworski, Lena Pietruczuk and Piotr Duda in [24].
 
2
The standard method of dividing the range of numerical attributes values into bins.
 
3
http://​archive.​ics.​uci.​edu/​ml - for simplicity we have considered only the numerical attributes.
 
Literatur
1.
Zurück zum Zitat Dean, J., Ghemawat, S.: MapReduce: Simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)CrossRef Dean, J., Ghemawat, S.: MapReduce: Simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)CrossRef
2.
Zurück zum Zitat Breiman, L., Friedman, J., Olshen, R., Stone, C.: Classification and Regression Trees. Chapman & Hall/CRC, New York (1984)MATH Breiman, L., Friedman, J., Olshen, R., Stone, C.: Classification and Regression Trees. Chapman & Hall/CRC, New York (1984)MATH
3.
Zurück zum Zitat Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco (1993) Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco (1993)
4.
Zurück zum Zitat Shafer, C., Agrawal, R., Mehta, M.: SPRINT: a scalable parallel classifier for data mining. In: Proceedings of the 22th International Conference on VLDB, pp. 544–555 (1996) Shafer, C., Agrawal, R., Mehta, M.: SPRINT: a scalable parallel classifier for data mining. In: Proceedings of the 22th International Conference on VLDB, pp. 544–555 (1996)
5.
Zurück zum Zitat Joshi, M., Karypis, G., Kumar, V.: ScalParC: a new scalable and efficient parallel classification algorithm for mining large datasets. In: Proceedings of the 12th International Parallel Processing Symposium, pp. 573–579 (1998) Joshi, M., Karypis, G., Kumar, V.: ScalParC: a new scalable and efficient parallel classification algorithm for mining large datasets. In: Proceedings of the 12th International Parallel Processing Symposium, pp. 573–579 (1998)
6.
Zurück zum Zitat Sreenivas, M., Alsabti, K., Ranka, S.: Parallel out-of-core divide-and-conquer techniques with applications to classification trees. In: The 10th Symposium on Parallel and Distributed Processing, pp. 555–562 (1999) Sreenivas, M., Alsabti, K., Ranka, S.: Parallel out-of-core divide-and-conquer techniques with applications to classification trees. In: The 10th Symposium on Parallel and Distributed Processing, pp. 555–562 (1999)
7.
Zurück zum Zitat Jin, R., Agrawal, G.: Communication and memory efficient parallel decision tree construction. In: Proceedings of the 3rd SIAM International Conference on Data Mining (SDM), pp. 119–129 SIAM, (2003) Jin, R., Agrawal, G.: Communication and memory efficient parallel decision tree construction. In: Proceedings of the 3rd SIAM International Conference on Data Mining (SDM), pp. 119–129 SIAM, (2003)
8.
Zurück zum Zitat Ben-Haim, Y., Tom-Tov, E.: A streaming parallel decision tree algorithm. J. Mach. Learn. Res. 11, 849–872 (2010)MATHMathSciNet Ben-Haim, Y., Tom-Tov, E.: A streaming parallel decision tree algorithm. J. Mach. Learn. Res. 11, 849–872 (2010)MATHMathSciNet
9.
Zurück zum Zitat Srivastava, A., Han, E., Kumar, V., Singh, V.: Parallel formulations of decision-tree classification algorithms. Data Min. Knowl. Discov. 3(3), 237–261 (1999)CrossRef Srivastava, A., Han, E., Kumar, V., Singh, V.: Parallel formulations of decision-tree classification algorithms. Data Min. Knowl. Discov. 3(3), 237–261 (1999)CrossRef
10.
Zurück zum Zitat Amado, N., Gama, J., Silva, F.: Parallel implementation of decision tree learning algorithms. In: Brazdil, P.B., Jorge, A.M. (eds.) EPIA 2001. LNCS (LNAI), vol. 2258, pp. 6–13. Springer, Heidelberg (2001) Amado, N., Gama, J., Silva, F.: Parallel implementation of decision tree learning algorithms. In: Brazdil, P.B., Jorge, A.M. (eds.) EPIA 2001. LNCS (LNAI), vol. 2258, pp. 6–13. Springer, Heidelberg (2001)
11.
Zurück zum Zitat Panda, B., Herbach, J., Basu, S., Bayardo, R.: PLANET Massively parallel learning of tree ensembles with MapReduce. In: Proceedings of VLDB-2009 (2009) Panda, B., Herbach, J., Basu, S., Bayardo, R.: PLANET Massively parallel learning of tree ensembles with MapReduce. In: Proceedings of VLDB-2009 (2009)
12.
Zurück zum Zitat Ye, J., Chow, J.-H., Chen, J., Zheng, Z.: Stochastic gradient boosted distributed decision trees. In: Proceedings of the 18th ACM Conference on Information and Knowledge Management, pp. 2061–2064 (2009) Ye, J., Chow, J.-H., Chen, J., Zheng, Z.: Stochastic gradient boosted distributed decision trees. In: Proceedings of the 18th ACM Conference on Information and Knowledge Management, pp. 2061–2064 (2009)
13.
Zurück zum Zitat Tyree, S., Weinberger, K.Q., Agrawal, K., Paykin, J.: Parallel boosted regression trees for web search ranking. In: Proceedings of the 20th International Conference on World Wide Web, pp. 387–396. ACM (2011) Tyree, S., Weinberger, K.Q., Agrawal, K., Paykin, J.: Parallel boosted regression trees for web search ranking. In: Proceedings of the 20th International Conference on World Wide Web, pp. 387–396. ACM (2011)
14.
Zurück zum Zitat Li, B., Chen, X., Li, M.J., Huang, J.Z., Feng, S.: Scalable random forests for massive data. In: Tan, P.-N., Chawla, S., Ho, C.K., Bailey, J. (eds.) PAKDD 2012, Part I. LNCS, vol. 7301, pp. 135–146. Springer, Heidelberg (2012) CrossRef Li, B., Chen, X., Li, M.J., Huang, J.Z., Feng, S.: Scalable random forests for massive data. In: Tan, P.-N., Chawla, S., Ho, C.K., Bailey, J. (eds.) PAKDD 2012, Part I. LNCS, vol. 7301, pp. 135–146. Springer, Heidelberg (2012) CrossRef
15.
Zurück zum Zitat Rutkowski, L., Jaworski, M., Pietruczuk, L., Duda, P.: A new method for data stream mining based on the misclassification error. IEEE Trans. Neural Netw. Learn. Syst. 26(5), 1048–1059 (2014)CrossRef Rutkowski, L., Jaworski, M., Pietruczuk, L., Duda, P.: A new method for data stream mining based on the misclassification error. IEEE Trans. Neural Netw. Learn. Syst. 26(5), 1048–1059 (2014)CrossRef
16.
Zurück zum Zitat Li, X., Barajas, J.M., Ding, Y.: Collaborative filtering on streaming data with interest-drifting. Intell. Data Anal. 11(1), 75–87 (2007) Li, X., Barajas, J.M., Ding, Y.: Collaborative filtering on streaming data with interest-drifting. Intell. Data Anal. 11(1), 75–87 (2007)
17.
Zurück zum Zitat Domingos, P., Hulten, G.: Mining high-speed data streams. In: Proceedings of the 6th ACM SIGKDD Conference, pp. 71–80 (2000) Domingos, P., Hulten, G.: Mining high-speed data streams. In: Proceedings of the 6th ACM SIGKDD Conference, pp. 71–80 (2000)
18.
Zurück zum Zitat Hulten, G., Spencer, L., Domingos, P.: Mining time-changing data streams. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 97–106 (2001) Hulten, G., Spencer, L., Domingos, P.: Mining time-changing data streams. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 97–106 (2001)
19.
Zurück zum Zitat Bifet, A., Holmes, G., Pfahringer, G., Kirkby, R., Gavalda, R.: New ensemble methods for evolving data streams. In: Proceedings of the 15th ACM SIGKDD International Conference Knowledge Discovery and Data Mining (2009) Bifet, A., Holmes, G., Pfahringer, G., Kirkby, R., Gavalda, R.: New ensemble methods for evolving data streams. In: Proceedings of the 15th ACM SIGKDD International Conference Knowledge Discovery and Data Mining (2009)
20.
Zurück zum Zitat Bifet, A., Holmes, G., Kirkby, R., Pfahringer, B.: DATA STREAM MINING: A Practical Approach. University of Waikato, New Zealand (2011) Bifet, A., Holmes, G., Kirkby, R., Pfahringer, B.: DATA STREAM MINING: A Practical Approach. University of Waikato, New Zealand (2011)
21.
22.
Zurück zum Zitat Rutkowski, L., Pietruczuk, L., Duda, P., Jaworski, M.: Decision trees for mining data streams based on the McDiarmid’s bound. IEEE Trans. Knowl. Data Eng. 25, 1272–1279 (2013)CrossRef Rutkowski, L., Pietruczuk, L., Duda, P., Jaworski, M.: Decision trees for mining data streams based on the McDiarmid’s bound. IEEE Trans. Knowl. Data Eng. 25, 1272–1279 (2013)CrossRef
23.
Zurück zum Zitat Rutkowski, L., Jaworski, M., Pietruczuk, L., Duda, P.: Decision trees for mining data streams based on the gaussian approximation. IEEE Trans. Knowl. Data Eng. 26, 108–119 (2014)CrossRef Rutkowski, L., Jaworski, M., Pietruczuk, L., Duda, P.: Decision trees for mining data streams based on the gaussian approximation. IEEE Trans. Knowl. Data Eng. 26, 108–119 (2014)CrossRef
24.
Zurück zum Zitat Rutkowski, L., Jaworski, M., Pietruczuk, L., Duda, P.: The CART decision tree for mining data streams. Inf. Sci. 266, 1–15 (2014)CrossRef Rutkowski, L., Jaworski, M., Pietruczuk, L., Duda, P.: The CART decision tree for mining data streams. Inf. Sci. 266, 1–15 (2014)CrossRef
25.
Zurück zum Zitat Bifet, A., Holmes, G., Kirkby, R., Pfahringer, B.: MOA: massive online analysis. J. Mach. Learn. Res. 11, 1601–1604 (2010) Bifet, A., Holmes, G., Kirkby, R., Pfahringer, B.: MOA: massive online analysis. J. Mach. Learn. Res. 11, 1601–1604 (2010)
Metadaten
Titel
A Parallel Approach for Decision Trees Learning from Big Data Streams
verfasst von
Ionel Tudor Calistru
Paul Cotofrei
Kilian Stoffel
Copyright-Jahr
2015
DOI
https://doi.org/10.1007/978-3-319-19027-3_1