2016 | OriginalPaper | Chapter

21. Classifying Streaming Data

Author: Prof. Max Bramer

Published in: Principles of Data Mining

Publisher: Springer London

Abstract

This chapter is concerned with the classification of streaming data, i.e. data which arrives (generally in large quantities) from some automatic process over a period of days, months, years or potentially forever.
Generating a classification tree for streaming data requires a different approach from the TDIDT algorithm described earlier in this book. The algorithm given here, H-Tree, is a variant of the popular VFDT algorithm, which generates a type of decision tree called a Hoeffding Tree. The algorithm is described and explained in detail, with accompanying pseudocode for the benefit of readers who may be interested in developing their own implementations. An example is given to illustrate a way of comparing the rules generated by H-Tree with those from TDIDT.
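
The chapter's own H-Tree pseudocode is not reproduced on this page. As a rough illustration of the idea behind a Hoeffding Tree, the sketch below shows the Hoeffding bound and the split test as they are usually stated for VFDT: a leaf is split once the observed advantage of the best attribute over the second best exceeds the bound. The function names, parameter names and numbers are hypothetical, and whether H-Tree applies the test in exactly this form is an assumption.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Hoeffding bound: epsilon = sqrt(R^2 * ln(1/delta) / (2n)).

    With probability at least 1 - delta, the mean of n independent
    observations of a variable with range R lies within epsilon of
    its true mean.
    """
    if n <= 0:
        raise ValueError("n must be a positive number of observations")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_best_gain: float,
                 value_range: float, delta: float, n: int) -> bool:
    """Hypothetical VFDT-style test for splitting a leaf node.

    Split when the gap between the two best attributes' scores is
    larger than the Hoeffding bound for the n records seen so far.
    """
    epsilon = hoeffding_bound(value_range, delta, n)
    return (best_gain - second_best_gain) > epsilon


# Illustrative values only: after 1000 records at a leaf, a gap of 0.2
# is large enough to split, whereas a gap of 0.05 is not.
print(should_split(0.5, 0.3, value_range=1.0, delta=1e-7, n=1000))
print(should_split(0.5, 0.45, value_range=1.0, delta=1e-7, n=1000))
```

Because the bound shrinks as n grows, a leaf that cannot be split yet may become splittable once more records have been sorted to it, which is what lets the tree be built incrementally from a stream.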

Footnotes
1
We distinguish between nodes which have or have not previously been split on an attribute. The former are called internal nodes; the latter are called leaf nodes. We will consider the root node not as a third type of node but as an internal node after it has been split on an attribute and a leaf node before that.
 
2
A note on notation. In this chapter, array elements are generally shown enclosed in square brackets, e.g. \(\textit{currentAtts}[2]\). However, an array containing a number of constant values will generally be denoted by those values separated by commas and enclosed in braces. So \(\textit{currentAtts}[2]\) is \(\{\textit{att1}, \textit{att2}, \textit{att3}, \textit{att5}, \textit{att6}, \textit{att7}\}\).
 
3
The row and column headings are provided to assist the reader only. The table itself has 3 rows and 3 columns.
 
4
Pseudocode fragments are provided for the benefit of readers who may be interested in developing their own implementations of the H-Tree algorithm. Other readers can safely ignore them.
 
5
As initially there are no other nodes, all incoming records will be sorted there.
 
6
In Figures 21.6, 21.8 and 21.9 we depart from our usual notation for trees and show the values that are in the classtotals array for each node.
 
7
Confusion matrices were described in Chapter 7.
 
8
For some practical applications, a tree with fewer leaf nodes that predicts the same, or almost the same, classifications as the complete TDIDT decision tree might be considered preferable, but we will not pursue that issue here.
 
Literature
[1]
Domingos, P., & Hulten, G. (2000). Mining high-speed data streams. In Proceedings of the sixth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 71–80). New York: ACM.
[2]
Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301), 13–30.
Metadata
Title: Classifying Streaming Data
Author: Prof. Max Bramer
Copyright Year: 2016
Publisher: Springer London
DOI: https://doi.org/10.1007/978-1-4471-7307-6_21