Abstract
This chapter builds on the description in Chapter 21 of the H-Tree algorithm for classifying streaming data, i.e. data which arrives (generally in large quantities) from some automatic process over a period of days, months, years or potentially forever. Chapter 21 was concerned with stationary data generated from a fixed causal model; Chapter 22 is concerned with data that is time-dependent, where the underlying model can change from time to time, perhaps seasonally. This phenomenon is known as concept drift.
The algorithm given here, CDH-Tree, is a variant of the popular CVFDT algorithm, which generates a type of decision tree called a Hoeffding Tree. The algorithm is described and explained in detail, with accompanying pseudocode for the benefit of readers who may be interested in developing their own implementations. A detailed example using synthetic data is given to illustrate how the classification tree evolves as more and more records are processed in the presence of concept drift.