ABSTRACT
Mining concept drifting data streams is a defining challenge for data mining research. Recent years have seen a large body of work on detecting changes and building prediction models from stream data, but with only a vague understanding of the types of concept drifting and of how each type affects mining algorithms. In this paper, we first categorize concept drifting into two scenarios, Loose Concept Drifting (LCD) and Rigorous Concept Drifting (RCD), and then propose a separate solution for each. For LCD data streams, because the concepts in adjacent data chunks are sufficiently close to each other, we apply the kernel mean matching (KMM) method to minimize the discrepancy between data chunks in the kernel space. This minimization produces weighted instances, which are used to build a classifier ensemble that handles concept drifting data streams. For RCD data streams, because the genuine concepts in adjacent data chunks may change randomly and rapidly, we propose a new Optimal Weights Adjustment (OWA) method that determines the optimum weight values for classifiers trained from the most recent (up-to-date) data chunk, so that those classifiers form an accurate ensemble for predicting instances in the yet-to-come data chunk. Experiments on synthetic and real-world datasets show that the weighted-instance approach is preferable when the concept drifting is caused mainly by changes in the class prior probability, whereas the weighted-classifier approach is preferable when the concept drifting is triggered mainly by changes in the conditional probability.
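To make the instance-weighting idea concrete, here is a minimal sketch of kernel mean matching between two data chunks. It is not the procedure used in the paper, whose details the abstract does not give. It assumes a Gaussian kernel, keeps only the box constraint 0 <= beta_i <= B from the usual KMM quadratic program (the constraint on the sum of the weights is dropped for brevity), and replaces a QP solver with plain projected gradient descent. The names `rbf` and `kmm_weights` and the parameters `sigma`, `B`, and `steps` are hypothetical.

```python
import math


def rbf(a, b, sigma=1.0):
    """Gaussian kernel between two feature vectors (assumed kernel choice)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def kmm_weights(current_chunk, next_chunk, sigma=1.0, B=10.0, steps=500):
    """Weights beta for current_chunk that pull its kernel mean toward next_chunk.

    Minimizes 0.5 * beta' K beta - kappa' beta subject to 0 <= beta_i <= B
    by projected gradient descent (a simplification of the full KMM program).
    """
    n, m = len(current_chunk), len(next_chunk)
    # K[i][j]: kernel between two instances of the current chunk.
    K = [[rbf(xi, xj, sigma) for xj in current_chunk] for xi in current_chunk]
    # kappa[i]: scaled similarity of instance i to the whole next chunk.
    kappa = [n / m * sum(rbf(xi, z, sigma) for z in next_chunk)
             for xi in current_chunk]
    # K is positive semidefinite, so trace(K) bounds its largest eigenvalue
    # and 1 / trace(K) is a safe step size.
    step = 1.0 / sum(K[i][i] for i in range(n))
    beta = [1.0] * n
    for _ in range(steps):
        grad = [sum(K[i][j] * beta[j] for j in range(n)) - kappa[i]
                for i in range(n)]
        # Gradient step followed by projection back onto the box [0, B].
        beta = [min(B, max(0.0, b - step * g)) for b, g in zip(beta, grad)]
    return beta


# Toy one-dimensional chunks: the next chunk is shifted to the right.
current_chunk = [[-2.0], [-1.0], [0.0], [1.0], [2.0]]
next_chunk = [[1.0], [1.5], [2.0], [2.5]]
weights = kmm_weights(current_chunk, next_chunk)
print([round(w, 3) for w in weights])
```

Instances of the current chunk that lie close to the next chunk receive larger weights. Those weights could then be passed to any learner that accepts per-instance weights when training ensemble members.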