ABSTRACT
Mining data streams of changing class distributions is important for real-time business decision support. The stream classifier must evolve to reflect the current class distribution. This poses a serious challenge. On the one hand, relying on historical data may increase the chances of learning obsolete models. On the other hand, learning only from the latest data may lead to biased classifiers, as the latest data is often an unrepresentative sample of the current class distribution. The problem is particularly acute in classifying rare events, when, for example, instances of the rare class do not even show up in the most recent training data. In this paper, we use a stochastic model to describe the concept shifting patterns and formulate this problem as an optimization one: from the historical and the current training data that we have observed, find the most-likely current distribution, and learn a classifier based on the most-likely distribution. We derive an analytic solution and approximate this solution with an efficient algorithm, which calibrates the influence of historical data carefully to create an accurate classifier. We evaluate our algorithm with both synthetic and real-world datasets. Our results show that our algorithm produces accurate and efficient classification.
- B. Babcock, S. Babu, M. Datar, R. Motawani, and J. Widom. Models and issues in data stream systems. In PODS, 2002. Google ScholarDigital Library
- Eric Bauer and Ron Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36(1-2):105--139, 1999. Google ScholarDigital Library
- Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang. Multi-dimensional regression analysis of time-series data streams. In VLDB, Hongkong, China, 2002. Google ScholarDigital Library
- Yun Chi, Haixun Wang, Philip S. Yu, and Richard R. Muntz. Moment: Maintaining closed frequent itemsets over a stream sliding window data streams. In ICDM, 2004. Google ScholarDigital Library
- Yun Chi, Philip S. Yu, Haixun Wang, and Richard Muntz. Loadstar: A load shedding scheme for classifying data streams. In SIAM Data Mining, 2005.Google ScholarCross Ref
- Graham Cormode and S. Muthukrishnan. Summarizing and mining skewed data streams. In SDM, 2005.Google ScholarCross Ref
- P. Domingos and G. Hulten. Mining high-speed data streams. In SIGKDD, pages 71--80, Boston, MA, 2000. ACM Press. Google ScholarDigital Library
- Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In ICML, pages 148--156, 1996.Google ScholarDigital Library
- L. Gao and X. Wang. Continually evaluating similarity-based pattern queries on a streaming time series. In SIGMOD, Madison, Wisconsin, June 2002. Google ScholarDigital Library
- J. Gehrke, V. Ganti, R. Ramakrishnan, and W. Loh. BOAT - optimistic decision tree construction. In SIGMOD, 1999. Google ScholarDigital Library
- M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In SIGMOD, pages 58--66, Santa Barbara, CA, May 2001. Google ScholarDigital Library
- S. Guha, N. Milshra, R. Motwani, and L. O'Callaghan. Clustering data streams. In FOCS, pages 359--366, 2000. Google ScholarDigital Library
- Sudipto Guha and Boulos Harb. Wavelet synopsis for data streams: minimizing non-euclidean error. In KDD, pages 88--97, 2005. Google ScholarDigital Library
- G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In SIGKDD, pages 97--106, San Francisco, CA, 2001. ACM Press. Google ScholarDigital Library
- Lawrence R. Rabiner. A tutorial on hidden markov models and selected applications in speech recognition. In Readings in speech recognition, pages 267--296. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1990. Google ScholarCross Ref
- W. Nick Street and YongSeog Kim. A streaming ensemble algorithm (SEA) for large-scale classification. In SIGKDD, 2001. Google ScholarDigital Library
- Haixun Wang, Wei Fan, Philip S. Yu, and Jiawei Han. Mining concept-drifting data streams using ensemble classifiers. In SIGKDD, 2003. Google ScholarDigital Library
- Peng Wang, Haixun Wang, Xiaochen Wu, Wei Wang, and Baile Shi .On reducing classifier granularity in mining concept-drifting data streams. In ICDM, 2005. Google ScholarDigital Library
- Ying Yang, Xindong Wu, and Xingquan Zhu. Combining proactive and reactive predictions for data streams. In SIGKDD, pages 710--715, 2005. Google ScholarDigital Library
Index Terms
- Suppressing model overfitting in mining concept-drifting data streams
Recommendations
Mining concept-drifting data streams using ensemble classifiers
KDD '03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data miningRecently, mining data streams with concept drifts for actionable insights has become an important and challenging task for a wide range of applications including credit card fraud protection, target marketing, network intrusion detection, etc. ...
Mining Concept-Drifting and Noisy Data Streams Using Ensemble Classifiers
AICI '09: Proceedings of the 2009 International Conference on Artificial Intelligence and Computational Intelligence - Volume 04Mining concept drifting data stream is a challenging area for data mining research. Recent years have witnessed an averaging ensemble classifier which is based on the learnable assumption, although this ensemble classifier is an efficient algorithm for ...
An adaptive ensemble classifier for mining concept drifting data streams
It is challenging to use traditional data mining techniques to deal with real-time data stream classifications. Existing mining classifiers need to be updated frequently to adapt to the changes in data streams. To address this issue, in this paper we ...
Comments