Suppressing model overfitting in mining concept-drifting data streams

Authors:
Haixun Wang (IBM T. J. Watson Research)
Jian Yin (IBM T. J. Watson Research)
Jian Pei (Simon Fraser University, Canada)
Philip S. Yu (IBM T. J. Watson Research)
Jeffrey Xu Yu (Chinese University of Hong Kong)

KDD '06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, August 2006, Pages 736–741. https://doi.org/10.1145/1150402.1150496

Published: 20 August 2006

ABSTRACT

Mining data streams of changing class distributions is important for real-time business decision support. The stream classifier must evolve to reflect the current class distribution. This poses a serious challenge. On the one hand, relying on historical data may increase the chances of learning obsolete models. On the other hand, learning only from the latest data may lead to biased classifiers, as the latest data is often an unrepresentative sample of the current class distribution. The problem is particularly acute in classifying rare events, when, for example, instances of the rare class do not even show up in the most recent training data. In this paper, we use a stochastic model to describe the concept shifting patterns and formulate this problem as an optimization one: from the historical and the current training data that we have observed, find the most-likely current distribution, and learn a classifier based on the most-likely distribution. We derive an analytic solution and approximate this solution with an efficient algorithm, which calibrates the influence of historical data carefully to create an accurate classifier. We evaluate our algorithm with both synthetic and real-world datasets. Our results show that our algorithm produces accurate and efficient classification.
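
The trade-off described above, between stale history and an unrepresentative latest batch, can be pictured with a small sketch. This is not the authors' stochastic model or their analytic solution; it is a plain exponential-decay heuristic, and the names `estimate_class_priors`, `decay`, `smoothing`, and the batch layout are assumptions made for illustration only.

```python
from collections import Counter


def estimate_class_priors(batches, decay=0.5, smoothing=1.0):
    """Blend label counts from a sequence of batches, oldest first.

    Hypothetical helper (not from the paper): each batch is down-weighted
    by `decay` per step of age, so history still contributes when the
    newest batch contains no examples of a rare class.
    """
    classes = sorted({label for batch in batches for label in batch})
    # Additive smoothing keeps every observed class at a nonzero weight.
    weighted = {c: smoothing for c in classes}
    newest = len(batches) - 1
    for index, batch in enumerate(batches):
        weight = decay ** (newest - index)  # newest batch gets weight 1
        for label, count in Counter(batch).items():
            weighted[label] += weight * count
    total = sum(weighted.values())
    return {c: weighted[c] / total for c in classes}


# Illustrative data: the rare class is absent from the latest batch.
history = [
    ["normal"] * 98 + ["fraud"] * 2,
    ["normal"] * 97 + ["fraud"] * 3,
    ["normal"] * 100,
]
priors = estimate_class_priors(history, decay=0.5)
latest_only = estimate_class_priors(history[-1:], decay=0.5)
```

Here `latest_only` never sees the rare class at all, while `priors` keeps a small, nonzero estimate for it. The fixed `decay` is the crude part of this sketch: the abstract describes calibrating the influence of historical data from a model of how concepts shift, rather than fixing it by hand.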

Index Terms

Suppressing model overfitting in mining concept-drifting data streams
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Supervised learning by classification
    2. Machine learning approaches
      1. Classification and regression trees
2. Information systems
  1. Information systems applications
    1. Data mining


Published in
KDD '06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2006
986 pages
ISBN: 1595933395
DOI: 10.1145/1150402
Conference Chair: Tina Eliassi-Rad (LLNL)
General Chair: Lyle Ungar (University of Pennsylvania)
Program Chairs: Mark Craven (University of Wisconsin), Dimitrios Gunopulos (University of California, Riverside)
Copyright © 2006 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
classifier
classifier ensemble
concept drift
data streams
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%