DOI: 10.1145/1401890.1401987
research-article

Categorizing and mining concept drifting data streams

Published: 24 August 2008

ABSTRACT

Mining concept drifting data streams is a defining challenge for data mining research. Recent years have seen a large body of work on detecting changes and building prediction models from stream data, but with only a vague understanding of the types of concept drifting and of how each type affects mining algorithms. In this paper, we first categorize concept drifting into two scenarios, Loose Concept Drifting (LCD) and Rigorous Concept Drifting (RCD), and then propose solutions to handle each of them separately. For LCD data streams, because concepts in adjacent data chunks are sufficiently close to each other, we apply the kernel mean matching (KMM) method to minimize the discrepancy between the data chunks in the kernel space. This minimization produces weighted instances, which are used to build a classifier ensemble that handles concept drifting data streams. For RCD data streams, because genuine concepts in adjacent data chunks may change randomly and rapidly, we propose a new Optimal Weights Adjustment (OWA) method to determine the optimum weight values for classifiers trained from the most recent (up-to-date) data chunk, such that those classifiers can form an accurate classifier ensemble to predict instances in the yet-to-come data chunk. Experiments on synthetic and real-world datasets show that the weighted instance approach is preferable when the concept drifting is mainly caused by changes in the class prior probability, whereas the weighted classifier approach is preferable when the concept drifting is mainly triggered by changes in the conditional probability.
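
To make the instance-weighting idea concrete, the following is a minimal sketch of kernel mean matching between two adjacent chunks. It is not the paper's implementation: the abstract gives no formulation, so the sketch assumes the commonly used KMM objective (minimize ½ βᵀKβ − κᵀβ with each weight bounded in [0, B]), drops the usual constraint on the sum of the weights, and replaces a quadratic-programming solver with plain projected gradient descent. The function names, the Gaussian kernel, the bandwidth `sigma`, the bound `upper`, and the toy one-dimensional chunks are all assumptions made for illustration.

```python
import math


def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian (RBF) kernel between two feature tuples."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def kmm_weights(train, test, sigma=1.0, upper=10.0, steps=2000):
    """Approximate KMM weights for `train` so its kernel mean matches `test`.

    Simplified sketch: box constraint 0 <= beta_i <= upper only,
    solved by projected gradient descent (no sum constraint, no QP solver).
    """
    n, m = len(train), len(test)
    # K[i][j] = k(x_i, x_j) over the older (training) chunk.
    K = [[gaussian_kernel(xi, xj, sigma) for xj in train] for xi in train]
    # kappa[i] = (n / m) * sum_j k(x_i, x'_j) over the newer chunk.
    kappa = [
        (n / m) * sum(gaussian_kernel(xi, xj, sigma) for xj in test)
        for xi in train
    ]
    beta = [1.0] * n
    # The Gaussian kernel has a unit diagonal, so trace(K) = n bounds the
    # largest eigenvalue of K and 1/n is a safe step size.
    step = 1.0 / n
    for _ in range(steps):
        grad = [
            sum(K[i][j] * beta[j] for j in range(n)) - kappa[i]
            for i in range(n)
        ]
        # Gradient step followed by projection onto the box [0, upper].
        beta = [min(upper, max(0.0, b - step * g)) for b, g in zip(beta, grad)]
    return beta


# Toy 1-D chunks: the newer chunk sits near the second half of the older one.
old_chunk = [(0.0,), (0.5,), (1.0,), (3.0,), (3.5,), (4.0,)]
new_chunk = [(2.8,), (3.2,), (3.6,), (4.1,)]
weights = kmm_weights(old_chunk, new_chunk)
for x, w in zip(old_chunk, weights):
    print(f"x={x[0]:.1f}  weight={w:.3f}")
```

In this toy setup, older instances that lie close to the newer chunk receive larger weights than those far from it. The resulting weights could then be passed to any learner that accepts per-instance weights when training the ensemble members.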


Published in

KDD '08: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2008
1116 pages
ISBN: 9781605581934
DOI: 10.1145/1401890
General Chair: Ying Li
Program Chairs: Bing Liu, Sunita Sarawagi

        Copyright © 2008 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Acceptance Rates

KDD '08 Paper Acceptance Rate: 118 of 593 submissions, 20%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%
