skip to main content
10.1145/1141277.1141429acmconferencesArticle/Chapter ViewAbstractPublication PagessacConference Proceedingsconference-collections
Article

Discretization from data streams: applications to histograms and data mining

Published:23 April 2006Publication History

ABSTRACT

In this paper we propose a new method to perform incremental discretization. The basic idea is to perform the task in two layers. The first layer receives the sequence of input data and keeps some statistics on the data using many more intervals than required. Based on the statistics stored by the first layer, the second layer creates the final discretization. The proposed architecture processes streaming examples in a single scan, in constant time and space even for infinite sequences of examples. We experimentally demonstrate that incremental discretization is able to maintain the performance of learning algorithms in comparison to a batch discretization. The proposed method is much more appropriate in incremental learning, and in problems where data flows continuously, as in most of the recent data mining applications.

References

  1. M. Afonso-Dias, J. Simoes, and C. Pinto. Gis/spatial analysis in fishery and aquatic sciences. In Proceedings 2th International Symposium on GIS/Spatial Analysis in Fishery and Aquatic Sciences, pages 323--340. Saitama, Japan, 2004.Google ScholarGoogle Scholar
  2. C. L. Blake and C. J. Merz. UCI repository of machine learning databases, 1998.Google ScholarGoogle Scholar
  3. H. Bock and E. Diday. Analysis of symbolic data: Exploratory Methods for Extracting Statistical Information from Complex Data. Springer Verlag, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Marc Boulle. Khiops: A statistical discretization method of continuous attributes. Machine Learning, 55(1):53--69, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. B. Chiu, E. Keogh, and S. Lonardi. Probabilistic discovery of time series motifs. In Proceedings 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 493--498. ACM Press, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Pedro Domingos and Michael J. Pazzani. On the optimality of the simple bayesian classifier under zero-one loss. Machine Learning, 29(2--3):103--130, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. James Dougherty, Ron Kohavi, and Mehran Sahami. Supervised and unsupervised discretization of continuous features. In Proceedings 12th International Conference on Machine Learning, pages 194--202. Morgan and Kaufmann, 1995.Google ScholarGoogle ScholarCross RefCross Ref
  8. Anastasios Doulamis and Nikolaos Doulamis. Fuzzy histograms for efficient visual content representation: Application to content-based image retrieval. In IEEE International Conference on Multimedia and Expo (ICME'01), page 227. IEEE Press, 2001.Google ScholarGoogle Scholar
  9. Tapio Elomaa and Juho Rousu. Necessary and suficient pre-processing in numerical range discretization. Knowledge and Information Systems (2003) 5: 162 182, 5:162--182, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. U. M. Fayyad and K. B Irani. Multi-interval discretization of continuous valued attributes for classification learning. In 13 International Joint Conference on Artificial Intelligence, pages 1022--1027. Morgan Kaufmann, 1993.Google ScholarGoogle Scholar
  11. Usama M. Fayyad and Keki B. Irani. On the handling of continuous-valued attributes in decision tree generation. Machine Learning, 8:87--102, 1992. Google ScholarGoogle ScholarCross RefCross Ref
  12. P. Gibbons, Y. Matias, and V. Poosala. Fast incremental maintenance of approximate histograms. ACM Transactions on Database Systems, 5:1--33, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Raul Giraldez, Jesus S. Aguilar-Ruiz, Jose C. Riquelme, Francisco J. Ferrer-Troyano, and Domingo S. Rodriguez-Baena. Discretization oriented to decision rule generation. In 6th International Conference on Knowledge-Based Intelligent Information Engineering Systems, pages 275--279. IOS Press, 2002.Google ScholarGoogle Scholar
  14. Sudipto Guha, Nick Koudas, and Kyuseok Shim. Data-streams and histograms. In STOC '01: Proceedings of the thirty-third annual ACM symposium on Theory of computing, pages 471--475. ACM Press, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. R. Kerber. Chimerge discretization of numeric attributes. In Proceeding of the 10th International Conference on Artificial Intelligence, pages 123--128, 1991.Google ScholarGoogle Scholar
  16. M. J. Pazzani. An iterative improvement approach for the discretization of numeric attributes in bayesian classifiers. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining, 1995.Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. D. D. Pestana and S. F. Velosa. Introdução à Probabilidade e à Estatística. Fundação Calouste Gulbenkian, 2002.Google ScholarGoogle Scholar
  18. Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann Publishers, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Ying Yang. Discretization for Naive-Bayes Learning. PhD thesis, School of Computer Science and Software Engineering of Monash University, July 2003.Google ScholarGoogle Scholar

Index Terms

  1. Discretization from data streams: applications to histograms and data mining

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in
        • Published in

          cover image ACM Conferences
          SAC '06: Proceedings of the 2006 ACM symposium on Applied computing
          April 2006
          1967 pages
          ISBN:1595931082
          DOI:10.1145/1141277

          Copyright © 2006 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 23 April 2006

          Permissions

          Request permissions about this article.

          Request Permissions

          Check for updates

          Qualifiers

          • Article

          Acceptance Rates

          Overall Acceptance Rate1,650of6,669submissions,25%

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader