Article

Discretization from data streams: applications to histograms and data mining

Authors:
João Gama

Univ. do Porto, Porto, Portugal

Univ. do Porto, Porto, Portugal
View Profile

,
Carlos Pinto

Univ. do Algarve, Porto, Portugal

Univ. do Algarve, Porto, Portugal
View Profile

SAC '06: Proceedings of the 2006 ACM symposium on Applied computingApril 2006Pages 662–667https://doi.org/10.1145/1141277.1141429

Published:23 April 2006Publication History

SAC '06: Proceedings of the 2006 ACM symposium on Applied computing

Pages 662–667

ABSTRACT

In this paper we propose a new method to perform incremental discretization. The basic idea is to perform the task in two layers. The first layer receives the sequence of input data and keeps some statistics on the data using many more intervals than required. Based on the statistics stored by the first layer, the second layer creates the final discretization. The proposed architecture processes streaming examples in a single scan, in constant time and space even for infinite sequences of examples. We experimentally demonstrate that incremental discretization is able to maintain the performance of learning algorithms in comparison to a batch discretization. The proposed method is much more appropriate in incremental learning, and in problems where data flows continuously, as in most of the recent data mining applications.

References

M. Afonso-Dias, J. Simoes, and C. Pinto. Gis/spatial analysis in fishery and aquatic sciences. In Proceedings 2th International Symposium on GIS/Spatial Analysis in Fishery and Aquatic Sciences, pages 323--340. Saitama, Japan, 2004.Google Scholar
C. L. Blake and C. J. Merz. UCI repository of machine learning databases, 1998.Google Scholar
H. Bock and E. Diday. Analysis of symbolic data: Exploratory Methods for Extracting Statistical Information from Complex Data. Springer Verlag, 2000. Google ScholarDigital Library
Marc Boulle. Khiops: A statistical discretization method of continuous attributes. Machine Learning, 55(1):53--69, 2004. Google ScholarDigital Library
B. Chiu, E. Keogh, and S. Lonardi. Probabilistic discovery of time series motifs. In Proceedings 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 493--498. ACM Press, 2003. Google ScholarDigital Library
Pedro Domingos and Michael J. Pazzani. On the optimality of the simple bayesian classifier under zero-one loss. Machine Learning, 29(2--3):103--130, 1997. Google ScholarDigital Library
James Dougherty, Ron Kohavi, and Mehran Sahami. Supervised and unsupervised discretization of continuous features. In Proceedings 12th International Conference on Machine Learning, pages 194--202. Morgan and Kaufmann, 1995.Google ScholarCross Ref
Anastasios Doulamis and Nikolaos Doulamis. Fuzzy histograms for efficient visual content representation: Application to content-based image retrieval. In IEEE International Conference on Multimedia and Expo (ICME'01), page 227. IEEE Press, 2001.Google Scholar
Tapio Elomaa and Juho Rousu. Necessary and suficient pre-processing in numerical range discretization. Knowledge and Information Systems (2003) 5: 162 182, 5:162--182, 2003. Google ScholarDigital Library
U. M. Fayyad and K. B Irani. Multi-interval discretization of continuous valued attributes for classification learning. In 13 International Joint Conference on Artificial Intelligence, pages 1022--1027. Morgan Kaufmann, 1993.Google Scholar
Usama M. Fayyad and Keki B. Irani. On the handling of continuous-valued attributes in decision tree generation. Machine Learning, 8:87--102, 1992. Google ScholarCross Ref
P. Gibbons, Y. Matias, and V. Poosala. Fast incremental maintenance of approximate histograms. ACM Transactions on Database Systems, 5:1--33, 2002. Google ScholarDigital Library
Raul Giraldez, Jesus S. Aguilar-Ruiz, Jose C. Riquelme, Francisco J. Ferrer-Troyano, and Domingo S. Rodriguez-Baena. Discretization oriented to decision rule generation. In 6th International Conference on Knowledge-Based Intelligent Information Engineering Systems, pages 275--279. IOS Press, 2002.Google Scholar
Sudipto Guha, Nick Koudas, and Kyuseok Shim. Data-streams and histograms. In STOC '01: Proceedings of the thirty-third annual ACM symposium on Theory of computing, pages 471--475. ACM Press, 2001. Google ScholarDigital Library
R. Kerber. Chimerge discretization of numeric attributes. In Proceeding of the 10th International Conference on Artificial Intelligence, pages 123--128, 1991.Google Scholar
M. J. Pazzani. An iterative improvement approach for the discretization of numeric attributes in bayesian classifiers. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining, 1995.Google ScholarDigital Library
D. D. Pestana and S. F. Velosa. Introdução à Probabilidade e à Estatística. Fundação Calouste Gulbenkian, 2002.Google Scholar
Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann Publishers, 1999. Google ScholarDigital Library
Ying Yang. Discretization for Naive-Bayes Learning. PhD thesis, School of Computer Science and Software Engineering of Monash University, July 2003.Google Scholar

Index Terms

Discretization from data streams: applications to histograms and data mining

Recommendations

Mining frequent patterns across multiple data streams
CIKM '11: Proceedings of the 20th ACM international conference on Information and knowledge management

Mining frequent patterns from data streams has drawn increasing attention in recent years. However, previous mining algorithms were all focused on a single data stream. In many emerging applications, it is of critical importance to combine multiple data ...
Read More
Mining Recent Frequent Itemsets in Data Streams
FSKD '08: Proceedings of the 2008 Fifth International Conference on Fuzzy Systems and Knowledge Discovery - Volume 04

Mining frequent itemsets in data streams is a hot research topic in recent years. Due to the continuous, high-speed and unbounded properties of data streams, traditional algorithms on static dataset are not suitable for mining in data streams. In this ...
Read More
Interactive mining of high utility patterns over data streams

High utility pattern (HUP) mining over data streams has become a challenging research issue in data mining. When a data stream flows through, the old information may not be interesting in the current time period. Therefore, incremental HUP mining is ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SAC '06: Proceedings of the 2006 ACM symposium on Applied computing
April 2006
1967 pages
ISBN:1595931082
DOI:10.1145/1141277
Conference Chair:
Hisham M. Haddad
Kennesaw State University, Kennesaw, Georgia
Copyright © 2006 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 23 April 2006
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate1,650of6,669submissions,25%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 69
  Total Citations
  View Citations
- 823
  Total Downloads
- Downloads (Last 12 months)26
- Downloads (Last 6 weeks)3
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Discretization from data streams: applications to histograms and data mining

SAC '06: Proceedings of the 2006 ACM symposium on Applied computing

ABSTRACT

References

Cited By

Index Terms

Recommendations

Mining frequent patterns across multiple data streams

Mining Recent Frequent Itemsets in Data Streams

Interactive mining of high utility patterns over data streams

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Qualifiers

Conference

Acceptance Rates

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Discretization from data streams: applications to histograms and data mining

SAC '06: Proceedings of the 2006 ACM symposium on Applied computing

ABSTRACT

References

Cited By

Index Terms

Recommendations

Mining frequent patterns across multiple data streams

Mining Recent Frequent Itemsets in Data Streams

Interactive mining of high utility patterns over data streams

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Qualifiers

Conference

Acceptance Rates

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media