ABSTRACT
In this paper we propose a new method for incremental discretization. The basic idea is to perform the task in two layers. The first layer receives the sequence of input data and keeps statistics on the data using many more intervals than required. The second layer creates the final discretization from the statistics stored by the first layer. The proposed architecture processes streaming examples in a single scan, in constant time and space, even for infinite sequences of examples. We experimentally demonstrate that incremental discretization maintains the performance of learning algorithms compared with batch discretization. The proposed method is much more appropriate for incremental learning and for problems where data flows continuously, as in most recent data mining applications.
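The two-layer idea can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes the value range is known in advance, uses equal-width fine bins in the first layer, and builds approximately equal-frequency intervals in the second layer. The names `Layer1`, `layer2_equal_frequency`, and `discretize` are hypothetical and exist only for this illustration.

```python
from bisect import bisect_right


class Layer1:
    """First layer: fine-grained counts over a fixed range (assumed known in advance)."""

    def __init__(self, low, high, n_bins):
        self.low = low
        self.n_bins = n_bins
        self.width = (high - low) / n_bins
        self.counts = [0] * n_bins
        self.total = 0

    def update(self, x):
        # Constant-time update: compute the bin index directly and clamp outliers.
        idx = int((x - self.low) / self.width)
        idx = min(max(idx, 0), self.n_bins - 1)
        self.counts[idx] += 1
        self.total += 1


def layer2_equal_frequency(layer1, k):
    """Second layer: derive at most k-1 cut points from the layer-1 counts only."""
    if layer1.total == 0:
        return []
    cuts = []
    cumulative = 0
    next_quantile = 1
    for i, count in enumerate(layer1.counts):
        cumulative += count
        crossed = False
        # Advance past every quantile target reached inside this fine bin.
        while next_quantile < k and cumulative >= next_quantile * layer1.total / k:
            next_quantile += 1
            crossed = True
        if crossed:
            # Place one cut at the right edge of the fine bin.
            cuts.append(layer1.low + (i + 1) * layer1.width)
    return cuts


def discretize(x, cuts):
    """Map a value to the index of its final interval."""
    return bisect_right(cuts, x)


if __name__ == "__main__":
    layer1 = Layer1(low=0.0, high=100.0, n_bins=50)
    for value in range(100):  # stand-in for a data stream
        layer1.update(value)
    cuts = layer2_equal_frequency(layer1, k=4)
    print(cuts, discretize(10, cuts), discretize(99, cuts))
```

In this sketch the first layer does constant work per example and uses memory proportional to the number of fine bins, independent of the stream length. The second layer never revisits the raw data, so it can be rerun at any time with a different number of final intervals.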
Index Terms
- Discretization from data streams: applications to histograms and data mining