A discretization algorithm based on Class-Attribute Contingency Coefficient
Introduction
With the rapid development of information technology, electronic storage devices are widely used to record transactions. Since people are often unable to extract useful knowledge from such huge datasets, data mining [16] has become a research focus in recent years. Among the several functions of data mining, classification is crucially important and has been applied successfully to several areas such as automatic text summarization and categorization [17], [38], image classification [15], and virus detection of new malicious emails [31]. Although real-world data mining tasks often involve continuous attributes, some classification algorithms such as AQ [18], [26], CLIP [6], [7] and CN2 [8] can only handle categorical attributes, while others can handle continuous attributes but perform better on categorical attributes [36]. To deal with this problem, many discretization algorithms have been proposed [11], [12], [22], [28].
Discretization is a technique that partitions a continuous attribute into a finite set of adjacent intervals in order to generate an attribute with a small number of distinct values. Assume a dataset consists of M examples and S target classes. A discretization algorithm discretizes a continuous attribute A in this dataset into n discrete intervals {[d0, d1], (d1, d2], … , (dn−1, dn]}, where d0 is the minimal value and dn is the maximal value of attribute A. Such a result {[d0, d1], (d1, d2], … , (dn−1, dn]} is called a discretization scheme D on attribute A. This discretization scheme should keep the interdependency between the discretized attribute and the target class high, so as to avoid changing the distribution of the original data [2], [25], [33].
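Concretely, once the inner cut-points d1, …, dn−1 of a scheme D are known, mapping each value of A to its interval index is a simple ordered search. The sketch below (the function name and list representation are our own, not from the paper) uses the left-open, right-closed intervals of the scheme above:

```python
from bisect import bisect_left

def discretize(values, cut_points):
    """Map each continuous value to its interval index under the scheme
    {[d0, d1], (d1, d2], ..., (d(n-1), dn]}.

    cut_points holds the sorted inner boundaries d1, ..., d(n-1); a value v
    falls into interval i when d(i) < v <= d(i+1), so a value equal to a
    boundary belongs to the interval on its left.
    """
    return [bisect_left(cut_points, v) for v in values]

# e.g. with inner boundaries 5 and 10, the scheme has three intervals:
# [d0, 5], (5, 10], (10, dn]
print(discretize([3, 5, 7, 10, 12], [5, 10]))  # → [0, 0, 1, 1, 2]
```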
Discretization is usually performed prior to the learning process and has played an important role in data mining and knowledge discovery. Modern classification systems such as CLIP4 [7] have also implemented discretization algorithms as built-in functions. A good discretization algorithm not only produces a concise summarization of continuous attributes that helps experts and users understand the data more easily, but also makes learning more accurate and faster [24]. There are five axes along which the proposed discretization algorithms can be classified [24]: supervised versus unsupervised, static versus dynamic, global versus local, top-down (splitting) versus bottom-up (merging), and direct versus incremental.
- 1. Supervised methods discretize attributes with consideration of class information, while unsupervised methods do not.
- 2. Dynamic methods consider the interdependence among the attributes and discretize continuous attributes while a classifier is being built. In contrast, static methods consider attributes in isolation and complete the discretization prior to the learning task.
- 3. Global methods, which use all instances to generate the discretization scheme, are usually associated with static methods. In contrast, local methods are usually associated with dynamic approaches, in which only part of the instances are used for discretization.
- 4. Bottom-up methods start with the complete list of all continuous values of the attribute as cut-points and remove some of them by merging intervals in each step. Top-down methods start with an empty list of cut-points and add new ones in each step.
- 5. Direct methods, such as Equal Width and Equal Frequency [5], require users to decide on the number of intervals k and then discretize the continuous attribute into k intervals at once. In contrast, incremental methods begin with a simple discretization scheme and pass through a refinement process, typically requiring a stopping criterion to terminate the discretization.
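As an illustration of the two direct methods named above, their cut-points can be computed in a few lines. This is a minimal sketch (the function names are ours, and with tied values Equal Frequency yields only approximately equal-sized intervals):

```python
def equal_width_cuts(values, k):
    """Inner cut-points for k equal-width intervals over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def equal_freq_cuts(values, k):
    """Inner cut-points for k (approximately) equal-frequency intervals."""
    s = sorted(values)
    n = len(s)
    return [s[i * n // k] for i in range(1, k)]

print(equal_width_cuts([0, 2, 4, 6, 8, 10], 2))  # → [5.0]
print(equal_freq_cuts([1, 2, 3, 4, 5, 6], 3))    # → [3, 5]
```

Both are unsupervised: the class labels play no role, which is precisely why supervised criteria were later introduced.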
In recent years, many researchers have focused on developing dynamic discretization algorithms for particular learning algorithms. For example, Berzal et al. [14] built multi-way decision trees by using a dynamic discretization method in each internal node to reduce the size of the resulting trees. Their experiments showed that the accuracy of these compact decision trees was also preserved. Wu et al. [36] defined a distributional index and then proposed a dynamic discretization algorithm to enhance the accuracy of naïve Bayes classifiers. However, the advantage of static approaches over dynamic approaches is their independence from the learning algorithm [24]. In other words, a dataset discretized by a static discretization algorithm can be used by any classification algorithm that deals with discrete attributes. Besides, since a bottom-up method starts with the complete list of all continuous values of the attribute as cut-points and then removes some of them by merging intervals in each step, its computational complexity is usually worse than that of a top-down method. For example, the time complexity for discretizing a single attribute in Extended Chi2, the newest bottom-up method, is O(km log m) [33], while that of the newest top-down method, CAIM, is O(m log m) [21], where m is the number of distinct values of the discretized attribute and k is the number of incremental steps. This gap widens when the difference between the number of values in a continuous attribute and the number of produced intervals is large. Suppose a continuous attribute contains 1000 distinct values and is discretized into 50 intervals: a top-down approach generally requires only 50 steps, whereas a bottom-up approach would need 950 steps.
Finally, supervised discretization algorithms are expected to perform better than unsupervised ones, since they take the class information into account. Based on the above-mentioned reasons, we aimed at developing a static, global, incremental, supervised and top-down discretization algorithm. In the rest of this paper, our discussion of proposed discretization algorithms follows the axis of top-down versus bottom-up. More detailed discussions of the five axes can be found in [24].
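To make the top-down (splitting) strategy concrete, the greedy loop that incremental top-down algorithms share can be sketched as follows. Here `criterion` stands in for a goodness measure such as CAIM or CACC; all names are our own illustration, not the exact pseudocode of either algorithm:

```python
def top_down_discretize(values, labels, criterion, max_intervals=None):
    """Generic top-down (splitting) discretization: start with no inner
    cut-points and greedily add the candidate boundary that most improves
    criterion(cuts, values, labels); stop when no candidate improves it
    (or when max_intervals is reached)."""
    distinct = sorted(set(values))
    # midpoints between consecutive distinct values are the candidate cuts
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    cuts = []
    best = criterion(cuts, values, labels)
    while candidates and (max_intervals is None or len(cuts) + 1 < max_intervals):
        score, cut = max((criterion(sorted(cuts + [c]), values, labels), c)
                         for c in candidates)
        if score <= best:
            break  # stopping criterion: no candidate improves the measure
        best = score
        cuts.append(cut)
        candidates.remove(cut)
    return sorted(cuts)
```

Each iteration scans all remaining candidate boundaries, which is why an incremental top-down method needs only as many iterations as it produces intervals, in line with the 50-step versus 950-step comparison above.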
CAIM is the newest top-down discretization algorithm. In comparison with six state-of-the-art top-down discretization algorithms, experiments showed that, on average, CAIM generates a better discretization scheme. These experiments also showed that a classification algorithm which uses CAIM as a preprocessor to discretize the training data can, on average, produce the fewest rules and reach the highest classification accuracy [21]. However, the general goals of a discretization algorithm should be: (a) generating a high-quality discretization scheme to help experts understand the data more easily (the quality of a discretization scheme can be measured by the CAIR criterion, which is discussed in Section 2); (b) the generated discretization scheme should improve the accuracy and efficiency of a learning algorithm (for a decision tree algorithm, efficiency is evaluated by the number of rules and the training time); and (c) the discretization process should be as fast as possible. Although CAIM outperforms the other top-down methods in these aspects, it still has two drawbacks. First, the CAIM algorithm gives a high weight to the number of generated intervals when it discretizes an attribute; thus, CAIM usually generates a simple discretization scheme in which the number of intervals is very close to the number of target classes. Second, for each discretized interval, CAIM considers only the class with the most samples and ignores all the other target classes. Such a consideration can decrease the quality of the produced discretization scheme in some cases. These two observations motivated us to propose our Class-Attribute Contingency Coefficient (CACC) discretization algorithm. Detailed discussions and examples concerning CAIM are presented in Section 2.3.
CACC is inspired by the contingency coefficient. The main contribution of CACC is that it can generate a better discretization scheme (i.e., one with a higher CAIR value) whose use can improve the accuracy of classifiers such as C5.0. Regarding the time complexity of discretization, the number of generated rules and the execution time of a classifier, our approach also achieves promising results. The rest of the paper is organized as follows. In Section 2, we review related work. Section 3 presents our Class-Attribute Contingency Coefficient discretization algorithm. Experimental comparisons of seven discretization algorithms on 13 real datasets and four artificial datasets, together with a further evaluation of CACC, are presented in Section 4. Finally, conclusions are drawn in Section 5.
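For reference, the classical (Pearson) contingency coefficient that inspires CACC is C = sqrt(χ² / (χ² + N)), computed over the class-by-interval quanta matrix of a discretization scheme. The sketch below computes this classical coefficient only, not the exact CACC criterion defined in Section 3:

```python
import math

def contingency_coefficient(quanta):
    """Pearson's contingency coefficient C = sqrt(chi2 / (chi2 + N)),
    where quanta[i][j] counts the examples of class i in interval j."""
    N = sum(sum(row) for row in quanta)
    row_totals = [sum(row) for row in quanta]
    col_totals = [sum(col) for col in zip(*quanta)]
    chi2 = 0.0
    for i, row in enumerate(quanta):
        for j, q in enumerate(row):
            expected = row_totals[i] * col_totals[j] / N
            if expected:
                chi2 += (q - expected) ** 2 / expected
    return math.sqrt(chi2 / (chi2 + N))

# a scheme whose intervals separate the classes perfectly scores higher
print(contingency_coefficient([[10, 0], [0, 10]]))  # → 0.7071067811865476
print(contingency_coefficient([[5, 5], [5, 5]]))    # → 0.0
```

A quanta matrix whose intervals are independent of the classes scores zero, while a perfectly separating one scores sqrt(1/2) here, matching the intuition that a good scheme maximizes class-attribute interdependence.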
Section snippets
Related works
In this section, we review related work. Since we evaluate the performance of several discretization algorithms in Section 4 by using the well-known classification algorithm C5.0, we first give a brief introduction to classification in Section 2.1. Then, we review the proposed discretization algorithms along the axis of top-down versus bottom-up in Section 2.2. Finally, a detailed discussion of CAIM is given in Section 2.3.
Class-Attribute Contingency Coefficient discretization algorithm
As stated in the Introduction, a good discretization algorithm should generate a discretization scheme that maintains a high interdependence between the target class and the discretized attribute. As described in Section 2.3, CAIM gives a high weight to the number of generated intervals and does not consider the data distribution, whereas CDD can suffer from the overfitting problem. Thus, both methods can generate unreasonable discretization results in some cases. Let us review the age dataset in
Performance analysis
In this section, we compare the following seven discretization algorithms, all implemented in Microsoft Visual C++ 6.0, for performance analysis.
- 1. Equal Width and Equal Frequency: two typical unsupervised top-down methods;
- 2. CACC: the method proposed in this paper;
- 3. CAIM: the newest top-down method;
- 4. IEM: a famous and widely used top-down method;
- 5. ChiMerge: a typical bottom-up method;
- 6. Extended Chi2: the newest bottom-up approach.
Among the seven discretization algorithms, Equal Width, Equal Frequency and ChiMerge require the
Conclusions
Discretization algorithms have played an important role in data mining and knowledge discovery. They not only produce a concise summarization of continuous attributes to help experts understand the data more easily, but also make learning more accurate and faster. In this paper, we proposed a static, global, incremental, supervised and top-down discretization algorithm called CACC, designed to raise the quality of the generated discretization scheme by extending the idea of the contingency coefficient.
Acknowledgement
This research was partially supported by the National Science Council of the Republic of China under Grant No. NSC 96-2520-S-024-001. We are grateful to the authors of the CAIM algorithm for its important role in our work. We are also grateful to Chih-Wei Ku and Ya-Ru Yang for their help with the experiments in this paper.
References (38)
- Newspaper demand prediction and replacement model based on fuzzy clustering and rules, Information Sciences (2007)
- CLIP4: hybrid inductive machine learning algorithm that generates inequality rules, Information Sciences (2004)
- Decision-tree instance-space decomposition with grouped gain-ratio, Information Sciences (2007)
- Building multi-way decision trees with numerical attributes, Information Sciences (2004)
- A coloring fuzzy graph approach for image classification, Information Sciences (2006)
- Estimating sentence types in computer related new product bulletins using a decision tree, Information Sciences (2004)
- Classification methods in the detection of new malicious emails, Information Sciences (2005)
- A self-adaptive migration model genetic algorithm for data mining applications, Information Sciences (2007)
- An extension of the naive Bayesian classifier, Information Sciences (2006)
- Computing with words for text processing: an approach to the text categorization, Information Sciences (2006)
- Khiops: a statistical discretization method of continuous attributes, Machine Learning
- Class-dependent discretization for inductive learning from continuous and mixed mode data, IEEE Transactions on Pattern Analysis and Machine Intelligence
- Information discovery through hierarchical maximum entropy discretization and synthesis
- Hybrid inductive machine learning: an overview of CLIP algorithms
- The CN2 algorithm, Machine Learning
- Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research
- On the handling of continuous-valued attributes in decision tree generation, Machine Learning