Elsevier

Information Sciences

Volume 178, Issue 3, 1 February 2008, Pages 714-731

A discretization algorithm based on Class-Attribute Contingency Coefficient

https://doi.org/10.1016/j.ins.2007.09.004

Abstract

Discretization algorithms have played an important role in data mining and knowledge discovery. They not only produce a concise summarization of continuous attributes to help the experts understand the data more easily, but also make learning more accurate and faster. In this paper, we propose a static, global, incremental, supervised and top-down discretization algorithm based on Class-Attribute Contingency Coefficient. Empirical evaluation of seven discretization algorithms on 13 real datasets and four artificial datasets showed that the proposed algorithm could generate a better discretization scheme that improved the accuracy of classification. As to the execution time of discretization, the number of generated rules, and the training time of C5.0, our approach also achieved promising results.

Introduction

With the rapid development of information technology, electronic storage devices are widely used to record transactions. Since people are often unable to extract useful knowledge from such huge datasets, data mining [16] has become a research focus in recent years. Among the several functions of data mining, classification is crucially important and has been applied successfully to several areas such as automatic text summarization and categorization [17], [38], image classification [15], and virus detection in new malicious emails [31]. Although real-world data mining tasks often involve continuous attributes, some classification algorithms such as AQ [18], [26], CLIP [6], [7] and CN2 [8] can only handle categorical attributes, while others can handle continuous attributes but perform better on categorical ones [36]. To deal with this problem, many discretization algorithms have been proposed [11], [12], [22], [28].

Discretization is a technique for partitioning a continuous attribute into a finite set of adjacent intervals in order to generate an attribute with a small number of distinct values. Assume a dataset consists of M examples and S target classes. A discretization algorithm discretizes the continuous attribute A in this dataset into n discrete intervals {[d0, d1], (d1, d2], …, (dn−1, dn]}, where d0 is the minimal value and dn is the maximal value of attribute A. Such a result {[d0, d1], (d1, d2], …, (dn−1, dn]} is called a discretization scheme D on attribute A. A discretization scheme should maintain a high interdependency between the discretized attribute and the target class while avoiding changes to the distribution of the original data [2], [25], [33].
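
To make the notation concrete, the following is a minimal sketch (our own illustration, not code from the paper) of how a discretization scheme D, stored as its boundary list d0, …, dn, maps a continuous value of attribute A to the index of its interval. The "age" scheme in the usage example is hypothetical.

```cpp
#include <algorithm>
#include <iostream>
#include <vector>

// A discretization scheme D = {[d0, d1], (d1, d2], ..., (d_{n-1}, d_n]} is
// represented by its boundary list cuts = {d0, d1, ..., dn} (n intervals).
int intervalIndex(const std::vector<double>& cuts, double v) {
    // Find the first boundary d_r with d_r >= v; then v lies in (d_{r-1}, d_r].
    auto it = std::lower_bound(cuts.begin(), cuts.end(), v);
    int r = static_cast<int>(it - cuts.begin());
    int n = static_cast<int>(cuts.size()) - 1;   // number of intervals
    if (r <= 0) return 0;                        // v <= d0: first interval [d0, d1]
    if (r > n)  return n - 1;                    // v > dn: clamp to the last interval
    return r - 1;
}

int main() {
    // Hypothetical scheme for an "age" attribute: [0,18], (18,40], (40,65], (65,120]
    std::vector<double> cuts = {0, 18, 40, 65, 120};
    std::cout << intervalIndex(cuts, 30) << '\n';   // prints 1, i.e. interval (18,40]
    return 0;
}
```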

Discretization is usually performed prior to the learning process and has played an important role in data mining and knowledge discovery. Modern classification systems such as CLIP4 [7] also implement discretization algorithms as built-in functions. A good discretization algorithm not only produces a concise summarization of continuous attributes that helps experts and users understand the data more easily, but also makes learning more accurate and faster [24]. Discretization algorithms can be classified along five axes [24]: supervised versus unsupervised, static versus dynamic, global versus local, top-down (splitting) versus bottom-up (merging), and direct versus incremental.

  • 1.

    Supervised methods discretize attributes with the consideration of class information, while unsupervised methods do not.

  • 2.

    Dynamic methods consider the interdependence among the feature attributes and discretize continuous attributes while a classifier is being built. In contrast, static methods consider each attribute in isolation and complete the discretization prior to the learning task.

  • 3.

    Global methods, which use all instances to generate the discretization scheme, are usually associated with static methods. In contrast, local methods, in which only a subset of the instances is used for discretization, are usually associated with dynamic approaches.

  • 4.

    Bottom-up methods start with the complete list of all continuous values of the attribute as cut-points and remove some of them by merging intervals in each step. Top-down methods start with an empty list of cut-points and add new ones in each step.

  • 5.

    Direct methods, such as Equal Width and Equal Frequency [5], require the user to specify the number of intervals k and then discretize the continuous attribute into k intervals in a single pass (a minimal sketch of these two methods follows this list). Incremental methods, on the other hand, begin with a simple discretization scheme and pass through a refinement process, and some of them require a stopping criterion to terminate the discretization.
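
As a concrete illustration of the two direct methods named in item 5, here is a minimal sketch (our own simplified illustration, not the implementations compared in Section 4) of Equal Width and Equal Frequency cut-point generation for a user-chosen k.

```cpp
#include <algorithm>
#include <vector>

// Equal Width: k intervals of identical width over [min, max].
std::vector<double> equalWidthCuts(const std::vector<double>& values, int k) {
    auto mm = std::minmax_element(values.begin(), values.end());
    double lo = *mm.first, hi = *mm.second, w = (hi - lo) / k;
    std::vector<double> cuts;
    for (int i = 0; i <= k; ++i) cuts.push_back(lo + i * w);   // d0, d1, ..., dk
    return cuts;
}

// Equal Frequency: k intervals holding (roughly) the same number of examples.
// Ties across a boundary are not handled specially in this simplified version.
std::vector<double> equalFrequencyCuts(std::vector<double> values, int k) {
    std::sort(values.begin(), values.end());
    std::vector<double> cuts;
    cuts.push_back(values.front());                            // d0 = min
    for (int i = 1; i < k; ++i)
        cuts.push_back(values[i * values.size() / k]);         // quantile boundaries
    cuts.push_back(values.back());                             // dk = max
    return cuts;
}
```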

In recent years, many researchers have focused on developing dynamic discretization algorithms for particular learning algorithms. For example, Berzal et al. [14] built multi-way decision trees by using a dynamic discretization method in each internal node to reduce the size of the resulting decision trees. Their experiments showed that the accuracy of these compact decision trees was also preserved. Wu et al. [36] defined a distributional index and then proposed a dynamic discretization algorithm to enhance the decision accuracy of naïve Bayes classifiers. However, the advantage of static approaches over dynamic approaches is their independence from the learning algorithm [24]. In other words, a dataset discretized by a static discretization algorithm can be used with any classification algorithm that handles discrete attributes. Besides, since a bottom-up method starts with the complete list of all continuous values of the attribute as cut-points and then removes some of them by merging intervals in each step, its computational complexity is usually worse than that of a top-down method. For example, the time complexity for discretizing a single attribute in Extended Chi2, the newest bottom-up method, is O(km log m) [33], while that of the newest top-down method, CAIM, is O(m log m) [21], where m is the number of distinct values of the discretized attribute and k is the number of incremental steps. This gap widens when the difference between the number of values in a continuous attribute and the number of produced intervals is large. Suppose a continuous attribute contains 1000 different values and is discretized into 50 intervals: in general, a top-down approach requires only 50 steps, whereas a bottom-up approach would need 950 steps. Finally, supervised discretization algorithms are expected to perform better than unsupervised ones since they take the class information into account. Based on the above reasons, we aimed at developing a static, global, incremental, supervised and top-down discretization algorithm. For the rest of the present paper, our discussion of discretization algorithms follows the axis of top-down versus bottom-up. More detailed discussions about the five axes can be found in [24].
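
To make the step counts in the above example explicit (a sketch assuming each top-down step adds one cut-point and each bottom-up step merges one pair of adjacent intervals):

```latex
\text{top-down steps} \approx n = 50,
\qquad
\text{bottom-up steps} \approx m - n = 1000 - 50 = 950,
```

where m = 1000 is the number of distinct values and n = 50 is the number of produced intervals.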

CAIM is the newest top-down discretization algorithm. In comparison with six state-of-the-art top-down discretization algorithms, experiments showed that, on average, CAIM generates a better discretization scheme. These experiments also showed that a classification algorithm that uses CAIM as a preprocessor to discretize the training data can, on average, produce the smallest number of rules and reach the highest classification accuracy [21]. However, the general goals of a discretization algorithm should be: (a) generating a high-quality discretization scheme to help the experts understand the data more easily (the quality of a discretization scheme can be measured by the CAIR criterion, which is discussed in Section 2); (b) the generated discretization scheme should improve the accuracy and the efficiency of a learning algorithm (for a decision tree algorithm, efficiency is evaluated by the number of rules and the training time); and (c) the discretization process should be as fast as possible. Although CAIM outperforms the other top-down methods in these respects, it still has two drawbacks. First, the CAIM criterion gives a high weight to the number of generated intervals when it discretizes an attribute. Thus, CAIM usually generates a simple discretization scheme in which the number of intervals is very close to the number of target classes. Second, for each discretized interval, CAIM considers only the class with the most samples and ignores all the other target classes. Such a design can decrease the quality of the produced discretization scheme in some cases. These two observations motivated us to propose our Class-Attribute Contingency Coefficient (CACC) discretization algorithm. Detailed discussions and examples about CAIM are presented in Section 2.3.
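
For reference, the CAIM criterion as we recall it from [21] (reproduced here for illustration; the notation may differ slightly from the original) makes both drawbacks visible:

```latex
\mathrm{CAIM}(C, D \mid F) \;=\; \frac{1}{n}\sum_{r=1}^{n}\frac{\max_r^{\,2}}{M_{+r}},
```

where n is the number of intervals, max_r is the number of examples of the most frequent class within interval r, and M_{+r} is the total number of examples in interval r. The division by n favors schemes with few intervals, and the use of max_r alone ignores every class other than the majority class of each interval.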

CACC is inspired by the contingency coefficient. The main contribution of CACC is that it can generate a better discretization scheme (i.e., one with a higher CAIR value) whose use improves the accuracy of classifiers such as C5.0. With regard to the time complexity of discretization, the number of generated rules, and the execution time of a classifier, our approach also achieved promising results. The rest of the paper is organized as follows. In Section 2, we review related works. Section 3 presents our Class-Attribute Contingency Coefficient discretization algorithm. The experimental comparison of seven discretization algorithms on 13 real datasets and four artificial datasets, together with a further evaluation of CACC, is presented in Section 4. Finally, conclusions are presented in Section 5.
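
For context, the standard (Pearson) contingency coefficient from which CACC takes its name is, for a class-interval contingency table (this is textbook statistics, not a formula quoted from the paper):

```latex
C \;=\; \sqrt{\frac{\chi^{2}}{\chi^{2} + M}},
```

where χ² is Pearson's chi-square statistic computed over the class-interval counts and M is the total number of examples. CACC adapts this statistic into a criterion for scoring candidate discretization schemes.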

Section snippets

Related works

In this section, we review some related works. Since we evaluate the performance of several discretization algorithms in Section 4 by using the well-known classification algorithm C5.0, we first give a brief introduction to classification in Section 2.1. Then, we review the proposed discretization algorithms along the axis of top-down versus bottom-up in Section 2.2. Finally, a detailed discussion of CAIM is given in Section 2.3.

Class-Attribute Contingency Coefficient discretization algorithm

As stated in the Introduction, a good discretization algorithm should generate a discretization scheme that maintains a high interdependence between the target class and the discretized attribute. As described in Section 2.3, CAIM gives a high weight to the number of generated intervals and does not consider the data distribution, whereas CDD can suffer from the overfitting problem. Thus, both methods can generate irrational discretization results in some cases. Let us review the age dataset in
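
Since the full section is only excerpted here, the following skeleton shows only the general shape of an incremental top-down discretizer as characterized in the Introduction (a hypothetical sketch with the scoring criterion left abstract; it is not the paper's CACC pseudocode):

```cpp
#include <functional>
#include <set>
#include <vector>

struct Example { double value; int label; };   // one continuous attribute + target class

// Greedy top-down discretization: start with no interior cut-points and, at each
// step, add the candidate boundary that most improves the supplied criterion
// (e.g., CAIM or CACC). This sketch stops when no addition improves the score;
// a real algorithm would use its own stopping rule.
std::set<double> topDownDiscretize(
    const std::vector<Example>& data,
    const std::vector<double>& candidates,   // e.g., midpoints between adjacent distinct values
    const std::function<double(const std::vector<Example>&, const std::set<double>&)>& criterion) {
    std::set<double> cuts;                   // currently accepted interior cut-points
    double best = criterion(data, cuts);
    bool improved = true;
    while (improved) {
        improved = false;
        double bestCandidate = 0.0;
        for (double c : candidates) {
            if (cuts.count(c)) continue;     // already used as a boundary
            cuts.insert(c);                  // tentatively add this cut-point
            double score = criterion(data, cuts);
            cuts.erase(c);
            if (score > best) { best = score; bestCandidate = c; improved = true; }
        }
        if (improved) cuts.insert(bestCandidate);   // keep the single best addition
    }
    return cuts;
}
```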

Performance analysis

In this section, we compare the following seven discretization algorithms, implemented in Microsoft Visual C++ 6.0, for performance analysis.

  • 1.

    Equal Width and Equal Frequency: two typical unsupervised top-down methods;

  • 2.

    CACC: the method proposed in this paper;

  • 3.

    CAIM: the newest top-down method;

  • 4.

    IEM: a famous and widely used top-down method;

  • 5.

    ChiMerge: a typical bottom-up method;

  • 6.

    Extended Chi2: the newest bottom-up approach.

Among the seven discretization algorithms, Equal Width, Equal Frequency and ChiMerge require the

Conclusions

Discretization algorithms have played an important role in data mining and knowledge discovery. They not only produce a concise summarization of continuous attributes to help the experts understand the data more easily, but also make learning more accurate and faster. In this paper, we proposed a static, global, incremental, supervised and top-down discretization algorithm called CACC, in order to raise the quality of the generated discretization scheme by extending the idea of contingency

Acknowledgement

This research was partially supported by the National Science Council of the Republic of China under Grant No. NSC 96-2520-S-024-001. We are grateful to the authors of the CAIM algorithm for its important role in our work. We also thank Chih-Wei Ku and Ya-Ru Yang for their help with the experiments in this paper.

References (38)

  • C.L. Blake, C.J. Merz, UCI repository of machine learning databases, University of California, Department of...
  • M. Boulle, Khiops: a statistical discretization method of continuous attributes, Machine Learning (2004).
  • J.Y. Ching et al., Class-dependent discretization for inductive learning from continuous and mixed mode data, IEEE Transactions on Pattern Analysis and Machine Intelligence (1995).
  • D. Chiu et al., Information discovery through hierarchical maximum entropy discretization and synthesis.
  • K.J. Cios et al., Hybrid inductive machine learning: an overview of CLIP algorithms.
  • P. Clark et al., The CN2 algorithm, Machine Learning (1989).
  • J. Demsar, Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research (2006).
  • J. Dougherty, R. Kohavi, M. Sahami, Supervised and unsupervised discretization of continuous features, in: Proceeding...
  • U.M. Fayyad et al., On the handling of continuous-valued attributes in decision tree generation, Machine Learning (1992).