
2013 | Book

Principles of Data Mining


About this Book

Data Mining, the automatic extraction of implicit and potentially useful information from data, is increasingly used in commercial, scientific and other application areas.

Principles of Data Mining explains and explores the principal techniques of Data Mining: for classification, association rule mining and clustering. Each topic is clearly explained and illustrated by detailed worked examples, with a focus on algorithms rather than mathematical formalism. It is written for readers without a strong background in mathematics or statistics, and any formulae used are explained in detail.

This second edition has been expanded to include additional chapters on using frequent pattern trees for Association Rule Mining, comparing classifiers, ensemble classification and dealing with very large volumes of data.

Principles of Data Mining aims to help general readers develop the necessary understanding of what is inside the 'black box' so they can use commercial data mining packages discriminatingly, as well as enabling advanced readers or academic researchers to understand or contribute to future technical advances in the field.

Suitable as a textbook to support courses at undergraduate or postgraduate levels in a wide range of subjects including Computer Science, Business Studies, Marketing, Artificial Intelligence, Bioinformatics and Forensic Science.

Table of Contents

Frontmatter
1. Introduction to Data Mining
Abstract
This chapter gives a brief overview of the field of Data Mining. The topics covered are the data explosion, the knowledge discovery process, applications of data mining, labelled and unlabelled data, supervised learning: classification and numerical prediction, and unsupervised learning: association rules and clustering.
Max Bramer
2. Data for Data Mining
Abstract
This chapter introduces the standard formulation for the data input to data mining algorithms that will be assumed throughout this book. It goes on to distinguish between different types of variable and to consider issues relating to the preparation of data prior to use, particularly the presence of missing data values and noise. The UCI Repository of datasets is introduced.
Max Bramer
3. Introduction to Classification: Naïve Bayes and Nearest Neighbour
Abstract
This chapter introduces classification, one of the most common data mining tasks. Two classification algorithms are described in detail: the Naïve Bayes algorithm, which uses probability theory to find the most likely of the possible classifications, and Nearest Neighbour classification, which estimates the classification of an unseen instance using the classification of the instances ‘closest’ to it. Naïve Bayes generally assumes that all the attributes are categorical, whereas Nearest Neighbour assumes that they are continuous.
Max Bramer
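The Naïve Bayes calculation described in the abstract above can be sketched as follows. This is a minimal illustration on toy categorical data (the dataset is invented for this sketch, not taken from the book), and it omits refinements such as Laplace smoothing, so a zero frequency count zeroes out the whole product.

```python
from collections import Counter

def naive_bayes_classify(training, instance):
    """Classify one unseen instance with Naive Bayes over categorical
    attributes: pick the class maximising P(class) * prod P(attr=value | class)."""
    class_counts = Counter(label for _, label in training)
    n = len(training)
    best_class, best_score = None, -1.0
    for cls, cls_count in class_counts.items():
        score = cls_count / n  # prior probability of the class
        for i, value in enumerate(instance):
            matches = sum(1 for attrs, label in training
                          if label == cls and attrs[i] == value)
            score *= matches / cls_count  # conditional probability estimate
        if score > best_score:
            best_class, best_score = cls, score
    return best_class

# Toy weather-style dataset: (attribute tuple, class)
train = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rain", "mild"), "yes"),
    (("rain", "cool"), "yes"),
    (("overcast", "hot"), "yes"),
]
print(naive_bayes_classify(train, ("rain", "mild")))  # -> yes
```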
4. Using Decision Trees for Classification
Abstract
This chapter introduces the TDIDT (Top-Down Induction of Decision Trees) algorithm for inducing classification rules via the intermediate representation of a decision tree. The algorithm can always be applied provided the ‘adequacy condition’ holds for the instances in the training set. The chapter ends by distinguishing three types of reasoning: deduction, abduction and induction.
Max Bramer
5. Decision Tree Induction: Using Entropy for Attribute Selection
Abstract
This chapter examines some alternative strategies for selecting attributes at each stage of the TDIDT decision tree generation algorithm and compares the size of the resulting trees for a number of datasets. The risk of obtaining decision trees that are entirely meaningless is highlighted, pointing to the importance of a good choice of attribute selection strategy. One of the most widely used strategies is based on minimising entropy (or equivalently maximising information gain) and this approach is illustrated in detail.
Max Bramer
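The entropy-minimisation (equivalently, information-gain-maximisation) strategy described in the abstract above can be sketched as follows; TDIDT would compute this gain for every candidate attribute at each node and split on the one with the highest value. The example data is illustrative, not from the book.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of class labels: -sum p * log2(p)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(instances, labels, attr_index):
    """Entropy of the whole set minus the weighted average entropy of the
    subsets produced by splitting on one attribute."""
    n = len(labels)
    subsets = {}
    for attrs, label in zip(instances, labels):
        subsets.setdefault(attrs[attr_index], []).append(label)
    weighted = sum(len(sub) / n * entropy(sub) for sub in subsets.values())
    return entropy(labels) - weighted

instances = [("sunny",), ("sunny",), ("rain",), ("rain",)]
labels = ["no", "no", "yes", "yes"]
print(entropy(labels))                        # -> 1.0
print(information_gain(instances, labels, 0)) # -> 1.0 (the split is perfect)
```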
6. Decision Tree Induction: Using Frequency Tables for Attribute Selection
Abstract
This chapter describes an alternative method of calculating the average entropy of the training (sub)sets resulting from splitting on an attribute, which uses frequency tables. It is shown to be equivalent to the method used in Chapter 5 but requires less computation. Two alternative attribute selection criteria, the Gini Index of Diversity and the χ² statistic, are illustrated and it is shown how they can also be calculated using a frequency table.
The important issue of inductive bias is introduced. This leads to a description of a further attribute selection criterion, Gain Ratio, which was introduced as a way of overcoming the bias of the entropy minimisation method, which is undesirable for some datasets.
Max Bramer
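The Gini Index computation from a frequency table, as described above, can be sketched like this. The frequency table here is invented for illustration (rows are attribute values, columns are class counts); an attribute selection strategy based on this measure would prefer the split with the lowest weighted Gini value.

```python
def gini(counts):
    """Gini Index of Diversity for one list of class counts: 1 - sum(p_i^2)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted_gini(freq_table):
    """Average Gini over the subsets produced by splitting on an attribute.
    freq_table: {attribute_value: [count_class1, count_class2, ...]}."""
    n = sum(sum(row) for row in freq_table.values())
    return sum(sum(row) / n * gini(row) for row in freq_table.values())

# Hypothetical frequency table for a three-valued attribute, two classes
table = {"sunny": [2, 3], "overcast": [4, 0], "rain": [3, 2]}
print(round(weighted_gini(table), 4))  # -> 0.3429
```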
7. Estimating the Predictive Accuracy of a Classifier
Abstract
This chapter is concerned with estimating the performance of a classifier (of any kind). Three methods are described for estimating a classifier’s predictive accuracy. The first of these is to divide the data available into a training set used for generating the classifier and a test set used for evaluating its performance. The other methods are k-fold cross-validation and its extreme form N-fold (or leave-one-out) cross-validation.
A statistical measure of the accuracy of an estimate formed using any of these methods, known as the standard error, is introduced. Experiments to estimate the predictive accuracy of the classifiers generated for various datasets are described, including datasets with missing attribute values. Finally, a tabular way of presenting classifier performance information called a confusion matrix is introduced, together with the notion of true and false positive and negative classifications.
Max Bramer
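The two ideas above can be sketched briefly: partitioning instance indices into k folds for cross-validation, and the standard error of an estimated predictive accuracy p over N test instances, taken here as sqrt(p(1-p)/N). The helper names are my own, not from the book.

```python
import math

def standard_error(accuracy, n_test):
    """Standard error of an estimated predictive accuracy p measured on
    N test instances: sqrt(p * (1 - p) / N)."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)

def k_fold_indices(n, k):
    """Partition instance indices 0..n-1 into k folds of near-equal size;
    each fold serves once as the test set, the rest as the training set."""
    folds = [[] for _ in range(k)]
    for i in range(n):
        folds[i % k].append(i)
    return [(sorted(set(range(n)) - set(f)), f) for f in folds]

print(round(standard_error(0.9, 100), 4))  # -> 0.03
for train_idx, test_idx in k_fold_indices(10, 3):
    print(len(train_idx), len(test_idx))
```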
8. Continuous Attributes
Abstract
This chapter looks at the question of how to convert a continuous attribute to a categorical one, a process known as discretisation. This is important as many data mining algorithms, including TDIDT, require all attributes to take categorical values.
Two different types of discretisation are distinguished, known as local and global discretisation. The process of extending the TDIDT algorithm by adding local discretisation of continuous attributes is illustrated in detail, followed by a description of the ChiMerge algorithm for global discretisation. The effectiveness of the two methods is compared for the TDIDT algorithm for a number of datasets.
Max Bramer
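The χ² computation at the heart of the ChiMerge algorithm mentioned above can be sketched for one pair of adjacent intervals; ChiMerge repeatedly merges the adjacent pair with the smallest value until all values exceed a threshold. The merging loop is omitted here, and this sketch simply skips cells whose expected frequency is zero, which a full implementation would handle more carefully.

```python
def chi2_adjacent(row_a, row_b):
    """Chi-squared statistic for two adjacent intervals, each given as a
    list of class frequencies: sum over cells of (observed - expected)^2
    / expected, with expected = row_total * column_total / grand_total."""
    cols = [a + b for a, b in zip(row_a, row_b)]
    grand = sum(cols)
    chi2 = 0.0
    for row in (row_a, row_b):
        row_total = sum(row)
        for observed, col_total in zip(row, cols):
            expected = row_total * col_total / grand
            if expected > 0:  # sketch only; real ChiMerge adjusts zero cells
                chi2 += (observed - expected) ** 2 / expected
    return chi2

print(chi2_adjacent([2, 0], [0, 2]))  # -> 4.0 (very different distributions)
print(chi2_adjacent([1, 1], [1, 1]))  # -> 0.0 (identical: merge candidates)
```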
9. Avoiding Overfitting of Decision Trees
Abstract
This chapter begins by examining techniques for dealing with clashes (i.e. inconsistent instances) in a training set. This leads to a discussion of methods for avoiding or reducing overfitting of a decision tree to training data. Overfitting arises when a decision tree is excessively dependent on irrelevant features of the training data with the result that its predictive power for unseen instances is reduced.
Two approaches to avoiding overfitting are distinguished: pre-pruning (generating a tree with fewer branches than would otherwise be the case) and post-pruning (generating a tree in full and then removing parts of it). Results are given for pre-pruning using either a size or a maximum depth cutoff. A method of post-pruning a decision tree based on comparing the static and backed-up estimated error rates at each node is also described.
Max Bramer
10. More About Entropy
Abstract
This chapter returns to the subject of the entropy of a training set. It explains the concept of entropy in detail using the idea of coding information using bits. The important result that when using the TDIDT algorithm information gain must be positive or zero is discussed, followed by the use of information gain as a method of feature reduction for classification tasks.
Max Bramer
11. Inducing Modular Rules for Classification
Abstract
This chapter begins by considering a method of post-pruning decision rules generated via a decision tree, which has the property that the pruned rules will not generally fit together to form a tree. Rules of this kind are known as modular rules. When using modular rules to classify unseen test data a conflict resolution strategy is needed and several possibilities for this are discussed. The use of a decision tree as an intermediate representation for rules is identified as a source of overfitting.
The Prism algorithm induces modular classification rules directly from a training set. Prism is described in detail, followed by a discussion of its performance as a classification algorithm relative to TDIDT.
Max Bramer
12. Measuring the Performance of a Classifier
Abstract
This chapter looks at the use of true and false positive and negative classifications as a better way of measuring the performance of a classifier than predictive accuracy alone. Other performance measures can be derived from these four basic ones, including true positive rate (or hit rate), false positive rate (or false alarm rate), precision, accuracy and F1 score.
The values of true positive rate and false positive rate are often represented diagrammatically by a ROC graph. Joining the points on a ROC graph to form a ROC curve can often give insight into the best way of tuning a classifier. A Euclidean distance measure of the difference between a given classifier and the performance of a hypothetical perfect classifier is described.
Max Bramer
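The derived measures listed above follow directly from the four confusion-matrix counts; a minimal sketch (the example counts are invented):

```python
import math

def performance(tp, fp, fn, tn):
    """Derive common performance measures from the four basic counts."""
    tp_rate = tp / (tp + fn)          # hit rate / recall
    fp_rate = fp / (fp + tn)          # false alarm rate
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * tp_rate / (precision + tp_rate)
    # Euclidean distance from this classifier's point (fp_rate, tp_rate)
    # on a ROC graph to the hypothetical perfect classifier at (0, 1).
    distance = math.sqrt(fp_rate ** 2 + (1 - tp_rate) ** 2)
    return tp_rate, fp_rate, precision, accuracy, f1, distance

print(performance(tp=8, fp=2, fn=2, tn=8))
```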
13. Dealing with Large Volumes of Data
Abstract
This chapter is concerned with issues relating to large volumes of data, in particular the ability of classification algorithms to scale up to be usable for such volumes.
Some of the ways in which a classification task could be distributed over a local area network of personal computers are described and a case study using an extended version of the Prism rule induction algorithm known as PMCRI is presented. Techniques for evaluating a distributed system of this kind are then illustrated.
The issue of streaming data is also considered, leading to a discussion of a classification algorithm that lends itself well to an incremental approach: the Naïve Bayes classifier.
Max Bramer
14. Ensemble Classification
Abstract
This chapter is concerned with ensemble classification, i.e. using a set of classifiers to classify unseen data rather than just a single one. Each classifier in the ensemble predicts a classification for each unseen instance and their predictions are then combined using some form of voting system.
The idea of a random forest of classifiers is introduced and issues relating to the selection of a different training set and/or a different set of attributes from a given dataset when constructing each of the classifiers are discussed.
A number of alternative ways of combining the classifications produced by an ensemble of classifiers are considered. The chapter concludes with a brief discussion of a distributed processing approach to dealing with the large amount of computation often required to generate an ensemble.
Max Bramer
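The two building blocks above can be sketched simply: majority voting to combine an ensemble's predictions, and bootstrap sampling to give each classifier a different training set, as is done when building a random forest. Both helpers are illustrative names of my own.

```python
import random
from collections import Counter

def majority_vote(predictions):
    """Combine an ensemble's predictions for one unseen instance by simple
    majority voting; ties are broken by first-seen order."""
    return Counter(predictions).most_common(1)[0][0]

def bootstrap_sample(dataset, rng):
    """Draw a training set of the same size, sampling with replacement,
    so each classifier in the ensemble sees a different training set."""
    return [rng.choice(dataset) for _ in dataset]

print(majority_vote(["yes", "no", "yes", "yes", "no"]))  # -> yes

rng = random.Random(0)
data = list(range(10))
print(bootstrap_sample(data, rng))  # same size, duplicates likely
```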
15. Comparing Classifiers
Abstract
This chapter considers how to compare the performance of alternative classifiers across a range of datasets. The commonly used paired t-test is described and illustrated with worked examples, leading to the use of confidence intervals when the predictive accuracies of two classifiers are found to be significantly different.
Pitfalls involved in comparing classifiers are discussed, leading to alternative ways of comparing their performance that do not rely on comparisons of predictive accuracy.
Max Bramer
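The paired t-test statistic mentioned above can be sketched as follows for two classifiers evaluated on the same datasets; the accuracy figures are invented, and in practice the statistic would be compared against a critical value from a t-table with n - 1 degrees of freedom.

```python
import math

def paired_t_statistic(acc_a, acc_b):
    """t statistic for the paired differences between two classifiers'
    accuracies over the same datasets: mean(d) / (sd(d) / sqrt(n)),
    using the sample standard deviation (divisor n - 1)."""
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)

acc_a = [0.90, 0.80, 0.85, 0.90]  # classifier A, four datasets (invented)
acc_b = [0.85, 0.77, 0.78, 0.85]  # classifier B on the same datasets
print(round(paired_t_statistic(acc_a, acc_b), 2))  # -> 6.12
```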
16. Association Rule Mining I
Abstract
This chapter looks at the problem of finding any rules of interest that can be derived from a given dataset, not just classification rules as before. This is known as Association Rule Mining or Generalised Rule Induction. A number of measures of rule interestingness are defined and criteria for choosing between measures are discussed. An algorithm for finding the best N rules that can be generated from a dataset using the J-measure of the information content of a rule and a ‘beam search’ strategy is described.
Max Bramer
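The J-measure mentioned above (due to Smyth and Goodman) can be sketched as follows. The formula used here, p(LHS) times the cross-entropy between the prior p(RHS) and the posterior p(RHS | LHS), is the standard definition; the sketch assumes the prior is strictly between 0 and 1.

```python
import math

def j_measure(p_lhs, p_rhs, p_rhs_given_lhs):
    """J-measure of a rule 'if LHS then RHS': p(LHS) * j, where j is the
    cross-entropy between prior p(RHS) and posterior p(RHS | LHS).
    Assumes 0 < p_rhs < 1."""
    p, q = p_rhs_given_lhs, p_rhs
    term = 0.0
    if p > 0:
        term += p * math.log2(p / q)
    if p < 1:
        term += (1 - p) * math.log2((1 - p) / (1 - q))
    return p_lhs * term

print(j_measure(0.5, 0.5, 0.5))  # -> 0.0 (rule tells us nothing)
print(j_measure(0.5, 0.5, 1.0))  # -> 0.5 (rule is maximally informative)
```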
17. Association Rule Mining II
Abstract
This chapter is concerned with a special form of Association Rule Mining known as Market Basket Analysis, the most common application of which is to relate the purchases made by the customers in a shop. An approach to finding rules of this kind, with support and confidence measures above specified thresholds, is described. This is based on the idea of supported itemsets. The Apriori algorithm for finding supported itemsets is described in detail. Further rule interestingness measures, lift and leverage, which can be used to reduce the number of rules generated are introduced.
Max Bramer
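The four measures named above (support, confidence, lift and leverage) can be computed directly from transaction counts; a minimal sketch on an invented set of baskets:

```python
def rule_measures(transactions, lhs, rhs):
    """Support, confidence, lift and leverage of the rule lhs -> rhs over
    a list of transactions (each a set of items)."""
    n = len(transactions)
    lhs, rhs = set(lhs), set(rhs)
    count_lhs = sum(1 for t in transactions if lhs <= t)
    count_rhs = sum(1 for t in transactions if rhs <= t)
    count_both = sum(1 for t in transactions if (lhs | rhs) <= t)
    support = count_both / n
    confidence = count_both / count_lhs
    lift = confidence / (count_rhs / n)
    leverage = support - (count_lhs / n) * (count_rhs / n)
    return support, confidence, lift, leverage

baskets = [{"milk", "bread"}, {"milk", "bread", "butter"},
           {"bread"}, {"milk", "butter"}]
print(rule_measures(baskets, {"milk"}, {"bread"}))
```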
18. Association Rule Mining III: Frequent Pattern Trees
Abstract
This chapter introduces the FP-growth algorithm for extracting frequent itemsets from a database of transactions. First the database is processed to produce a data structure called an FP-tree, then the tree is processed recursively by constructing a sequence of reduced trees known as conditional FP-trees, from which the frequent itemsets are extracted. The algorithm has the very desirable feature of requiring only two scans through the database.
Max Bramer
19. Clustering
Abstract
This chapter continues with the theme of extracting information from unlabelled data. Clustering is concerned with grouping together objects that are similar to each other and dissimilar to objects belonging to other clusters.
There are many methods of clustering. Two of the most widely used, k-means clustering and hierarchical clustering, are described in detail.
Max Bramer
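The k-means method named above can be sketched as an assign-then-update loop; this minimal version takes initial centroids as an argument (real implementations usually choose them randomly) and uses invented two-dimensional points.

```python
import math

def k_means(points, centroids, iterations=10):
    """Basic k-means: assign each point to its nearest centroid, move each
    centroid to the mean of its assigned points, and repeat."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(tuple(sum(c) / len(cluster)
                                           for c in zip(*cluster)))
            else:
                new_centroids.append(centroids[i])  # keep an empty centroid
        centroids = new_centroids
    return centroids, clusters

pts = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.0)]
centroids, clusters = k_means(pts, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)  # -> [(1.25, 1.5), (8.5, 8.5)]
```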
20. Text Mining
Abstract
This chapter looks at a particular type of classification task, where the objects are text documents. A method of processing the documents for use by the classification algorithms given earlier in this book using a bag-of-words representation is described.
An important special case of text classification arises when the documents are web pages. The automatic classification of web pages is known as hypertext categorisation. The differences between standard text classification and hypertext categorisation are illustrated and issues relating to the latter are discussed.
Max Bramer
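The bag-of-words representation described in the abstract above can be sketched as follows: each document becomes a vector of term frequencies over a shared vocabulary, with word order discarded. The tokenisation here (lower-casing and whitespace splitting) is deliberately naive.

```python
from collections import Counter

def bag_of_words(documents):
    """Represent each document as a term-frequency vector over a shared
    vocabulary, discarding word order."""
    tokenised = [doc.lower().split() for doc in documents]
    vocabulary = sorted(set(word for doc in tokenised for word in doc))
    vectors = []
    for doc in tokenised:
        counts = Counter(doc)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

vocab, vecs = bag_of_words(["the cat sat", "the dog sat sat"])
print(vocab)  # -> ['cat', 'dog', 'sat', 'the']
print(vecs)   # -> [[1, 0, 1, 1], [0, 1, 2, 1]]
```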
Backmatter
Metadata
Title
Principles of Data Mining
Author
Max Bramer
Copyright Year
2013
Publisher
Springer London
Electronic ISBN
978-1-4471-4884-5
Print ISBN
978-1-4471-4883-8
DOI
https://doi.org/10.1007/978-1-4471-4884-5
