
2013 | Book

Principles of Data Mining


About this Book

Data Mining, the automatic extraction of implicit and potentially useful information from data, is increasingly used in commercial, scientific and other application areas.

Principles of Data Mining explains and explores the principal techniques of Data Mining: for classification, association rule mining and clustering. Each topic is clearly explained and illustrated by detailed worked examples, with a focus on algorithms rather than mathematical formalism. It is written for readers without a strong background in mathematics or statistics, and any formulae used are explained in detail.

This second edition has been expanded to include additional chapters on using frequent pattern trees for Association Rule Mining, comparing classifiers, ensemble classification and dealing with very large volumes of data.

Principles of Data Mining aims to help general readers develop the necessary understanding of what is inside the 'black box' so they can use commercial data mining packages discriminatingly, as well as enabling advanced readers or academic researchers to understand or contribute to future technical advances in the field.

Suitable as a textbook to support courses at undergraduate or postgraduate levels in a wide range of subjects including Computer Science, Business Studies, Marketing, Artificial Intelligence, Bioinformatics and Forensic Science.

Table of Contents

Frontmatter
1. Introduction to Data Mining
Abstract
This chapter gives a brief overview of the field of Data Mining. The topics covered are the data explosion, the knowledge discovery process, applications of data mining, labelled and unlabelled data, supervised learning: classification and numerical prediction, and unsupervised learning: association rules and clustering.
Max Bramer
2. Data for Data Mining
Abstract
This chapter introduces the standard formulation for the data input to data mining algorithms that will be assumed throughout this book. It goes on to distinguish between different types of variable and to consider issues relating to the preparation of data prior to use, particularly the presence of missing data values and noise. The UCI Repository of datasets is introduced.
Max Bramer
3. Introduction to Classification: Naïve Bayes and Nearest Neighbour
Abstract
This chapter introduces classification, one of the most common data mining tasks. Two classification algorithms are described in detail: the Naïve Bayes algorithm, which uses probability theory to find the most likely of the possible classifications, and Nearest Neighbour classification, which estimates the classification of an unseen instance using the classification of the instances ‘closest’ to it. Naïve Bayes generally assumes that all the attributes are categorical, whereas Nearest Neighbour assumes that they are continuous.
Max Bramer
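The Naïve Bayes calculation described in the abstract above can be sketched as follows. This is a minimal illustration on toy categorical data (the dataset is invented for this sketch, not taken from the book), and it omits refinements such as Laplace smoothing, so a zero frequency count zeroes out the whole product.

```python
from collections import Counter

def naive_bayes_classify(training, instance):
    """Classify one unseen instance with Naive Bayes over categorical
    attributes: pick the class maximising P(class) * prod P(attr=value | class)."""
    class_counts = Counter(label for _, label in training)
    n = len(training)
    best_class, best_score = None, -1.0
    for cls, cls_count in class_counts.items():
        score = cls_count / n  # prior probability of the class
        for i, value in enumerate(instance):
            matches = sum(1 for attrs, label in training
                          if label == cls and attrs[i] == value)
            score *= matches / cls_count  # conditional probability estimate
        if score > best_score:
            best_class, best_score = cls, score
    return best_class

# Toy weather-style dataset: (attribute tuple, class)
train = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rain", "mild"), "yes"),
    (("rain", "cool"), "yes"),
    (("overcast", "hot"), "yes"),
]
print(naive_bayes_classify(train, ("rain", "mild")))  # -> yes
```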
4. Using Decision Trees for Classification
Abstract
This chapter introduces the TDIDT (Top-Down Induction of Decision Trees) algorithm for inducing classification rules via the intermediate representation of a decision tree. The algorithm can always be applied provided the ‘adequacy condition’ holds for the instances in the training set. The chapter ends by distinguishing three types of reasoning: deduction, abduction and induction.
Max Bramer
5. Decision Tree Induction: Using Entropy for Attribute Selection
Abstract
This chapter examines some alternative strategies for selecting attributes at each stage of the TDIDT decision tree generation algorithm and compares the size of the resulting trees for a number of datasets. The risk of obtaining decision trees that are entirely meaningless is highlighted, pointing to the importance of a good choice of attribute selection strategy. One of the most widely used strategies is based on minimising entropy (or equivalently maximising information gain) and this approach is illustrated in detail.
Max Bramer
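The entropy-minimisation (equivalently, information-gain-maximisation) strategy described in the abstract above can be sketched as follows; TDIDT would compute this gain for every candidate attribute at each node and split on the one with the highest value. The example data is illustrative, not from the book.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of class labels: -sum p * log2(p)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(instances, labels, attr_index):
    """Entropy of the whole set minus the weighted average entropy of the
    subsets produced by splitting on one attribute."""
    n = len(labels)
    subsets = {}
    for attrs, label in zip(instances, labels):
        subsets.setdefault(attrs[attr_index], []).append(label)
    weighted = sum(len(sub) / n * entropy(sub) for sub in subsets.values())
    return entropy(labels) - weighted

instances = [("sunny",), ("sunny",), ("rain",), ("rain",)]
labels = ["no", "no", "yes", "yes"]
print(entropy(labels))                        # -> 1.0
print(information_gain(instances, labels, 0)) # -> 1.0 (the split is perfect)
```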
6. Decision Tree Induction: Using Frequency Tables for Attribute Selection
Abstract
This chapter describes an alternative method of calculating the average entropy of the training (sub)sets resulting from splitting on an attribute, which uses frequency tables. It is shown to be equivalent to the method used in Chapter 5 but requires less computation. Two alternative attribute selection criteria, the Gini Index of Diversity and the χ² statistic, are illustrated and it is shown how they can also be calculated using a frequency table.
The important issue of inductive bias is introduced. This leads to a description of a further attribute selection criterion, Gain Ratio, which was introduced as a way of overcoming the bias of the entropy minimisation method, which is undesirable for some datasets.
Max Bramer
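The Gini Index computation from a frequency table, as described above, can be sketched like this. The frequency table here is invented for illustration (rows are attribute values, columns are class counts); an attribute selection strategy based on this measure would prefer the split with the lowest weighted Gini value.

```python
def gini(counts):
    """Gini Index of Diversity for one list of class counts: 1 - sum(p_i^2)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted_gini(freq_table):
    """Average Gini over the subsets produced by splitting on an attribute.
    freq_table: {attribute_value: [count_class1, count_class2, ...]}."""
    n = sum(sum(row) for row in freq_table.values())
    return sum(sum(row) / n * gini(row) for row in freq_table.values())

# Hypothetical frequency table for a three-valued attribute, two classes
table = {"sunny": [2, 3], "overcast": [4, 0], "rain": [3, 2]}
print(round(weighted_gini(table), 4))  # -> 0.3429
```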
7. Estimating the Predictive Accuracy of a Classifier
Abstract
This chapter is concerned with estimating the performance of a classifier (of any kind). Three methods are described for estimating a classifier’s predictive accuracy. The first of these is to divide the data available into a training set used for generating the classifier and a test set used for evaluating its performance. The other methods are k-fold cross-validation and its extreme form N-fold (or leave-one-out) cross-validation.
A statistical measure of the accuracy of an estimate formed using any of these methods, known as the standard error, is introduced. Experiments to estimate the predictive accuracy of the classifiers generated for various datasets are described, including datasets with missing attribute values. Finally, a tabular way of presenting classifier performance information called a confusion matrix is introduced, together with the notion of true and false positive and negative classifications.
Max Bramer
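The two ideas above can be sketched briefly: partitioning instance indices into k folds for cross-validation, and the standard error of an estimated predictive accuracy p over N test instances, taken here as sqrt(p(1-p)/N). The helper names are my own, not from the book.

```python
import math

def standard_error(accuracy, n_test):
    """Standard error of an estimated predictive accuracy p measured on
    N test instances: sqrt(p * (1 - p) / N)."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)

def k_fold_indices(n, k):
    """Partition instance indices 0..n-1 into k folds of near-equal size;
    each fold serves once as the test set, the rest as the training set."""
    folds = [[] for _ in range(k)]
    for i in range(n):
        folds[i % k].append(i)
    return [(sorted(set(range(n)) - set(f)), f) for f in folds]

print(round(standard_error(0.9, 100), 4))  # -> 0.03
for train_idx, test_idx in k_fold_indices(10, 3):
    print(len(train_idx), len(test_idx))
```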
8. Continuous Attributes
Abstract
This chapter looks at the question of how to convert a continuous attribute to a categorical one, a process known as discretisation. This is important as many data mining algorithms, including TDIDT, require all attributes to take categorical values.
Two different types of discretisation are distinguished, known as local and global discretisation. The process of extending the TDIDT algorithm by adding local discretisation of continuous attributes is illustrated in detail, followed by a description of the ChiMerge algorithm for global discretisation. The effectiveness of the two methods is compared for the TDIDT algorithm for a number of datasets.
Max Bramer
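The χ² computation at the heart of the ChiMerge algorithm mentioned above can be sketched for one pair of adjacent intervals; ChiMerge repeatedly merges the adjacent pair with the smallest value until all values exceed a threshold. The merging loop is omitted here, and this sketch simply skips cells whose expected frequency is zero, which a full implementation would handle more carefully.

```python
def chi2_adjacent(row_a, row_b):
    """Chi-squared statistic for two adjacent intervals, each given as a
    list of class frequencies: sum over cells of (observed - expected)^2
    / expected, with expected = row_total * column_total / grand_total."""
    cols = [a + b for a, b in zip(row_a, row_b)]
    grand = sum(cols)
    chi2 = 0.0
    for row in (row_a, row_b):
        row_total = sum(row)
        for observed, col_total in zip(row, cols):
            expected = row_total * col_total / grand
            if expected > 0:  # sketch only; real ChiMerge adjusts zero cells
                chi2 += (observed - expected) ** 2 / expected
    return chi2

print(chi2_adjacent([2, 0], [0, 2]))  # -> 4.0 (very different distributions)
print(chi2_adjacent([1, 1], [1, 1]))  # -> 0.0 (identical: merge candidates)
```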
9. Avoiding Overfitting of Decision Trees
Abstract
This chapter begins by examining techniques for dealing with clashes (i.e. inconsistent instances) in a training set. This leads to a discussion of methods for avoiding or reducing overfitting of a decision tree to training data. Overfitting arises when a decision tree is excessively dependent on irrelevant features of the training data with the result that its predictive power for unseen instances is reduced.
Two approaches to avoiding overfitting are distinguished: pre-pruning (generating a tree with fewer branches than would otherwise be the case) and post-pruning (generating a tree in full and then removing parts of it). Results are given for pre-pruning using either a size or a maximum depth cutoff. A method of post-pruning a decision tree based on comparing the static and backed-up estimated error rates at each node is also described.
Max Bramer
10. More About Entropy
Abstract
This chapter returns to the subject of the entropy of a training set. It explains the concept of entropy in detail using the idea of coding information using bits. The important result that when using the TDIDT algorithm information gain must be positive or zero is discussed, followed by the use of information gain as a method of feature reduction for classification tasks.
Max Bramer
11. Inducing Modular Rules for Classification
Abstract
This chapter begins by considering a method of post-pruning decision rules generated via a decision tree, which has the property that the pruned rules will not generally fit together to form a tree. Rules of this kind are known as modular rules. When using modular rules to classify unseen test data a conflict resolution strategy is needed and several possibilities for this are discussed. The use of a decision tree as an intermediate representation for rules is identified as a source of overfitting.
The Prism algorithm induces modular classification rules directly from a training set. Prism is described in detail, followed by a discussion of its performance as a classification algorithm relative to TDIDT.
Max Bramer
12. Measuring the Performance of a Classifier
Abstract
This chapter looks at the use of true and false positive and negative classifications as a better way of measuring the performance of a classifier than predictive accuracy alone. Other performance measures can be derived from these four basic ones, including true positive rate (or hit rate), false positive rate (or false alarm rate), precision, accuracy and F1 score.
The values of true positive rate and false positive rate are often represented diagrammatically by a ROC graph. Joining the points on a ROC graph to form a ROC curve can often give insight into the best way of tuning a classifier. A Euclidean distance measure of the difference between a given classifier and the performance of a hypothetical perfect classifier is described.
Max Bramer
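The derived measures listed above follow directly from the four confusion-matrix counts; a minimal sketch (the example counts are invented):

```python
import math

def performance(tp, fp, fn, tn):
    """Derive common performance measures from the four basic counts."""
    tp_rate = tp / (tp + fn)          # hit rate / recall
    fp_rate = fp / (fp + tn)          # false alarm rate
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * tp_rate / (precision + tp_rate)
    # Euclidean distance from this classifier's point (fp_rate, tp_rate)
    # on a ROC graph to the hypothetical perfect classifier at (0, 1).
    distance = math.sqrt(fp_rate ** 2 + (1 - tp_rate) ** 2)
    return tp_rate, fp_rate, precision, accuracy, f1, distance

print(performance(tp=8, fp=2, fn=2, tn=8))
```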
13. Dealing with Large Volumes of Data
Abstract
This chapter is concerned with issues relating to large volumes of data, in particular the ability of classification algorithms to scale up to be usable for such volumes.
Some of the ways in which a classification task could be distributed over a local area network of personal computers are described and a case study using an extended version of the Prism rule induction algorithm known as PMCRI is presented. Techniques for evaluating a distributed system of this kind are then illustrated.
The issue of streaming data is also considered, leading to a discussion of a classification algorithm that lends itself well to an incremental approach: the Naïve Bayes classifier.
Max Bramer
14. Ensemble Classification
Abstract
This chapter is concerned with ensemble classification, i.e. using a set of classifiers to classify unseen data rather than just a single one. Each classifier in the ensemble predicts a classification for each unseen instance and their predictions are then combined using some form of voting system.
The idea of a random forest of classifiers is introduced and issues relating to the selection of a different training set and/or a different set of attributes from a given dataset when constructing each of the classifiers are discussed.
A number of alternative ways of combining the classifications produced by an ensemble of classifiers are considered. The chapter concludes with a brief discussion of a distributed processing approach to dealing with the large amount of computation often required to generate an ensemble.
Max Bramer
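The two building blocks above can be sketched simply: majority voting to combine an ensemble's predictions, and bootstrap sampling to give each classifier a different training set, as is done when building a random forest. Both helpers are illustrative names of my own.

```python
import random
from collections import Counter

def majority_vote(predictions):
    """Combine an ensemble's predictions for one unseen instance by simple
    majority voting; ties are broken by first-seen order."""
    return Counter(predictions).most_common(1)[0][0]

def bootstrap_sample(dataset, rng):
    """Draw a training set of the same size, sampling with replacement,
    so each classifier in the ensemble sees a different training set."""
    return [rng.choice(dataset) for _ in dataset]

print(majority_vote(["yes", "no", "yes", "yes", "no"]))  # -> yes

rng = random.Random(0)
data = list(range(10))
print(bootstrap_sample(data, rng))  # same size, duplicates likely
```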
15. Comparing Classifiers
Abstract
This chapter considers how to compare the performance of alternative classifiers across a range of datasets. The commonly used paired t-test is described and illustrated with worked examples, leading to the use of confidence intervals when the predictive accuracies of two classifiers are found to be significantly different.
Pitfalls involved in comparing classifiers are discussed, leading to alternative ways of comparing their performance that do not rely on comparisons of predictive accuracy.
Max Bramer
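The paired t-test statistic mentioned above can be sketched as follows for two classifiers evaluated on the same datasets; the accuracy figures are invented, and in practice the statistic would be compared against a critical value from a t-table with n - 1 degrees of freedom.

```python
import math

def paired_t_statistic(acc_a, acc_b):
    """t statistic for the paired differences between two classifiers'
    accuracies over the same datasets: mean(d) / (sd(d) / sqrt(n)),
    using the sample standard deviation (divisor n - 1)."""
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)

acc_a = [0.90, 0.80, 0.85, 0.90]  # classifier A, four datasets (invented)
acc_b = [0.85, 0.77, 0.78, 0.85]  # classifier B on the same datasets
print(round(paired_t_statistic(acc_a, acc_b), 2))  # -> 6.12
```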
16. Association Rule Mining I
Abstract
This chapter looks at the problem of finding any rules of interest that can be derived from a given dataset, not just classification rules as before. This is known as Association Rule Mining or Generalised Rule Induction. A number of measures of rule interestingness are defined and criteria for choosing between measures are discussed. An algorithm for finding the best N rules that can be generated from a dataset using the J-measure of the information content of a rule and a ‘beam search’ strategy is described.
Max Bramer
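The J-measure mentioned above (due to Smyth and Goodman) can be sketched as follows. The formula used here, p(LHS) times the cross-entropy between the prior p(RHS) and the posterior p(RHS | LHS), is the standard definition; the sketch assumes the prior is strictly between 0 and 1.

```python
import math

def j_measure(p_lhs, p_rhs, p_rhs_given_lhs):
    """J-measure of a rule 'if LHS then RHS': p(LHS) * j, where j is the
    cross-entropy between prior p(RHS) and posterior p(RHS | LHS).
    Assumes 0 < p_rhs < 1."""
    p, q = p_rhs_given_lhs, p_rhs
    term = 0.0
    if p > 0:
        term += p * math.log2(p / q)
    if p < 1:
        term += (1 - p) * math.log2((1 - p) / (1 - q))
    return p_lhs * term

print(j_measure(0.5, 0.5, 0.5))  # -> 0.0 (rule tells us nothing)
print(j_measure(0.5, 0.5, 1.0))  # -> 0.5 (rule is maximally informative)
```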
17. Association Rule Mining II
Abstract
This chapter is concerned with a special form of Association Rule Mining known as Market Basket Analysis, the most common application of which is to relate the purchases made by the customers in a shop. An approach to finding rules of this kind, with support and confidence measures above specified thresholds, is described. This is based on the idea of supported itemsets. The Apriori algorithm for finding supported itemsets is described in detail. Further rule interestingness measures, lift and leverage, which can be used to reduce the number of rules generated are introduced.
Max Bramer
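The four measures named above (support, confidence, lift and leverage) can be computed directly from transaction counts; a minimal sketch on an invented set of baskets:

```python
def rule_measures(transactions, lhs, rhs):
    """Support, confidence, lift and leverage of the rule lhs -> rhs over
    a list of transactions (each a set of items)."""
    n = len(transactions)
    lhs, rhs = set(lhs), set(rhs)
    count_lhs = sum(1 for t in transactions if lhs <= t)
    count_rhs = sum(1 for t in transactions if rhs <= t)
    count_both = sum(1 for t in transactions if (lhs | rhs) <= t)
    support = count_both / n
    confidence = count_both / count_lhs
    lift = confidence / (count_rhs / n)
    leverage = support - (count_lhs / n) * (count_rhs / n)
    return support, confidence, lift, leverage

baskets = [{"milk", "bread"}, {"milk", "bread", "butter"},
           {"bread"}, {"milk", "butter"}]
print(rule_measures(baskets, {"milk"}, {"bread"}))
```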
18. Association Rule Mining III: Frequent Pattern Trees
Abstract
This chapter introduces the FP-growth algorithm for extracting frequent itemsets from a database of transactions. First the database is processed to produce a data structure called an FP-tree, then the tree is processed recursively by constructing a sequence of reduced trees known as conditional FP-trees, from which the frequent itemsets are extracted. The algorithm has the very desirable feature of requiring only two scans through the database.
Max Bramer
19. Clustering
Abstract
This chapter continues with the theme of extracting information from unlabelled data. Clustering is concerned with grouping together objects that are similar to each other and dissimilar to objects belonging to other clusters.
There are many methods of clustering. Two of the most widely used, k-means clustering and hierarchical clustering, are described in detail.
Max Bramer
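The k-means method named above can be sketched as an assign-then-update loop; this minimal version takes initial centroids as an argument (real implementations usually choose them randomly) and uses invented two-dimensional points.

```python
import math

def k_means(points, centroids, iterations=10):
    """Basic k-means: assign each point to its nearest centroid, move each
    centroid to the mean of its assigned points, and repeat."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(tuple(sum(c) / len(cluster)
                                           for c in zip(*cluster)))
            else:
                new_centroids.append(centroids[i])  # keep an empty centroid
        centroids = new_centroids
    return centroids, clusters

pts = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.0)]
centroids, clusters = k_means(pts, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)  # -> [(1.25, 1.5), (8.5, 8.5)]
```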
20. Text Mining
Abstract
This chapter looks at a particular type of classification task, where the objects are text documents. A method of processing the documents for use by the classification algorithms given earlier in this book using a bag-of-words representation is described.
An important special case of text classification arises when the documents are web pages. The automatic classification of web pages is known as hypertext categorisation. The differences between standard text classification and hypertext categorisation are illustrated and issues relating to the latter are discussed.
Max Bramer
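The bag-of-words representation described in the abstract above can be sketched as follows: each document becomes a vector of term frequencies over a shared vocabulary, with word order discarded. The tokenisation here (lower-casing and whitespace splitting) is deliberately naive.

```python
from collections import Counter

def bag_of_words(documents):
    """Represent each document as a term-frequency vector over a shared
    vocabulary, discarding word order."""
    tokenised = [doc.lower().split() for doc in documents]
    vocabulary = sorted(set(word for doc in tokenised for word in doc))
    vectors = []
    for doc in tokenised:
        counts = Counter(doc)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

vocab, vecs = bag_of_words(["the cat sat", "the dog sat sat"])
print(vocab)  # -> ['cat', 'dog', 'sat', 'the']
print(vecs)   # -> [[1, 0, 1, 1], [0, 1, 2, 1]]
```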
Backmatter
Metadata
Title
Principles of Data Mining
Author
Max Bramer
Copyright Year
2013
Publisher
Springer London
Electronic ISBN
978-1-4471-4884-5
Print ISBN
978-1-4471-4883-8
DOI
https://doi.org/10.1007/978-1-4471-4884-5
