
2010 | Book

Data Mining and Knowledge Discovery Handbook

Edited by: Oded Maimon, Lior Rokach

Publisher: Springer US

Included in: Springer Professional "Wirtschaft+Technik", Springer Professional "Technik", Springer Professional "Wirtschaft"


About this book

Knowledge Discovery demonstrates intelligent computing at its best, and is the most desirable and interesting end-product of Information Technology. To be able to discover and to extract knowledge from data is a task that many researchers and practitioners are endeavoring to accomplish. There is a lot of hidden knowledge waiting to be discovered – this is the challenge created by today’s abundance of data.

Data Mining and Knowledge Discovery Handbook, Second Edition organizes the most current concepts, theories, standards, methodologies, trends, challenges and applications of data mining (DM) and knowledge discovery in databases (KDD) into a coherent and unified repository. This handbook first surveys, then provides comprehensive yet concise algorithmic descriptions of methods, including classic methods plus the extensions and novel methods developed recently. This volume concludes with in-depth descriptions of data mining applications in various interdisciplinary industries including finance, marketing, medicine, biology, engineering, telecommunications, software, and security.

Data Mining and Knowledge Discovery Handbook, Second Edition is designed for research scientists, libraries and advanced-level students in computer science and engineering as a reference. This handbook is also suitable for professionals in industry, for computing applications, information systems management, and strategic research management.

Table of Contents

Frontmatter

1. Introduction to Knowledge Discovery and Data Mining

Knowledge Discovery in Databases (KDD) is an automatic, exploratory analysis and modeling of large data repositories. KDD is the organized process of identifying valid, novel, useful, and understandable patterns from large and complex data sets.

Data Mining (DM) is the core of the KDD process, involving the inferring of algorithms that explore the data, develop the model and discover previously unknown patterns. The model is used for understanding phenomena from the data, analysis and prediction.

Oded Maimon, Lior Rokach

Preprocessing Methods

Frontmatter

2. Data Cleansing: A Prelude to Knowledge Discovery

This chapter analyzes the problem of data cleansing and the identification of potential errors in data sets. The differing views of data cleansing are surveyed and reviewed and a brief overview of existing data cleansing tools is given. A general framework of the data cleansing process is presented as well as a set of general methods that can be used to address the problem. The applicable methods include statistical outlier detection, pattern matching, clustering, and Data Mining techniques. The experimental results of applying these methods to a real world data set are also given. Finally, research directions necessary to further address the data cleansing problem are discussed.

Jonathan I. Maletic, Andrian Marcus

3. Handling Missing Attribute Values

In this chapter, methods of handling missing attribute values in Data Mining are described. These methods are categorized into sequential and parallel. In sequential methods, missing attribute values are first replaced by known values as a preprocessing step, and then knowledge is acquired from a data set in which all attribute values are known. In parallel methods, there is no preprocessing, i.e., knowledge is acquired directly from the original data sets. In this chapter the main emphasis is put on rule induction. Methods of handling attribute values for decision tree generation are only briefly summarized.

Jerzy W. Grzymala-Busse, Witold J. Grzymala-Busse

4. Geometric Methods for Feature Extraction and Dimensional Reduction - A Guided Tour

We give a tutorial overview of several geometric methods for feature extraction and dimensional reduction. We divide the methods into projective methods and methods that model the manifold on which the data lies. For projective methods, we review projection pursuit, principal component analysis (PCA), kernel PCA, probabilistic PCA, and oriented PCA; and for the manifold methods, we review multidimensional scaling (MDS), landmark MDS, Isomap, locally linear embedding, Laplacian eigenmaps and spectral clustering. The Nyström method, which links several of the algorithms, is also reviewed. The goal is to provide a self-contained review of the concepts and mathematics underlying these algorithms.

Christopher J.C. Burges

5. Dimension Reduction and Feature Selection

Data Mining algorithms search for meaningful patterns in raw data sets. The Data Mining process incurs a high computational cost when dealing with large data sets. Reducing dimensionality (the number of attributes or the number of records) can effectively cut this cost. This chapter focuses on a pre-processing step which removes dimensions from a given data set before it is fed to a data mining algorithm. This work explains how it is often possible to reduce dimensionality with minimum loss of information. A clear dimension reduction taxonomy is described and techniques for dimension reduction are presented theoretically.

Barak Chizi, Oded Maimon

6. Discretization Methods

Data-mining applications often involve quantitative data. However, learning from quantitative data is often less effective and less efficient than learning from qualitative data. Discretization addresses this issue by transforming quantitative data into qualitative data. This chapter presents a comprehensive introduction to discretization. It clarifies the definition of discretization. It provides a taxonomy of discretization methods together with a survey of major discretization methods. It also discusses issues that affect the design and application of discretization methods.

Ying Yang, Geoffrey I. Webb, Xindong Wu
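As an illustration of the idea described in this abstract, the following minimal sketch turns a quantitative attribute into a qualitative one using equal-width binning. Equal-width binning is only one simple discretization method and is chosen here as an assumption; the function name `equal_width_bins`, the sample ages, and the labels are hypothetical and not taken from the chapter.

```python
def equal_width_bins(values, n_bins):
    """Map each numeric value to a bin index in 0..n_bins-1 (equal-width intervals)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: everything falls into the first bin.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # min(...) keeps the maximum value in the last bin instead of overflowing.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical quantitative attribute and qualitative labels.
ages = [18, 22, 35, 41, 58, 60]
labels = ["young", "middle", "senior"]

bins = equal_width_bins(ages, 3)
age_groups = [labels[b] for b in bins]
print(age_groups)
```

Other schemes (for example equal-frequency or supervised, entropy-based splits) choose the cut points differently but produce the same kind of mapping.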

7. Outlier Detection

Outlier detection is a primary step in many data-mining applications. We present several methods for outlier detection, while distinguishing between univariate vs. multivariate techniques and parametric vs. nonparametric procedures. In the presence of outliers, special care should be taken to ensure the robustness of the estimators used. Outlier detection for Data Mining is often based on distance measures, clustering and spatial methods.

Irad Ben-Gal

Supervised Methods

Frontmatter

8. Supervised Learning

This chapter summarizes the fundamental aspects of supervised methods. The chapter provides an overview of concepts from various interrelated fields used in subsequent chapters. It presents basic definitions and arguments from the supervised machine learning literature and considers various issues, such as performance evaluation techniques and challenges for data mining tasks.

Lior Rokach, Oded Maimon

9. Classification Trees

Decision Trees are considered to be one of the most popular approaches for representing classifiers. Researchers from various disciplines such as statistics, machine learning, pattern recognition, and Data Mining have dealt with the issue of growing a decision tree from available data. This paper presents an updated survey of current methods for constructing decision tree classifiers in a top-down manner. The chapter suggests a unified algorithmic framework for presenting these algorithms and describes various splitting criteria and pruning methodologies.

Lior Rokach, Oded Maimon

10. Bayesian Networks

Bayesian networks are today one of the most promising approaches to Data Mining and knowledge discovery in databases. This chapter reviews the fundamental aspects of Bayesian networks and some of their technical aspects, with a particular emphasis on the methods to induce Bayesian networks from different types of data. Basic notions are illustrated through the detailed descriptions of two Bayesian network applications: one to survey data and one to marketing data.

Paola Sebastiani, Maria M. Abad, Marco F. Ramoni

11. Data Mining within a Regression Framework

Regression analysis can imply a far wider range of statistical procedures than often appreciated. In this chapter, a number of common Data Mining procedures are discussed within a regression framework. These include non-parametric smoothers, classification and regression trees, bagging, and random forests. In each case, the goal is to characterize one or more of the distributional features of a response conditional on a set of predictors.

Richard A. Berk

12. Support Vector Machines

Support Vector Machines (SVMs) are a set of related methods for supervised learning, applicable to both classification and regression problems. An SVM classifier creates a maximum-margin hyperplane that lies in a transformed input space and splits the example classes, while maximizing the distance to the nearest cleanly split examples. The parameters of the solution hyperplane are derived from a quadratic programming optimization problem. Here, we provide several formulations, and discuss some key concepts.

Armin Shmilovici
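For orientation, the quadratic program mentioned in this abstract can be written in its standard hard-margin primal form. This is the textbook formulation, given here as an assumption; the abstract does not say which formulations the chapter presents. The symbols are introduced for illustration: training pairs $(x_i, y_i)$ with $y_i \in \{-1, +1\}$, a feature map $\phi$ into the transformed input space, a weight vector $w$, and an offset $b$.

```latex
\begin{aligned}
\min_{w,\,b}\quad & \tfrac{1}{2}\,\lVert w \rVert^{2} \\
\text{subject to}\quad & y_i \left( w^{\top} \phi(x_i) + b \right) \ge 1,
\qquad i = 1, \dots, n .
\end{aligned}
```

The distance from the hyperplane to the nearest example is $1 / \lVert w \rVert$, so minimizing $\lVert w \rVert^{2}$ maximizes the margin.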

13. Rule Induction

This chapter begins with a brief discussion of some problems associated with input data. Then different rule types are defined. Three representative rule induction methods: LEM1, LEM2, and AQ are presented. An idea of a classification system, where rule sets are utilized to classify new cases, is introduced. Methods to evaluate an error rate associated with classification of unseen cases using the rule set are described. Finally, some more advanced methods are listed.

Jerzy W. Grzymala-Busse

Unsupervised Methods

Frontmatter

14. A survey of Clustering Algorithms

This chapter presents a tutorial overview of the main clustering methods used in Data Mining. The goal is to provide a self-contained review of the concepts and the mathematics underlying clustering techniques. The chapter begins by providing measures and criteria that are used for determining whether two objects are similar or dissimilar. Then the clustering methods are presented, divided into: hierarchical, partitioning, density-based, model-based, grid-based, and soft-computing methods. Following the methods, the challenges of performing clustering in large data sets are discussed. Finally, the chapter presents how to determine the number of clusters.

Lior Rokach

15. Association Rules

Association rules are rules of the kind “70% of the customers who buy wine and cheese also buy grapes”. While the traditional field of application is market basket analysis, association rule mining has been applied to various fields since then, which has led to a number of important modifications and extensions. We discuss the most frequently applied approach that is central to many extensions, the Apriori algorithm, and briefly review some applications to other data types, well-known problems of rule evaluation via support and confidence, and extensions of or alternatives to the standard framework.

Frank Höppner
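To make the support and confidence measures named in this abstract concrete, here is a minimal sketch that evaluates a single rule on a toy set of baskets. It does not implement the Apriori algorithm; the baskets and the helper names `support` and `confidence` are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transactions."""
    joint = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return joint / base if base > 0 else 0.0


# Hypothetical market baskets.
baskets = [
    {"wine", "cheese", "grapes"},
    {"wine", "cheese"},
    {"wine", "cheese", "grapes"},
    {"bread", "cheese"},
    {"wine", "grapes"},
]

# Rule: {wine, cheese} -> {grapes}
rule_support = support(baskets, {"wine", "cheese", "grapes"})
rule_confidence = confidence(baskets, {"wine", "cheese"}, {"grapes"})
print(rule_support, rule_confidence)
```

Here two of the five baskets contain all three items, and two of the three baskets with wine and cheese also contain grapes, so the rule holds with support 2/5 and confidence 2/3.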

16. Frequent Set Mining

Frequent sets lie at the basis of many Data Mining algorithms. As a result, hundreds of algorithms have been proposed in order to solve the frequent set mining problem. In this chapter, we attempt to survey the most successful algorithms and techniques that try to solve this problem efficiently.

Bart Goethals

17. Constraint-based Data Mining

Knowledge Discovery in Databases (KDD) is a complex interactive process. The promising theoretical framework of inductive databases considers this to be essentially a querying process. It is enabled by a query language which can deal either with raw data or with patterns which hold in the data. Mining patterns turns out to be the so-called inductive query evaluation process, for which constraint-based Data Mining techniques have to be designed. An inductive query declaratively specifies the desired constraints, and algorithms are used to compute the patterns satisfying the constraints in the data. We survey important results of this active research domain. This chapter emphasizes a real breakthrough for hard problems concerning local pattern mining under various constraints and it points out the current directions of research as well.

Jean-Francois Boulicaut, Baptiste Jeudy

18. Link Analysis

Link analysis is a collection of techniques that operate on data that can be represented as nodes and links. This chapter surveys a variety of techniques including subgraph matching, finding cliques and K-plexes, maximizing spread of influence, visualization, finding hubs and authorities, and combining with traditional techniques (classification, clustering, etc). It also surveys applications including social network analysis, viral marketing, Internet search, fraud detection, and crime prevention.

Steve Donoho

Soft Computing Methods

Frontmatter

19. A Review of Evolutionary Algorithms for Data Mining

Evolutionary Algorithms (EAs) are stochastic search algorithms inspired by the process of neo-Darwinian evolution. The motivation for applying EAs to data mining is that they are robust, adaptive search techniques that perform a global search in the solution space. This chapter first presents a brief overview of EAs, focusing mainly on two kinds of EAs, viz. Genetic Algorithms (GAs) and Genetic Programming (GP). Then the chapter reviews the main concepts and principles used by EAs designed for solving several data mining tasks, namely: discovery of classification rules, clustering, attribute selection and attribute construction. Finally, it discusses Multi-Objective EAs, based on the concept of Pareto dominance, and their use in several data mining tasks.

Alex A. Freitas

20. A Review of Reinforcement Learning Methods

Reinforcement-Learning is learning how to best-react to situations, through trial and error. In the Machine-Learning community Reinforcement-Learning is researched with respect to artificial (machine) decision-makers, referred to as agents. The agents are assumed to be situated within an environment which behaves as a Markov Decision Process. This chapter provides a brief introduction to Reinforcement-Learning, and establishes its relation to Data-Mining. Specifically, the Reinforcement-Learning problem is defined; a few key ideas for solving it are described; the relevance to Data-Mining is explained; and an instructive example is presented.

Oded Maimon, Shahar Cohen

21. Neural Networks For Data Mining

Neural networks have become standard and important tools for data mining. This chapter provides an overview of neural network models and their applications to data mining tasks. We provide historical development of the field of neural networks and present three important classes of neural models including feedforward multilayer networks, Hopfield networks, and Kohonen’s self-organizing maps. Modeling issues and applications of these models for data mining are discussed.

G. Peter Zhang

22. Granular Computing and Rough Sets - An Incremental Development

This chapter gives an overview and refinement of recent works on binary granular computing. For comparison and contrast, granulation and partition are examined in parallel from the perspective of rough set theory (RST). The key strength of RST is its capability in representing and processing knowledge in table formats. Even though such capabilities are not available for general granulation, this chapter illustrates and refines some such capability for binary granulation. In rough set theory, quotient sets, table representations, and concept hierarchy trees are all set theoretical, while in binary granulation, they are a special kind of pretopological space, which is equivalent to a binary relation. Here a pretopological space means a space that is equipped with a neighborhood system (NS). An NS is similar to the classical NS of a topological space, but without any axioms attached to it.

Tsau Young (’T. Y.’) Lin, Churn-Jung Liau

23. Pattern Clustering Using a Swarm Intelligence Approach

Clustering aims at representing large datasets by a smaller number of prototypes or clusters. It brings simplicity in modeling data and thus plays a central role in the process of knowledge discovery and data mining. Data mining tasks these days require fast and accurate partitioning of huge datasets, which may come with a variety of attributes or features. This, in turn, imposes severe computational requirements on the relevant clustering techniques. A family of bio-inspired algorithms known as Swarm Intelligence (SI) has recently emerged that meets these requirements and has successfully been applied to a number of real world clustering problems. This chapter explores the role of SI in clustering different kinds of datasets. It finally describes a new SI technique for partitioning a linearly non-separable dataset into an optimal number of clusters in the kernel-induced feature space. Computer simulations undertaken in this research have also been provided to demonstrate the effectiveness of the proposed algorithm.

Swagatam Das, Ajith Abraham

24. Using Fuzzy Logic in Data Mining

In this chapter we discuss how fuzzy logic extends the envelope of the main data mining tasks: clustering, classification, regression and association rules. We begin by presenting a formulation of data mining using fuzzy logic attributes. Then, for each task, we provide a survey of the main algorithms and a detailed description (i.e. pseudo-code) of the most popular algorithms. However, this chapter will not discuss neuro-fuzzy techniques in depth, assuming that there will be a dedicated chapter for this issue.

Lior Rokach

Supporting Methods

Frontmatter

25. Statistical Methods for Data Mining

The aim of this chapter is to present the main statistical issues in Data Mining (DM) and Knowledge Data Discovery (KDD) and to examine whether the traditional statistical approach and methods substantially differ from the new trend of KDD and DM. We address and emphasize some central issues of statistics which are highly relevant to DM and have much to offer to DM.

Yoav Benjamini, Moshe Leshno

26. Logics for Data Mining

Systems of formal (symbolic) logic suitable for Data Mining are presented, with the main emphasis on various kinds of generalized quantifiers.

Petr Hájek

27. Wavelet Methods in Data Mining

Recently there has been significant development in the use of wavelet methods in various Data Mining processes. This article presents a general overview of their applications in Data Mining. It first presents a high-level data-mining framework in which the overall process is divided into smaller components. It reviews applications of wavelets for each component. It discusses the impact of wavelets on Data Mining research and outlines potential future research directions and applications.

Tao Li, Sheng Ma, Mitsunori Ogihara

28. Fractal Mining - Self Similarity-based Clustering and its Applications

Self-similarity is the property of being invariant with respect to the scale used to look at the data set. Self-similarity can be measured using the fractal dimension. Fractal dimension is an important characteristic of many complex systems and can serve as a powerful representation technique. In this chapter, we present a new clustering algorithm, based on self-similarity properties of the data sets, and also its applications to other fields in Data Mining, such as projected clustering and trend analysis. Clustering is a widely used knowledge discovery technique. The new algorithm, which we call Fractal Clustering (FC), places points incrementally in the cluster for which the change in the fractal dimension after adding the point is the least. This is a very natural way of clustering points, since points in the same cluster have a great degree of self-similarity among them (and much less self-similarity with respect to points in other clusters). FC requires one scan of the data, is suspendable at will, providing the best answer possible at that point, and is incremental. We show via experiments that FC effectively deals with large data sets, high-dimensionality and noise and is capable of recognizing clusters of arbitrary shape.

Daniel Barbara, Ping Chen

29. Visual Analysis of Sequences Using Fractal Geometry

Sequence analysis is a challenging task in the data mining arena, relevant for many practical domains. We propose a novel method for visual analysis and classification of sequences based on an Iterated Function System (IFS). The IFS is utilized to produce a fractal representation of sequences. The proposed method offers an effective tool for visual detection of sequence patterns influencing a target attribute, and requires no understanding of mathematical or statistical algorithms. Moreover, it enables the detection of sequence patterns of any length, without predefining the sequence pattern length. It also makes it possible to visually distinguish between different sequence patterns in cases of reoccurrence of categories within a sequence. Our proposed method provides another significant added value by enabling the visual detection of rare and missing sequences per target class.

Noa Ruschin Rimini, Oded Maimon

30. Interestingness Measures - On Determining What Is Interesting

As the size of databases increases, the sheer number of patterns mined from them can easily overwhelm users of the KDD process. Users run the KDD process because they are overloaded by data. To be successful, the KDD process needs to extract interesting patterns from large masses of data. In this chapter we examine methods of tackling this challenge: how to identify interesting patterns.

Sigal Sahar

31. Quality Assessment Approaches in Data Mining

The Data Mining process encompasses many different specific techniques and algorithms that can be used to analyze the data and derive the discovered knowledge. An important problem regarding the results of the Data Mining process is the development of efficient indicators of assessing the quality of the results of the analysis. This, the quality assessment problem, is a cornerstone issue of the whole process because:

i) The analyzed data may hide interesting patterns that the Data Mining methods are called to reveal. Due to the size of the data, the requirement for automatically evaluating the validity of the extracted patterns is stronger than ever.

ii) A number of algorithms and techniques have been proposed which under different assumptions can lead to different results.

iii) The number of patterns generated during the Data Mining process is very large but only a few of these patterns are likely to be of any interest to the domain expert who is analyzing the data.

In this chapter we will introduce the main concepts and quality criteria in Data Mining. Also we will present an overview of approaches that have been proposed in the literature for evaluating the Data Mining results.

Maria Halkidi, Michalis Vazirgiannis

32. Data Mining Model Comparison

The aim of this contribution is to illustrate the role of statistical models and, more generally, of statistics, in choosing a Data Mining model. After a preliminary introduction on the distinction between Data Mining and statistics, we will focus on the issue of how to choose a Data Mining methodology. This well illustrates how statistical thinking can bring real added value to a Data Mining analysis, as otherwise it becomes rather difficult to make a reasoned choice. In the third part of the paper we will present, by means of a case study in credit risk management, how Data Mining and statistics can profitably interact.

Paolo Giudici

33. Data Mining Query Languages

Many Data Mining algorithms make it possible to extract different types of patterns from data (e.g., local patterns like itemsets and association rules, models like classifiers). To support the whole knowledge discovery process, we need integrated systems which can deal with both patterns and data. The inductive database approach has emerged as a unifying framework for such systems. Following this database perspective, knowledge discovery processes become querying processes for which query languages have to be designed. In the prolific field of association rule mining, different proposals of query languages have been made to support the more or less declarative specification of both data and pattern manipulations. In this chapter, we survey some of these proposals. This makes it possible to identify current shortcomings and to point out some promising directions of research in this area.

Jean-Francois Boulicaut, Cyrille Masson

Advanced Methods

Frontmatter

34. Mining Multi-label Data

A large body of research in supervised learning deals with the analysis of single-label data, where training examples are associated with a single label λ from a set of disjoint labels L. However, training examples in several application domains are often associated with a set of labels Y ⊆ L. Such data are called multi-label.

Textual data, such as documents and web pages, are frequently annotated with more than a single label. For example, a news article concerning the reactions of the Christian church to the release of the “Da Vinci Code” film can be labeled as both religion and movies. The categorization of textual data is perhaps the dominant multi-label application.

Grigorios Tsoumakas, Ioannis Katakis, Ioannis Vlahavas

35. Privacy in Data Mining

In this chapter we describe the main tools for privacy in data mining. We present an overview of the tools for protecting data, and then we focus on protection procedures. Information loss and disclosure risk measures are also described.

Vicenç Torra

36. Meta-Learning - Concepts and Techniques

The field of meta-learning has as one of its primary goals the understanding of the interaction between the mechanism of learning and the concrete contexts in which that mechanism is applicable. The field has seen a continuous growth in the past years with interesting new developments in the construction of practical model-selection assistants, task-adaptive learners, and a solid conceptual framework. In this chapter we give an overview of different techniques necessary to build meta-learning systems. We begin by describing an idealized meta-learning architecture comprising a variety of relevant component techniques. We then look at how each technique has been studied and implemented by previous research. In addition we show how meta-learning has already been identified as an important component in real-world applications.

Ricardo Vilalta, Christophe Giraud-Carrier, Pavel Brazdil

37. Bias vs Variance Decomposition for Regression and Classification

In this chapter, the important concepts of bias and variance are introduced. After an intuitive introduction to the bias/variance tradeoff, we discuss the bias/variance decompositions of the mean square error (in the context of regression problems) and of the mean misclassification error (in the context of classification problems). Then, we carry out a small empirical study providing some insight about how the parameters of a learning algorithm influence bias and variance.

Pierre Geurts
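For the regression case mentioned in this abstract, the decomposition of the mean square error has a standard form, sketched here under the usual assumptions rather than in the chapter's own notation. The symbols are introduced for illustration: the response is $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^{2}$, and $\hat{f}$ is the model learned from a random training sample, over which the expectations of $\hat{f}$ are taken.

```latex
\mathbb{E}\!\left[ \left( y - \hat{f}(x) \right)^{2} \right]
= \underbrace{\sigma^{2}}_{\text{noise}}
+ \underbrace{\left( \mathbb{E}\!\left[ \hat{f}(x) \right] - f(x) \right)^{2}}_{\text{bias}^{2}}
+ \underbrace{\mathbb{E}\!\left[ \left( \hat{f}(x) - \mathbb{E}\!\left[ \hat{f}(x) \right] \right)^{2} \right]}_{\text{variance}}
```

The cross terms vanish because the noise has zero mean and is independent of the training sample. No equally simple additive form is implied here for the misclassification error.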

38. Mining with Rare Cases

Rare cases are often the most interesting cases. For example, in medical diagnosis one is typically interested in identifying relatively rare diseases, such as cancer, rather than more frequently occurring ones, such as the common cold. In this chapter we discuss the role of rare cases in Data Mining. Specific problems associated with mining rare cases are discussed, followed by a description of methods for addressing these problems.

Gary M. Weiss

39. Data Stream Mining

Data mining is concerned with the process of computationally extracting hidden knowledge structures represented in models and patterns from large data repositories. It is an interdisciplinary field of study that has its roots in databases, statistics, machine learning, and data visualization. Data mining has emerged as a direct outcome of the data explosion that resulted from the success in database and data warehousing technologies over the past two decades (Fayyad, 1997, Fayyad, 1998, Kantardzic, 2003).

Mohamed Medhat Gaber, Arkady Zaslavsky, Shonali Krishnaswamy

40. Mining Concept-Drifting Data Streams

Knowledge discovery from infinite data streams is an important and difficult task. We are facing two challenges, the overwhelming volume and the concept drifts of the streaming data. In this chapter, we introduce a general framework for mining concept-drifting data streams using weighted ensemble classifiers. We train an ensemble of classification models, such as C4.5, RIPPER, naive Bayesian, etc., from sequential chunks of the data stream. The classifiers in the ensemble are judiciously weighted based on their expected classification accuracy on the test data under the time-evolving environment. Thus, the ensemble approach improves both the efficiency in learning the model and the accuracy in performing classification. Our empirical study shows that the proposed methods have substantial advantage over single-classifier approaches in prediction accuracy, and the ensemble framework is effective for a variety of classification models.

Haixun Wang, Philip S. Yu, Jiawei Han
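The weighted-ensemble idea in this abstract can be sketched as a weighted vote over the predictions of classifiers trained on different chunks of the stream. The abstract does not give the chapter's weighting formula, so the weights below are simply assumed to be accuracy estimates; the function name `weighted_vote` and the sample data are hypothetical.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight among the classifiers' votes."""
    scores = defaultdict(float)
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    return max(scores, key=scores.get)


# Hypothetical predictions from three classifiers trained on successive chunks,
# with weights standing in for their estimated accuracy on recent data.
preds = ["spam", "ham", "ham"]
weights = [0.9, 0.3, 0.4]

winner = weighted_vote(preds, weights)
print(winner)
```

In this toy case the single high-weight classifier outvotes the two low-weight ones, which an unweighted majority vote would not allow.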

41. Mining High-Dimensional Data

With the rapid growth of computational biology and e-commerce applications, high-dimensional data has become very common. Thus, mining high-dimensional data is an urgent problem of great practical importance. However, there are some unique challenges for mining data of high dimensions, including (1) the curse of dimensionality and, more crucially, (2) the meaningfulness of the similarity measure in the high-dimensional space. In this chapter, we present several state-of-the-art techniques for analyzing high-dimensional data, e.g., frequent pattern mining, clustering, and classification. We will discuss how these methods deal with the challenges of high dimensionality.

Wei Wang, Jiong Yang

42. Text Mining and Information Extraction

Text Mining is the automatic discovery of new, previously unknown information, by automatic analysis of various textual resources. Text mining starts by extracting facts and events from textual sources and then enables forming new hypotheses that are further explored by traditional Data Mining and data analysis methods. In this chapter we will define text mining and describe the three main approaches for performing information extraction. In addition, we will describe how we can visually display and analyze the outcome of the information extraction process.

Moty Ben-Dov, Ronen Feldman

43. Spatial Data Mining

Spatial Data Mining is the process of discovering interesting and previously unknown, but potentially useful patterns from large spatial datasets. Extracting interesting and useful patterns from spatial datasets is more difficult than extracting the corresponding patterns from traditional numeric and categorical data due to the complexity of spatial data types, spatial relationships, and spatial autocorrelation. This chapter provides an overview on the unique features that distinguish spatial data mining from classical Data Mining, and presents major accomplishments of spatial Data Mining research.

Shashi Shekhar, Pusheng Zhang, Yan Huang

44. Spatio-temporal clustering

Spatio-temporal clustering is a process of grouping objects based on their spatial and temporal similarity. It is a relatively new subfield of data mining which has gained high popularity, especially in geographic information sciences, due to the pervasiveness of all kinds of location-based or environmental devices that record the position, time and/or environmental properties of an object or set of objects in real time. As a consequence, different types and large amounts of spatio-temporal data have become available that introduce new challenges to data analysis and require novel approaches to knowledge discovery. In this chapter we concentrate on spatio-temporal clustering in geographic space. First, we provide a classification of different types of spatio-temporal data. Then, we focus on one type of spatio-temporal clustering, trajectory clustering, provide an overview of the state-of-the-art approaches and methods of spatio-temporal clustering, and finally present several scenarios in different application domains such as movement, cellular networks and environmental studies.

Slava Kisilevich, Florian Mansmann, Mirco Nanni, Salvatore Rinzivillo

45. Data Mining for Imbalanced Datasets: An Overview

A dataset is imbalanced if the classification categories are not approximately equally represented. Recent years brought increased interest in applying machine learning techniques to difficult “real-world” problems, many of which are characterized by imbalanced data. Additionally, the distribution of the testing data may differ from that of the training data, and the true misclassification costs may be unknown at learning time. Predictive accuracy, a popular choice for evaluating the performance of a classifier, might not be appropriate when the data is imbalanced and/or the costs of different errors vary markedly. In this chapter, we discuss some of the sampling techniques used for balancing the datasets, and the performance measures more appropriate for mining imbalanced datasets.

Nitesh V. Chawla

46. Relational Data Mining

Data Mining algorithms look for patterns in data. While most existing Data Mining approaches look for patterns in a single data table, relational Data Mining (RDM) approaches look for patterns that involve multiple tables (relations) from a relational database. In recent years, the most common types of patterns and approaches considered in Data Mining have been extended to the relational case and RDM now encompasses relational association rule discovery and relational decision tree induction, among others. RDM approaches have been successfully applied to a number of problems in a variety of areas, most notably in the area of bioinformatics. This chapter provides a brief introduction to RDM.

Sašo Džeroski

47. Web Mining

The World-Wide Web provides every internet citizen with access to an abundance of information, but it becomes increasingly difficult to identify the relevant pieces of information. Research in web mining tries to address this problem by applying techniques from data mining and machine learning to Web data and documents. This chapter provides a brief overview of web mining techniques and research areas, most notably hypertext classification, wrapper induction, recommender systems and web usage mining.

Johannes Fürnkranz

48. A Review of Web Document Clustering Approaches

Nowadays, the Internet has become the largest data repository, and it faces the problem of information overload. The web search environment, however, is not ideal. The existence of an abundance of information, in combination with the dynamic and heterogeneous nature of the Web, makes information retrieval a difficult process for the average user. The development of techniques that can help users effectively organize and browse the available information, with the ultimate goal of satisfying their information need, is therefore a valid requirement. Cluster analysis, which deals with the organization of a collection of objects into cohesive groups, can play a very important role in achieving this objective. In this chapter, we present an exhaustive survey of web document clustering approaches available in the literature, classified into three main categories: text-based, link-based and hybrid. Furthermore, we present a thorough comparison of the algorithms based on the various facets of their features and functionality. Finally, based on the review of the different approaches, we conclude that although clustering has been a topic for the scientific community for three decades, there are still many open issues that call for more research.

Nora Oikonomakou, Michalis Vazirgiannis

49. Causal Discovery

Many algorithms have been proposed for learning a causal network from data. It has been shown, however, that learning all the conditional independencies in a probability distribution is an NP-hard problem. In this chapter, we present an alternative method for learning a causal network from data. Our approach is novel in that it learns functional dependencies in the sample distribution rather than probabilistic independencies. Our method is based on the fact that functional dependency logically implies probabilistic conditional independence. The effectiveness of the proposed approach is explicitly demonstrated using fifteen real-world datasets.

Hong Yao, Cory J. Butz, Howard J. Hamilton

50. Ensemble Methods in Supervised Learning

The idea of ensemble methodology is to build a predictive model by integrating multiple models. It is well-known that ensemble methods can be used for improving prediction performance. In this chapter we provide an overview of ensemble methods in classification tasks. We present all important types of ensemble methods including boosting and bagging. Combining methods and modeling issues such as ensemble diversity and ensemble size are discussed.

Lior Rokach

51. Data Mining using Decomposition Methods

The idea of decomposition methodology is to break down a complex Data Mining task into several smaller, less complex and more manageable sub-tasks that are solvable by using existing tools, then to join their solutions together in order to solve the original problem. In this chapter we provide an overview of decomposition methods in classification tasks with emphasis on elementary decomposition methods. We present the main properties that characterize various decomposition frameworks and the advantages of using these frameworks. Finally, we discuss the uniqueness of decomposition methodology as opposed to other closely related fields, such as ensemble methods and distributed data mining.

Lior Rokach, Oded Maimon

52. Information Fusion - Methods and Aggregation Operators

Information fusion techniques are commonly applied in Data Mining and Knowledge Discovery. In this chapter, we give an overview of such applications, considering their three main uses. That is, we consider fusion methods for data preprocessing, model building and information extraction. Some aggregation operators (i.e. particular fusion methods) and their properties are briefly described as well.

Vicenç Torra

53. Parallel and Grid-Based Data Mining – Algorithms, Models and Systems for High-Performance KDD

Data Mining is often a compute-intensive and time-consuming process. For this reason, several Data Mining systems have been implemented on parallel computing platforms to achieve high performance in the analysis of large data sets. Moreover, when large data repositories are coupled with geographical distribution of data, users and systems, more sophisticated technologies are needed to implement high-performance distributed KDD systems. Since computational Grids emerged as privileged platforms for distributed computing, a growing number of Grid-based KDD systems have been proposed. In this chapter we first discuss different ways to exploit parallelism in the main Data Mining techniques and algorithms, then we discuss Grid-based KDD systems. Finally, we introduce the Knowledge Grid, an environment which makes use of standard Grid middleware to support the development of parallel and distributed knowledge discovery applications.

Antonio Congiusta, Domenico Talia, Paolo Trunfio

54. Collaborative Data Mining

Collaborative Data Mining is a setting where the Data Mining effort is distributed to multiple collaborating agents – human or software. The objective of the collaborative Data Mining effort is to produce solutions to the tackled Data Mining problem which are considered better by some metric, with respect to those solutions that would have been achieved by individual, non-collaborating agents. The solutions require evaluation, comparison, and approaches for combination. Collaboration requires communication, and implies some form of community. The human form of collaboration is a social task. Organizing communities in an effective manner is non-trivial and often requires well defined roles and processes. Data Mining, too, benefits from a standard process. This chapter explores the standard Data Mining process CRISP-DM utilized in a collaborative setting.

Steve Moyle

55. Organizational Data Mining

Many organizations today possess substantial quantities of business information but have very little real business knowledge. A recent survey of 450 business executives reported that managerial intuition and instinct are more prevalent than hard facts in driving organizational decisions. To reverse this trend, businesses of all sizes would be well advised to adopt Organizational Data Mining (ODM). ODM is defined as leveraging Data Mining tools and technologies to enhance the decision-making process by transforming data into valuable and actionable knowledge to gain a competitive advantage. ODM has helped many organizations optimize internal resource allocations while better understanding and responding to the needs of their customers. The fundamental aspects of ODM can be categorized into Artificial Intelligence (AI), Information Technology (IT), and Organizational Theory (OT), with OT being the key distinction between ODM and Data Mining. In this chapter, we introduce ODM, explain its unique characteristics, and report on the current status of ODM research. Next we illustrate how several leading organizations have adopted ODM and are benefiting from it. Then we examine the evolution of ODM to the present day and conclude our chapter by contemplating ODM’s challenging yet opportunistic future.

Hamid R. Nemati, Christopher D. Barko

56. Mining Time Series Data

Much of the world’s supply of data is in the form of time series. In the last decade, there has been an explosion of interest in mining time series data. A number of new algorithms have been introduced to classify, cluster, segment, index, discover rules, and detect anomalies/novelties in time series. While the many techniques used to solve these problems differ widely, they all have one common factor: they require some high-level representation of the data, rather than the original raw data. These high-level representations are necessary as a feature extraction step, or simply to make the storage, transmission, and computation of massive datasets feasible. A multitude of representations have been proposed in the literature, including spectral transforms, wavelet transforms, piecewise polynomials, eigenfunctions, and symbolic mappings. This chapter gives a high-level survey of time series Data Mining tasks, with an emphasis on time series representations.

Chotirat Ann Ratanamahatana, Jessica Lin, Dimitrios Gunopulos, Eamonn Keogh, Michail Vlachos, Gautam Das

Applications

Frontmatter

57. Multimedia Data Mining

Zhongfei (Mark) Zhang, Ruofei Zhang

58. Data Mining in Medicine

Extensive amounts of data stored in medical databases require the development of specialized tools for accessing the data, data analysis, knowledge discovery, and effective use of stored knowledge and data. This chapter focuses on Data Mining methods and tools for knowledge discovery. The chapter sketches the selected Data Mining techniques, and illustrates their applicability to medical diagnostic and prognostic problems.

Nada Lavrač, Blaž Zupan

59. Learning Information Patterns in Biological Databases - Stochastic Data Mining

This chapter aims at developing the computational theory for modeling patterns and their hierarchical coordination within biological sequences. With the exception of the promoters and enhancers, the functional significance of the non-coding DNA is not well understood. Scientists are now discovering that specific regions of non-coding DNA interact with the cellular machinery and help bring about the expression of genes. Our premise is that it is possible to study the arrangements of patterns in biological sequences through machine learning algorithms. As biological databases continue their exponential growth, it becomes feasible to apply in-silico Data Mining algorithms to discover interesting patterns of motif arrangements and the frequency of their re-iteration. A systematic procedure for achieving this goal is presented.

Gautam B. Singh

60. Data Mining for Financial Applications

This chapter describes Data Mining in finance by discussing financial tasks, specifics of methodologies and techniques in this Data Mining area. It includes time dependence, data selection, forecast horizon, measures of success, quality of patterns, hypothesis evaluation, problem ID, method profile, attribute-based and relational methodologies. The second part of the chapter discusses Data Mining models and practice in finance. It covers use of neural networks in portfolio management, design of interpretable trading rules and discovering money laundering schemes using decision rules and relational Data Mining methodology.

Boris Kovalerchuk, Evgenii Vityaev

61. Data Mining for Intrusion Detection

Data Mining techniques have been successfully applied in many different fields including marketing, manufacturing, fraud detection and network management. Over the past years there has been a lot of interest in security technologies such as intrusion detection, cryptography, authentication and firewalls. This chapter discusses the application of Data Mining techniques to computer security. Conclusions are drawn and directions for future research are suggested.

Anoop Singhal, Sushil Jajodia

62. Data Mining for CRM

Data Mining technology allows marketing organizations to better understand their customers and respond to their needs. This chapter describes how Data Mining can be combined with customer relationship management to help drive improved interactions with customers. An example showing how to use Data Mining to drive customer acquisition activities is presented.

Kurt Thearling

63. Data Mining for Target Marketing

Targeting is the core of marketing management. It is concerned with offering the right product/service to the customer at the right time and using the proper channel. In this chapter we discuss how Data Mining modeling and analysis can support targeting applications. We focus on three types of targeting models: continuous-choice models, discrete-choice models and in-market timing models, discussing alternative modeling for each application and decision making. We also discuss a range of pitfalls that one needs to be aware of in implementing a data mining solution for a targeting problem.

Nissan Levin, Jacob Zahavi

64. NHECD - Nano Health and Environmental Commented Database

The impact of nanoparticles on health and the environment is a significant research subject, driving increasing interest from the scientific community, regulatory bodies and the general public. We present a smart repository system with text and data mining for this domain. The growing body of knowledge in this area, consisting of scientific papers and other types of publications (such as surveys and whitepapers), emphasizes the need for a methodology to alleviate the complexity of reviewing all the available information and discovering all the underlying facts, using data mining algorithms and methods.

The European Commission-funded project NHECD (whose full name is “Creation of a critical and commented database on the health, safety and environmental impact of nanoparticles”) converts the unstructured body of knowledge produced by the different groups of users (such as researchers and regulators) into a repository of scientific papers and reviews augmented by layers of information extracted from the papers. Towards this end we use taxonomies built by domain experts and metadata, using advanced methodologies. We implement algorithms for textual information extraction, graph mining and table information extraction. Rating and relevance assessment of the papers are also part of the system. The project is composed of two major layers: a backend consisting of all the above taxonomies, algorithms and methods, and a frontend consisting of a query and navigation system. The frontend has a web interface that addresses the needs (and knowledge) of the different user groups. Documentum, a content management system (CMS), is the backbone of the backend process component. The frontend is a customized application built using an open source CMS. It is designed to take advantage of the taxonomies and metadata for search and navigation, while allowing the user to query the system, taking advantage of the extracted information.

Oded Maimon, Abel Browarnik

Software

Frontmatter

65. Commercial Data Mining Software

This chapter discusses selected commercial software for data mining, supercomputing data mining, text mining, and web mining. The selected software packages are compared by their features and also applied to available data sets. The software packages for data mining are SAS Enterprise Miner, Megaputer PolyAnalyst 5.0, PASW (formerly SPSS Clementine), IBM Intelligent Miner, and BioDiscovery GeneSight. The software packages for supercomputing are Avizo by Visualization Science Group and JMP Genomics from SAS Institute. The software packages for text mining are SAS Text Miner and Megaputer PolyAnalyst 5.0. The software packages for web mining are Megaputer PolyAnalyst and SPSS Clementine. Background on related literature and software is presented. Screen shots of each of the selected software packages are presented, as are conclusions and future directions.

Qingyu Zhang, Richard S. Segall

66. Weka-A Machine Learning Workbench for Data Mining

The Weka workbench is an organized collection of state-of-the-art machine learning algorithms and data preprocessing tools. The basic way of interacting with these methods is by invoking them from the command line. However, convenient interactive graphical user interfaces are provided for data exploration, for setting up large-scale experiments on distributed computing platforms, and for designing configurations for streamed data processing. These interfaces constitute an advanced environment for experimental data mining. The system is written in Java and distributed under the terms of the GNU General Public License.

Eibe Frank, Mark Hall, Geoffrey Holmes, Richard Kirkby, Bernhard Pfahringer, Ian H. Witten, Len Trigg

Backmatter

Title: Data Mining and Knowledge Discovery Handbook
Edited by: Oded Maimon, Lior Rokach
Publisher: Springer US
Electronic ISBN: 978-0-387-09823-4
Print ISBN: 978-0-387-09822-7
DOI: https://doi.org/10.1007/978-0-387-09823-4