2005 | Book

Data Mining and Knowledge Discovery Handbook

About this book

Data Mining and Knowledge Discovery Handbook organizes all major concepts, theories, methodologies, trends, challenges and applications of data mining (DM) and knowledge discovery in databases (KDD) into a coherent and unified repository.

The book first surveys the field and then provides comprehensive yet concise algorithmic descriptions of the methods, covering classic techniques as well as recently developed extensions and novel methods. The volume concludes with in-depth descriptions of data mining applications in various interdisciplinary industries, including finance, marketing, medicine, biology, engineering, telecommunications, software, and security.

Data Mining and Knowledge Discovery Handbook is designed for research scientists and graduate-level students in computer science and engineering. This book is also suitable for professionals in fields such as computing applications, information systems management, and strategic research management.

Table of Contents

Frontmatter

Introduction to Knowledge Discovery in Databases

Chapter 1. Introduction to Knowledge Discovery in Databases
Oded Maimon, Lior Rokach

Preprocessing Methods

Chapter 2. Data Cleansing

This chapter analyzes the problem of data cleansing and the identification of potential errors in data sets. The differing views of data cleansing are surveyed, and a brief overview of existing data cleansing tools is given. A general framework of the data cleansing process is presented, as well as a set of general methods that can be used to address the problem. The applicable methods include statistical outlier detection, pattern matching, clustering, and Data Mining techniques. Experimental results of applying these methods to a real-world data set are also given. Finally, research directions necessary to further address the data cleansing problem are discussed.

Jonathan I. Maletic, Andrian Marcus
Chapter 3. Handling Missing Attribute Values

In this chapter, methods of handling missing attribute values in Data Mining are described. These methods are categorized into sequential and parallel. In sequential methods, missing attribute values are first replaced by known values as a preprocessing step, and knowledge is then acquired from a data set in which all attribute values are known. In parallel methods there is no preprocessing, i.e., knowledge is acquired directly from the original data sets. The main emphasis is put on rule induction; methods of handling attribute values for decision tree generation are only briefly summarized.

Jerzy W. Grzymala-Busse, Witold J. Grzymala-Busse
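The sequential (preprocessing) approach described above can be sketched with the simplest imputation rule, replacing each missing value with the most common known value of the same attribute. This is a minimal illustration, not the chapter's own algorithm, and the data and function name are hypothetical:

```python
from collections import Counter

def fill_most_common(rows, missing="?"):
    """Sequential handling of missing values: replace each missing value
    with the most common known value of its attribute (column), so that
    rule induction can then run on a fully specified data set."""
    cols = list(zip(*rows))
    fills = []
    for col in cols:
        known = [v for v in col if v != missing]
        fills.append(Counter(known).most_common(1)[0][0])
    return [
        [fills[j] if v == missing else v for j, v in enumerate(row)]
        for row in rows
    ]

data = [
    ["sunny", "hot", "?"],
    ["sunny", "?", "high"],
    ["rain",  "hot", "high"],
]
completed = fill_most_common(data)
```

Each "?" is filled by that column's mode, after which every attribute value is known.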
Chapter 4. Geometric Methods for Feature Extraction and Dimensional Reduction

We give a tutorial overview of several geometric methods for feature extraction and dimensional reduction. We divide the methods into projective methods and methods that model the manifold on which the data lies. For projective methods, we review projection pursuit, principal component analysis (PCA), kernel PCA, probabilistic PCA, and oriented PCA; and for the manifold methods, we review multidimensional scaling (MDS), landmark MDS, Isomap, locally linear embedding, Laplacian eigenmaps and spectral clustering. The Nyström method, which links several of the algorithms, is also reviewed. The goal is to provide a self-contained review of the concepts and mathematics underlying these algorithms.

Christopher J. C. Burges
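As a small concrete companion to the projective methods surveyed above, PCA in two dimensions has a closed form: the leading eigenvector of the covariance matrix [[a, b], [b, c]] lies at angle 0.5 * atan2(2b, a - c). A minimal sketch (the function name and data are illustrative):

```python
import math

def first_pc_2d(points):
    """Closed-form first principal component of 2-D data: the direction
    of maximum variance, i.e. the leading eigenvector of the 2x2
    covariance matrix [[a, b], [b, c]]."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    a = sum((x - mx) ** 2 for x, _ in points) / n
    c = sum((y - my) ** 2 for _, y in points) / n
    b = sum((x - mx) * (y - my) for x, y in points) / n
    theta = 0.5 * math.atan2(2 * b, a - c)
    return (math.cos(theta), math.sin(theta))

# Points along the line y = x: the first PC should be (1, 1) / sqrt(2).
pc = first_pc_2d([(0, 0), (1, 1), (2, 2), (3, 3)])
```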
Chapter 5. Dimension Reduction and Feature Selection

Data Mining algorithms search for meaningful patterns in raw data sets, and the Data Mining process incurs a high computational cost when dealing with large data sets. Reducing dimensionality (the number of attributes or the number of records) can effectively cut this cost. This chapter focuses on a pre-processing step that removes dimensions from a given data set before it is fed to a Data Mining algorithm. It explains how it is often possible to reduce dimensionality with minimal loss of information. A clear taxonomy of dimension reduction is described, and techniques for dimension reduction are presented theoretically.

Barak Chizi, Oded Maimon
Chapter 6. Discretization Methods

Data-mining applications often involve quantitative data. However, learning from quantitative data is often less effective and less efficient than learning from qualitative data. Discretization addresses this issue by transforming quantitative data into qualitative data. This chapter presents a comprehensive introduction to discretization. It clarifies the definition of discretization. It provides a taxonomy of discretization methods together with a survey of major discretization methods. It also discusses issues that affect the design and application of discretization methods.

Ying Yang, Geoffrey I. Webb, Xindong Wu
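The simplest unsupervised discretization the survey above covers is equal-width binning: split the observed range into k intervals of identical width and map each quantitative value to its interval index. A minimal sketch under that assumption (the data values are illustrative):

```python
def equal_width(values, k):
    """Equal-width discretization: transform quantitative values into
    qualitative bin indices 0..k-1 by cutting the observed range into
    k intervals of identical width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = []
    for v in values:
        b = int((v - lo) / width)
        bins.append(min(b, k - 1))  # the maximum value falls in the last bin
    return bins

codes = equal_width([1.0, 2.0, 5.0, 9.0, 10.0], 3)
```

With range [1, 10] and k = 3, the cut points are 4 and 7, so the values map to bins [0, 0, 1, 2, 2].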
Chapter 7. Outlier Detection

Outlier detection is a primary step in many data-mining applications. We present several methods for outlier detection, distinguishing between univariate and multivariate techniques and between parametric and nonparametric procedures. In the presence of outliers, special attention should be paid to the robustness of the estimators used. Outlier detection for Data Mining is often based on distance measures, clustering, and spatial methods.

Irad Ben-Gal
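A parametric univariate procedure of the kind distinguished above can be sketched with the classic z-score rule: flag points lying more than a chosen number of sample standard deviations from the mean. The threshold of 2.0 and the data are illustrative choices, not the chapter's:

```python
import statistics

def z_score_outliers(xs, threshold=2.0):
    """Parametric univariate outlier detection: flag points whose
    distance from the sample mean exceeds `threshold` sample standard
    deviations (the threshold is a common but arbitrary choice)."""
    mu = statistics.mean(xs)
    sd = statistics.stdev(xs)
    return [x for x in xs if abs(x - mu) / sd > threshold]

flagged = z_score_outliers([10, 11, 9, 10, 12, 11, 10, 50])
```

Note the non-robustness the chapter warns about: the outlier 50 itself inflates both the mean and the standard deviation, which is why robust estimators matter when contamination is heavy.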

Supervised Methods

Chapter 8. Introduction to Supervised Methods

This chapter summarizes the fundamental aspects of supervised methods. The chapter provides an overview of concepts from various interrelated fields used in subsequent chapters. It presents basic definitions and arguments from the supervised machine learning literature and considers various issues, such as performance evaluation techniques and challenges for data mining tasks.

Oded Maimon, Lior Rokach
Chapter 9. Decision Trees

Decision Trees are considered to be one of the most popular approaches for representing classifiers. Researchers from various disciplines such as statistics, machine learning, pattern recognition, and Data Mining have dealt with the issue of growing a decision tree from available data. This paper presents an updated survey of current methods for constructing decision tree classifiers in a top-down manner. The chapter suggests a unified algorithmic framework for presenting these algorithms and describes various splitting criteria and pruning methodologies.

Lior Rokach, Oded Maimon
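At the heart of the top-down framework described above is a splitting criterion. A minimal sketch of one such criterion, Gini impurity reduction (the helper names and toy labels are illustrative):

```python
def gini(labels):
    """Gini impurity of a label multiset: 1 - sum of squared class
    proportions; 0 for a pure node."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_gain(parent, splits):
    """Impurity reduction of a candidate split: the quantity a top-down
    tree grower maximizes when choosing the attribute to split on."""
    n = len(parent)
    weighted = sum(len(s) / n * gini(s) for s in splits)
    return gini(parent) - weighted

parent = ["yes", "yes", "no", "no"]
perfect = gini_gain(parent, [["yes", "yes"], ["no", "no"]])   # separates classes
useless = gini_gain(parent, [["yes", "no"], ["yes", "no"]])   # no information
```

A perfect split recovers the full parent impurity of 0.5; a split that leaves both children mixed gains nothing.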
Chapter 10. Bayesian Networks

Bayesian networks are today one of the most promising approaches to Data Mining and knowledge discovery in databases. This chapter reviews the fundamental aspects of Bayesian networks and some of their technical aspects, with a particular emphasis on the methods to induce Bayesian networks from different types of data. Basic notions are illustrated through the detailed descriptions of two Bayesian network applications: one to survey data and one to marketing data.

Paola Sebastiani, Maria M. Abad, Marco F. Ramoni
Chapter 11. Data Mining within a Regression Framework

Regression analysis can imply a far wider range of statistical procedures than often appreciated. In this chapter, a number of common Data Mining procedures are discussed within a regression framework. These include non-parametric smoothers, classification and regression trees, bagging, and random forests. In each case, the goal is to characterize one or more of the distributional features of a response conditional on a set of predictors.

Richard A. Berk
Chapter 12. Support Vector Machines

Support Vector Machines (SVMs) are a set of related methods for supervised learning, applicable to both classification and regression problems. An SVM classifier creates a maximum-margin hyperplane that lies in a transformed input space and splits the example classes, while maximizing the distance to the nearest cleanly split examples. The parameters of the solution hyperplane are derived from a quadratic programming optimization problem. Here, we provide several formulations and discuss some key concepts.

Armin Shmilovici
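The maximum-margin idea can be sketched without a full quadratic programming solver by minimizing the primal soft-margin (hinge-loss) objective with subgradient descent. This is a stand-in for the QP formulation the chapter derives; the learning rate, epochs, and toy data are illustrative:

```python
def train_linear_svm(xs, ys, lam=0.01, lr=0.1, epochs=200):
    """Primal soft-margin linear SVM via subgradient descent on the
    regularized hinge loss (a simple stand-in for the quadratic
    programming solution)."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(xs, ys):
            if y * (w[0] * x1 + w[1] * x2 + b) < 1:
                # Point is inside the margin: push the hyperplane away.
                w[0] += lr * (y * x1 - lam * w[0])
                w[1] += lr * (y * x2 - lam * w[1])
                b += lr * y
            else:
                # Correctly classified with margin: only shrink (regularize) w.
                w[0] -= lr * lam * w[0]
                w[1] -= lr * lam * w[1]
    return w, b

def predict(w, b, x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b >= 0 else -1

xs = [(2, 2), (3, 3), (-2, -2), (-3, -3)]
ys = [1, 1, -1, -1]
w, b = train_linear_svm(xs, ys)
```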
Chapter 13. Rule Induction

This chapter begins with a brief discussion of some problems associated with input data. Then different rule types are defined. Three representative rule induction methods: LEM1, LEM2, and AQ are presented. An idea of a classification system, where rule sets are utilized to classify new cases, is introduced. Methods to evaluate an error rate associated with classification of unseen cases using the rule set are described. Finally, some more advanced methods are listed.

Jerzy W. Grzymala-Busse

Unsupervised Methods

Chapter 14. Visualization and Data Mining for High Dimensional Datasets

Visualization provides insight through images and can be considered as a collection of application-specific mappings: ProblemDomain → VisualRange. For the visualization of multivariate problems, a multidimensional system of parallel coordinates (abbr. ∥-coords) is constructed which induces a one-to-one mapping between subsets of N-space and subsets of 2-space. The result is a rigorous methodology for doing and seeing N-dimensional geometry. Starting with an overview of the mathematical foundations, it is seen that the display of high-dimensional datasets and the search for multivariate relations among the variables is transformed into a 2-D pattern recognition problem. This is the basis for the application to Visual Data Mining, which is illustrated with a real dataset of VLSI (Very Large Scale Integration — “chip”) production. Then a recent geometric classifier is presented and applied to 3 real datasets; compared to those of 23 other classifiers, its results have the least error. The algorithm has quadratic computational complexity in the size and number of parameters, provides comprehensible and explicit rules, performs dimensionality selection (finding the minimal set of original variables required to state the rule), and orders these variables so as to optimize the clarity of separation between the designated set and its complement. Finally, a simple visual economic model of a real country is constructed and analyzed in order to illustrate the special strength of ∥-coords in modeling multivariate relations by means of hypersurfaces.

Alfred Inselberg
Chapter 15. Clustering Methods

This chapter presents a tutorial overview of the main clustering methods used in Data Mining. The goal is to provide a self-contained review of the concepts and the mathematics underlying clustering techniques. The chapter begins by providing measures and criteria that are used for determining whether two objects are similar or dissimilar. Then the clustering methods are presented, divided into: hierarchical, partitioning, density-based, model-based, grid-based, and soft-computing methods. Following the methods, the challenges of performing clustering in large data sets are discussed. Finally, the chapter presents how to determine the number of clusters.

Lior Rokach, Oded Maimon
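The canonical partitioning method in the taxonomy above is k-means: alternate between assigning each point to its nearest centroid and recomputing centroids as cluster means. A minimal Lloyd-style sketch on 1-D data for brevity (the data and seed are illustrative):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd-style k-means on 1-D data: repeat (1) assign each
    point to the nearest centroid, (2) move each centroid to the mean
    of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: abs(p - centroids[j]))
            clusters[i].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

centers = kmeans([1.0, 1.2, 0.8, 10.0, 10.2, 9.8], 2)
```

Two well-separated groups around 1 and 10 are recovered regardless of which points are drawn as initial centroids.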
Chapter 16. Association Rules

Association rules are rules of the kind “70% of the customers who buy wine and cheese also buy grapes”. While the traditional field of application is market basket analysis, association rule mining has since been applied to various fields, which has led to a number of important modifications and extensions. We discuss the most frequently applied approach that is central to many extensions, the Apriori algorithm, and briefly review some applications to other data types, well-known problems of rule evaluation via support and confidence, and extensions of or alternatives to the standard framework.

Frank Höppner
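The levelwise search behind Apriori rests on one pruning fact: a k-itemset can only be frequent if every one of its (k-1)-subsets is frequent. A compact sketch of frequent-set mining under that principle (the basket data is illustrative):

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Levelwise frequent-set mining: each pass joins surviving
    (k-1)-itemsets into k-candidates, prunes any candidate with an
    infrequent subset, then counts support against the transactions."""
    items = sorted({i for t in transactions for i in t})

    def support(itemset):
        return sum(1 for t in transactions if set(itemset) <= t)

    frequent = {}
    level = [(i,) for i in items if support((i,)) >= minsup]
    k = 1
    while level:
        frequent.update({s: support(s) for s in level})
        k += 1
        candidates = {tuple(sorted(set(a) | set(b)))
                      for a in level for b in level
                      if len(set(a) | set(b)) == k}
        level = [c for c in candidates
                 if all(s in frequent for s in combinations(c, k - 1))
                 and support(c) >= minsup]
    return frequent

baskets = [{"wine", "cheese", "grapes"},
           {"wine", "cheese"},
           {"cheese", "grapes"},
           {"wine", "cheese", "grapes"}]
freq = apriori(baskets, minsup=3)
```

With minimum support 3, {wine, grapes} is infrequent (support 2), so the candidate {wine, cheese, grapes} is pruned without ever counting its support.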
Chapter 17. Frequent Set Mining

Frequent sets lie at the basis of many Data Mining algorithms. As a result, hundreds of algorithms have been proposed in order to solve the frequent set mining problem. In this chapter, we attempt to survey the most successful algorithms and techniques that try to solve this problem efficiently.

Bart Goethals
Chapter 18. Constraint-Based Data Mining

Knowledge Discovery in Databases (KDD) is a complex interactive process. The promising theoretical framework of inductive databases considers it to be essentially a querying process, enabled by a query language that can deal with both raw data and the patterns which hold in the data. Mining patterns turns out to be the so-called inductive query evaluation process, for which constraint-based Data Mining techniques have to be designed. An inductive query declaratively specifies the desired constraints, and algorithms are used to compute the patterns satisfying the constraints in the data. We survey important results of this active research domain. This chapter emphasizes a real breakthrough for hard problems concerning local pattern mining under various constraints, and it points out the current directions of research as well.

Jean-Francois Boulicaut, Baptiste Jeudy
Chapter 19. Link Analysis

Link analysis is a collection of techniques that operate on data that can be represented as nodes and links. This chapter surveys a variety of techniques including subgraph matching, finding cliques and K-plexes, maximizing spread of influence, visualization, finding hubs and authorities, and combining with traditional techniques (classification, clustering, etc). It also surveys applications including social network analysis, viral marketing, Internet search, fraud detection, and crime prevention.

Steve Donoho

Soft Computing Methods

Chapter 20. Evolutionary Algorithms for Data Mining

Evolutionary Algorithms (EAs) are stochastic search algorithms inspired by the process of Darwinian evolution. The motivation for applying EAs to Data Mining is that they are robust, adaptive search techniques that perform a global search in the solution space. This chapter reviews mainly two kinds of EAs, viz. Genetic Algorithms (GAs) and Genetic Programming (GP), and discusses how EAs can be applied to several Data Mining tasks, namely: discovery of classification rules, clustering, attribute selection and attribute construction. It also discusses the basic idea of Multi-Objective EAs, based on the concept of Pareto dominance, which also has applications in Data Mining.

Alex A. Freitas
Chapter 21. Reinforcement-Learning: An Overview from a Data Mining Perspective

Reinforcement-Learning is learning how to best-react to situations, through trial and error. In the Machine-Learning community Reinforcement-Learning is researched with respect to artificial (machine) decision-makers, referred to as agents. The agents are assumed to be situated within an environment which behaves as a Markov Decision Process. This chapter provides a brief introduction to Reinforcement-Learning, and establishes its relation to Data-Mining. Specifically, the Reinforcement-Learning problem is defined; a few key ideas for solving it are described; the relevance to Data-Mining is explained; and an instructive example is presented.

Shahar Cohen, Oded Maimon
Chapter 22. Neural Networks

Neural networks have become standard and important tools for data mining. This chapter provides an overview of neural network models and their applications to Data Mining tasks. We trace the historical development of the field of neural networks and present three important classes of neural models: feedforward multilayer networks, Hopfield networks, and Kohonen’s self-organizing maps. Modeling issues and applications of these models for Data Mining are discussed.

Peter G. Zhang
Chapter 23. On the Use of Fuzzy Logic in Data Mining

In this chapter we describe some basic concepts from fuzzy logic and their applicability to Data Mining. First we discuss some basic terms from fuzzy set theory and fuzzy logic. Then we provide examples that show how fuzzy sets and fuzzy logic can best be applied to discover knowledge from a given database.

Joseph Komem, Moti Schneider
Chapter 24. Granular Computing and Rough Sets

This chapter gives an overview and refinement of recent works on binary granular computing. For comparison and contrast, granulation and partition are examined in parallel from the perspective of rough set theory (RST). The key strength of RST is its capability in representing and processing knowledge in table formats. Although such capabilities are not available for general granulation, this chapter illustrates and refines some such capability for binary granulation. In rough set theory, quotient sets, table representations, and concept hierarchy trees are all set-theoretical, while in binary granulation they are a special kind of pretopological space, which is equivalent to a binary relation. Here a pretopological space means a space that is equipped with a neighborhood system (NS); an NS is similar to the classical NS of a topological space, but without any axioms attached to it.

Tsau Young (’T. Y.’) Lin, Churn-Jung Liau

Supporting Methods

Chapter 25. Statistical Methods for Data Mining

The aim of this chapter is to present the main statistical issues in Data Mining (DM) and Knowledge Discovery in Databases (KDD) and to examine whether traditional statistical approaches and methods differ substantially from the new trend of KDD and DM. We address and emphasize some central issues of statistics which are highly relevant to DM and have much to offer it.

Yoav Benjamini, Moshe Leshno
Chapter 26. Logics for Data Mining

Systems of formal (symbolic) logic suitable for Data Mining are presented, with the main stress put on various kinds of generalized quantifiers.

Petr Hájek
Chapter 27. Wavelet Methods in Data Mining

Recently there has been significant development in the use of wavelet methods in various Data Mining processes. This chapter presents a general overview of their applications in Data Mining. It first presents a high-level data-mining framework in which the overall process is divided into smaller components, then reviews applications of wavelets for each component. It discusses the impact of wavelets on Data Mining research and outlines potential future research directions and applications.

Tao Li, Sheng Ma, Mitsunori Ogihara
Chapter 28. Fractal Mining

Self-similarity is the property of being invariant with respect to the scale used to look at the data set. Self-similarity can be measured using the fractal dimension. Fractal dimension is an important characteristic of many complex systems and can serve as a powerful representation technique. In this chapter, we present a new clustering algorithm based on the self-similarity properties of the data sets, and discuss its applications to other fields in Data Mining, such as projected clustering and trend analysis. Clustering is a widely used knowledge discovery technique. The new algorithm, which we call Fractal Clustering (FC), places points incrementally in the cluster for which the change in the fractal dimension after adding the point is the least. This is a very natural way of clustering points, since points in the same cluster have a great degree of self-similarity among them (and much less self-similarity with respect to points in other clusters). FC requires one scan of the data, is suspendable at will (providing the best answer possible at that point), and is incremental. We show via experiments that FC effectively deals with large data sets, high dimensionality, and noise, and is capable of recognizing clusters of arbitrary shape.

Daniel Barbara, Ping Chen
Chapter 29. Interesting Measures

As the size of databases increases, the sheer number of patterns mined from them can easily overwhelm users of the KDD process. Users run the KDD process because they are overloaded by data. To be successful, the KDD process needs to extract interesting patterns from large masses of data. In this chapter we examine methods of tackling this challenge: how to identify interesting patterns.

Sigal Sahar
Chapter 30. Quality Assessment Approaches in Data Mining

The Data Mining process encompasses many different specific techniques and algorithms that can be used to analyze the data and derive the discovered knowledge. An important problem regarding the results of the Data Mining process is the development of efficient indicators for assessing the quality of the results of the analysis. This, the quality assessment problem, is a cornerstone issue of the whole process because: i) the analyzed data may hide interesting patterns that the Data Mining methods are called upon to reveal, and due to the size of the data, the requirement for automatically evaluating the validity of the extracted patterns is stronger than ever; ii) a number of algorithms and techniques have been proposed which, under different assumptions, can lead to different results; iii) the number of patterns generated during the Data Mining process is very large, but only a few of these patterns are likely to be of any interest to the domain expert who is analyzing the data. In this chapter we introduce the main concepts and quality criteria in Data Mining and present an overview of approaches that have been proposed in the literature for evaluating Data Mining results.

Maria Halkidi, Michalis Vazirgiannis
Chapter 31. Data Mining Model Comparison

The aim of this contribution is to illustrate the role of statistical models and, more generally, of statistics in choosing a Data Mining model. After a preliminary introduction on the distinction between Data Mining and statistics, we focus on the issue of how to choose a Data Mining methodology. This illustrates how statistical thinking can bring real added value to a Data Mining analysis; without it, making a reasoned choice becomes rather difficult. In the third part of the chapter we present, by means of a case study in credit risk management, how Data Mining and statistics can profitably interact.

Paolo Giudici
Chapter 32. Data Mining Query Languages

Many Data Mining algorithms make it possible to extract different types of patterns from data (e.g., local patterns like itemsets and association rules, or models like classifiers). To support the whole knowledge discovery process, we need integrated systems that can deal with both patterns and data. The inductive database approach has emerged as a unifying framework for such systems. Following this database perspective, knowledge discovery processes become querying processes for which query languages have to be designed. In the prolific field of association rule mining, different query languages have been proposed to support the more or less declarative specification of both data and pattern manipulations. In this chapter we survey some of these proposals, which makes it possible to identify current shortcomings and to point out promising directions of research in this area.

Jean-Francois Boulicaut, Cyrille Masson

Advanced Methods

Chapter 33. Meta-Learning

The field of meta-learning has as one of its primary goals the understanding of the interaction between the mechanism of learning and the concrete contexts in which that mechanism is applicable. The field has seen a continuous growth in the past years with interesting new developments in the construction of practical model-selection assistants, task-adaptive learners, and a solid conceptual framework. In this chapter we give an overview of different techniques necessary to build meta-learning systems. We begin by describing an idealized meta-learning architecture comprising a variety of relevant component techniques. We then look at how each technique has been studied and implemented by previous research. In addition we show how meta-learning has already been identified as an important component in real-world applications.

Ricardo Vilalta, Christophe Giraud-Carrier, Pavel Brazdil
Chapter 34. Bias vs Variance Decomposition for Regression and Classification

In this chapter, the important concepts of bias and variance are introduced. After an intuitive introduction to the bias/variance tradeoff, we discuss the bias/variance decompositions of the mean square error (in the context of regression problems) and of the mean misclassification error (in the context of classification problems). Then, we carry out a small empirical study providing some insight about how the parameters of a learning algorithm influence bias and variance.

Pierre Geurts
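The decomposition discussed above can be checked numerically: for a prediction at a fixed point, the mean square error splits exactly into squared bias plus variance. A Monte-Carlo sketch using the sample-mean estimator of a constant target (all numbers are illustrative):

```python
import random

def estimate_bias_variance(trials=2000, n=10, sigma=1.0, seed=1):
    """Monte-Carlo check of MSE = bias^2 + variance for the sample-mean
    estimator of a constant target mu: draw many training samples, fit
    the 'learner' (here just the sample mean) on each, and decompose
    the spread of its predictions."""
    rng = random.Random(seed)
    mu = 2.0
    preds = []
    for _ in range(trials):
        sample = [mu + rng.gauss(0, sigma) for _ in range(n)]
        preds.append(sum(sample) / n)   # the learner: sample mean
    mean_pred = sum(preds) / trials
    bias2 = (mean_pred - mu) ** 2
    var = sum((p - mean_pred) ** 2 for p in preds) / trials
    mse = sum((p - mu) ** 2 for p in preds) / trials
    return bias2, var, mse

b2, v, mse = estimate_bias_variance()
```

The sample mean is unbiased, so bias^2 is near zero, the variance is near sigma^2 / n = 0.1, and the identity bias^2 + variance = MSE holds to floating-point precision.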
Chapter 35. Mining with Rare Cases

Rare cases are often the most interesting cases. For example, in medical diagnosis one is typically interested in identifying relatively rare diseases, such as cancer, rather than more frequently occurring ones, such as the common cold. In this chapter we discuss the role of rare cases in Data Mining. Specific problems associated with mining rare cases are discussed, followed by a description of methods for addressing these problems.

Gary M. Weiss
Chapter 36. Mining Data Streams

Knowledge discovery from infinite data streams is an important and difficult task. We face two challenges: the overwhelming volume of the streaming data and its concept drifts. In this chapter, we introduce a general framework for mining concept-drifting data streams using weighted ensemble classifiers. We train an ensemble of classification models, such as C4.5, RIPPER, naive Bayesian, etc., from sequential chunks of the data stream. The classifiers in the ensemble are judiciously weighted based on their expected classification accuracy on the test data under the time-evolving environment. Thus, the ensemble approach improves both the efficiency in learning the model and the accuracy in performing classification. Our empirical study shows that the proposed methods have a substantial advantage over single-classifier approaches in prediction accuracy, and the ensemble framework is effective for a variety of classification models.

Haixun Wang, Philip S. Yu, Jiawei Han
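The weighting idea above can be sketched with a toy weighted vote: each base classifier (trained on one chunk of the stream) votes with a weight reflecting its estimated accuracy on recent data, so models that learned an outdated concept fade out. The classifiers and weights here are hypothetical stand-ins, not the chapter's experimental setup:

```python
def weighted_vote(classifiers, weights, x):
    """Weighted ensemble prediction: sum each classifier's weight into
    its predicted label's score and return the highest-scoring label."""
    scores = {}
    for clf, w in zip(classifiers, weights):
        label = clf(x)
        scores[label] = scores.get(label, 0.0) + w
    return max(scores, key=scores.get)

# Three toy classifiers trained on successive chunks (hypothetical):
old = lambda x: "A"                      # learned an outdated concept
mid = lambda x: "B" if x > 5 else "A"    # partially adapted
new = lambda x: "B" if x > 3 else "A"    # reflects the current concept
# Weights derived from recent accuracy: the newest model dominates.
ensemble = ([old, mid, new], [0.1, 0.3, 0.6])
```

At x = 4 the two older models still say "A" (combined weight 0.4), but the better-weighted newest model outvotes them with "B".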
Chapter 37. Mining High-Dimensional Data

With the rapid growth of computational biology and e-commerce applications, high-dimensional data has become very common. Thus, mining high-dimensional data is an urgent problem of great practical importance. However, there are some unique challenges for mining data of high dimensions, including (1) the curse of dimensionality and, more crucially, (2) the meaningfulness of the similarity measure in the high-dimensional space. In this chapter, we present several state-of-the-art techniques for analyzing high-dimensional data, e.g., frequent pattern mining, clustering, and classification. We will discuss how these methods deal with the challenges of high dimensionality.

Wei Wang, Jiong Yang
Chapter 38. Text Mining and Information Extraction

Text Mining is the automatic discovery of new, previously unknown information, by automatic analysis of various textual resources. Text mining starts by extracting facts and events from textual sources and then enables forming new hypotheses that are further explored by traditional Data Mining and data analysis methods. In this chapter we will define text mining and describe the three main approaches for performing information extraction. In addition, we will describe how we can visually display and analyze the outcome of the information extraction process.

Moty Ben-Dov, Ronen Feldman
Chapter 39. Spatial Data Mining

Spatial Data Mining is the process of discovering interesting and previously unknown, but potentially useful patterns from large spatial datasets. Extracting interesting and useful patterns from spatial datasets is more difficult than extracting the corresponding patterns from traditional numeric and categorical data due to the complexity of spatial data types, spatial relationships, and spatial autocorrelation. This chapter provides an overview on the unique features that distinguish spatial data mining from classical Data Mining, and presents major accomplishments of spatial Data Mining research.

Shashi Shekhar, Pusheng Zhang, Yan Huang
Chapter 40. Data Mining for Imbalanced Datasets: An Overview

A dataset is imbalanced if the classification categories are not approximately equally represented. Recent years have brought increased interest in applying machine learning techniques to difficult “real-world” problems, many of which are characterized by imbalanced data. Additionally, the distribution of the testing data may differ from that of the training data, and the true misclassification costs may be unknown at learning time. Predictive accuracy, a popular choice for evaluating performance of a classifier, might not be appropriate when the data is imbalanced and/or the costs of different errors vary markedly. In this chapter, we discuss some of the sampling techniques used for balancing the datasets, and the performance measures more appropriate for mining imbalanced datasets.

Nitesh V. Chawla
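The simplest sampling technique of the kind discussed above is random oversampling: duplicate minority-class examples until all classes are equally represented. A minimal sketch (the toy records are illustrative; more refined methods such as SMOTE synthesize new examples instead of duplicating):

```python
import random

def random_oversample(rows, label_index, seed=0):
    """Random oversampling: replicate randomly chosen minority-class
    examples until every class matches the majority-class count."""
    rng = random.Random(seed)
    by_class = {}
    for r in rows:
        by_class.setdefault(r[label_index], []).append(r)
    target = max(len(v) for v in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        balanced.extend(rng.choice(members)
                        for _ in range(target - len(members)))
    return balanced

data = [(1, "pos"), (2, "neg"), (3, "neg"), (4, "neg"), (5, "neg")]
balanced = random_oversample(data, label_index=1)
```

The single "pos" example is replicated until both classes have four records.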
Chapter 41. Relational Data Mining

Data Mining algorithms look for patterns in data. While most existing Data Mining approaches look for patterns in a single data table, relational Data Mining (RDM) approaches look for patterns that involve multiple tables (relations) from a relational database. In recent years, the most common types of patterns and approaches considered in Data Mining have been extended to the relational case and RDM now encompasses relational association rule discovery and relational decision tree induction, among others. RDM approaches have been successfully applied to a number of problems in a variety of areas, most notably in the area of bioinformatics. This chapter provides a brief introduction to RDM.

Sašo Džeroski
Chapter 42. Web Mining

The World-Wide Web provides every internet citizen with access to an abundance of information, but it becomes increasingly difficult to identify the relevant pieces of information. Research in web mining tries to address this problem by applying techniques from data mining and machine learning to Web data and documents. This chapter provides a brief overview of web mining techniques and research areas, most notably hypertext classification, wrapper induction, recommender systems, and web usage mining.

Johannes Fürnkranz
Chapter 43. A Review of Web Document Clustering Approaches

Nowadays, the Internet has become the largest data repository, facing the problem of information overload. However, the web search environment is not ideal: the abundance of information, in combination with the dynamic and heterogeneous nature of the Web, makes information retrieval a difficult process for the average user. There is thus a valid requirement for the development of techniques that can help users effectively organize and browse the available information, with the ultimate goal of satisfying their information needs. Cluster analysis, which deals with the organization of a collection of objects into cohesive groups, can play a very important role towards this objective. In this chapter, we present an exhaustive survey of web document clustering approaches available in the literature, classified into three main categories: text-based, link-based, and hybrid. Furthermore, we present a thorough comparison of the algorithms based on the various facets of their features and functionality. Finally, based on the review of the different approaches, we conclude that although clustering has been a topic for the scientific community for three decades, there are still many open issues that call for more research.

Nora Oikonomakou, Michalis Vazirgiannis
Chapter 44. Causal Discovery

Many algorithms have been proposed for learning a causal network from data. It has been shown, however, that learning all the conditional independencies in a probability distribution is an NP-hard problem. In this chapter, we present an alternative method for learning a causal network from data. Our approach is novel in that it learns functional dependencies in the sample distribution rather than probabilistic independencies. Our method is based on the fact that functional dependency logically implies probabilistic conditional independency. The effectiveness of the proposed approach is explicitly demonstrated using fifteen real-world datasets.

Hong Yao, Cory J. Butz, Howard J. Hamilton
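
The chapter's starting point is that a functional dependency X → Y (equal X-values always imply equal Y-values in the sample) is easy to test and logically implies a conditional independency. A minimal sketch of such a test over a table of rows, assuming a dict-of-columns representation not taken from the chapter:

```python
def holds_fd(rows, lhs, rhs):
    """Check whether the functional dependency lhs -> rhs holds in a table
    given as a list of dicts: tuples agreeing on lhs must agree on rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        if seen.setdefault(key, val) != val:
            return False  # two rows share lhs values but differ on rhs
    return True

rows = [
    {"city": "Paris", "country": "France", "sales": 10},
    {"city": "Lyon",  "country": "France", "sales": 7},
    {"city": "Paris", "country": "France", "sales": 12},
]
print(holds_fd(rows, ["city"], ["country"]))  # → True: city determines country
print(holds_fd(rows, ["country"], ["city"]))  # → False: not vice versa
```

Discovering *all* minimal dependencies requires searching the lattice of attribute subsets; the check above is only the per-candidate test.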
Chapter 45. Ensemble Methods for Classifiers

The idea of ensemble methodology is to build a predictive model by integrating multiple models. It is well known that ensemble methods can be used to improve prediction performance. In this chapter we provide an overview of ensemble methods in classification tasks. We present the most important types of ensemble methods, including boosting and bagging. Combining methods and modeling issues such as ensemble diversity and ensemble size are also discussed.

Lior Rokach
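
Bagging, one of the two ensemble families the chapter covers, trains each base model on a bootstrap resample and combines predictions by majority vote. A toy sketch with decision stumps on one-dimensional data (an illustration under simplifying assumptions, not the chapter's own code):

```python
import random

def fit_stump(sample):
    """Decision stump: predict 1 iff x > t; pick t maximizing training accuracy."""
    candidates = sorted({x for x, _ in sample})
    def correct(t):
        return sum(int(x > t) == y for x, y in sample)
    return min(candidates, key=lambda t: -correct(t))

def bagging(data, n_models=25, seed=0):
    """Train stumps on bootstrap resamples; predict by majority vote."""
    rng = random.Random(seed)
    stumps = [fit_stump([rng.choice(data) for _ in data])
              for _ in range(n_models)]
    def predict(x):
        votes = sum(int(x > t) for t in stumps)
        return int(votes * 2 > len(stumps))
    return predict

# Labels: 0 for small x, 1 for large x
data = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]
clf = bagging(data)
print(clf(0), clf(10))  # → 0 1
```

Boosting differs in that resampling is replaced by reweighting toward previously misclassified examples, and votes are weighted by each model's accuracy.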
Chapter 46. Decomposition Methodology for Knowledge Discovery and Data Mining

The idea of decomposition methodology is to break down a complex Data Mining task into several smaller, less complex and more manageable sub-tasks that are solvable using existing tools, and then to join their solutions in order to solve the original problem. In this chapter we provide an overview of decomposition methods in classification tasks, with emphasis on elementary decomposition methods. We present the main properties that characterize various decomposition frameworks and the advantages of using these frameworks. Finally, we discuss the uniqueness of decomposition methodology as opposed to other closely related fields, such as ensemble methods and distributed data mining.

Oded Maimon, Lior Rokach
Chapter 47. Information Fusion

Information fusion techniques are commonly applied in Data Mining and Knowledge Discovery. In this chapter, we give an overview of such applications considering their three main uses. That is, we consider fusion methods for data preprocessing, model building and information extraction. Some aggregation operators (i.e., particular fusion methods) and their properties are briefly described as well.

Vicenç Torra
Chapter 48. Parallel and Grid-Based Data Mining

Data Mining is often a computing-intensive and time-consuming process. For this reason, several Data Mining systems have been implemented on parallel computing platforms to achieve high performance in the analysis of large data sets. Moreover, when large data repositories are coupled with geographical distribution of data, users and systems, more sophisticated technologies are needed to implement high-performance distributed KDD systems. Since computational Grids emerged as privileged platforms for distributed computing, a growing number of Grid-based KDD systems have been proposed. In this chapter we first discuss different ways to exploit parallelism in the main Data Mining techniques and algorithms, and then discuss Grid-based KDD systems. Finally, we introduce the Knowledge Grid, an environment that uses standard Grid middleware to support the development of parallel and distributed knowledge discovery applications.

Antonio Congiusta, Domenico Talia, Paolo Trunfio
Chapter 49. Collaborative Data Mining

Collaborative Data Mining is a setting where the Data Mining effort is distributed among multiple collaborating agents, human or software. The objective of the collaborative Data Mining effort is to produce solutions to the tackled Data Mining problem that are better, by some metric, than those that would have been achieved by individual, non-collaborating agents. The solutions require evaluation, comparison, and approaches for combination. Collaboration requires communication, and implies some form of community. The human form of collaboration is a social task. Organizing communities in an effective manner is non-trivial and often requires well-defined roles and processes. Data Mining, too, benefits from a standard process. This chapter explores the standard Data Mining process CRISP-DM utilized in a collaborative setting.

Steve Moyle
Chapter 50. Organizational Data Mining

Many organizations today possess substantial quantities of business information but have very little real business knowledge. A recent survey of 450 business executives reported that managerial intuition and instinct are more prevalent than hard facts in driving organizational decisions. To reverse this trend, businesses of all sizes would be well advised to adopt Organizational Data Mining (ODM). ODM is defined as leveraging Data Mining tools and technologies to enhance the decision-making process by transforming data into valuable and actionable knowledge to gain a competitive advantage. ODM has helped many organizations optimize internal resource allocation while better understanding and responding to the needs of their customers. The fundamental aspects of ODM can be categorized into Artificial Intelligence (AI), Information Technology (IT), and Organizational Theory (OT), with OT being the key distinction between ODM and Data Mining. In this chapter, we introduce ODM, explain its unique characteristics, and report on the current status of ODM research. Next we illustrate how several leading organizations have adopted ODM and are benefiting from it. Then we examine the evolution of ODM to the present day and conclude our chapter by contemplating ODM's challenging yet promising future.

Hamid R. Nemati, Christopher D. Barko
Chapter 51. Mining Time Series Data

Much of the world’s supply of data is in the form of time series. In the last decade, there has been an explosion of interest in mining time series data. A number of new algorithms have been introduced to classify, cluster, segment, index, discover rules, and detect anomalies/novelties in time series. While the many techniques used to solve these problems differ widely, they all share one common factor: they require some high-level representation of the data, rather than the original raw data. These high-level representations are necessary as a feature extraction step, or simply to make the storage, transmission, and computation of massive datasets feasible. A multitude of representations have been proposed in the literature, including spectral transforms, wavelet transforms, piecewise polynomials, eigenfunctions, and symbolic mappings. This chapter gives a high-level survey of time series Data Mining tasks, with an emphasis on time series representations.

Chotirat Ann Ratanamahatana, Jessica Lin, Dimitrios Gunopulos, Eamonn Keogh, Michail Vlachos, Gautam Das
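
One of the simplest high-level representations of the kind the chapter surveys is the Piecewise Aggregate Approximation (PAA), which replaces a series by the mean of each of several equal-width frames. A minimal sketch (assuming, for brevity, that the series length divides evenly into the number of segments):

```python
def paa(series, n_segments):
    """Piecewise Aggregate Approximation: represent a time series by the
    mean of each of n_segments equal-width frames.  Assumes len(series)
    is divisible by n_segments."""
    width = len(series) // n_segments
    return [sum(series[i * width:(i + 1) * width]) / width
            for i in range(n_segments)]

series = [0, 2, 4, 6, 8, 6, 4, 2]
print(paa(series, 4))  # → [1.0, 5.0, 7.0, 3.0]
```

Distances computed on PAA representations lower-bound the Euclidean distance on the raw series, which is what makes such representations safe for indexing.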

Chapter 52. Data Mining in Medicine

Extensive amounts of data stored in medical databases require the development of specialized tools for accessing the data, data analysis, knowledge discovery, and effective use of stored knowledge and data. This chapter focuses on Data Mining methods and tools for knowledge discovery. It sketches selected Data Mining techniques and illustrates their applicability to medical diagnostic and prognostic problems.

Nada Lavrač, Blaž Zupan

Chapter 53. Learning Information Patterns in Biological Databases

This chapter aims at developing the computational theory for modeling patterns and their hierarchical coordination within biological sequences. With the exception of promoters and enhancers, the functional significance of non-coding DNA is not well understood. Scientists are now discovering that specific regions of non-coding DNA interact with the cellular machinery and help bring about the expression of genes. Our premise is that it is possible to study the arrangements of patterns in biological sequences through machine learning algorithms. As biological databases continue their exponential growth, it becomes feasible to apply in-silico Data Mining algorithms to discover interesting patterns of motif arrangements and the frequency of their re-iteration. A systematic procedure for achieving this goal is presented.

Gautam B. Singh

Chapter 54. Data Mining for Selection of Manufacturing Processes

Data Mining tools extract knowledge from large databases. The data generated in manufacturing has not been entirely exploited. This chapter discusses applications of Data Mining in a manufacturing environment. A methodology for selection of manufacturing processes is proposed and illustrated with an industrial scenario.

Bruno Agard, Andrew Kusiak

Chapter 55. Data Mining of Design Products and Processes
Yoram Reich

Chapter 56. Data Mining in Telecommunications

Telecommunication companies generate a tremendous amount of data. These data include call detail data, which describes the calls that traverse the telecommunication networks; network data, which describes the state of the hardware and software components in the network; and customer data, which describes the telecommunication customers. This chapter describes how Data Mining can be used to uncover useful information buried within these data sets. Several Data Mining applications are described, and together they demonstrate that Data Mining can be used to identify telecommunication fraud, improve marketing effectiveness, and identify network faults.

Gary M. Weiss

Chapter 57. Data Mining for Financial Applications

This chapter describes Data Mining in finance by discussing financial tasks, specifics of methodologies and techniques in this Data Mining area. It includes time dependence, data selection, forecast horizon, measures of success, quality of patterns, hypothesis evaluation, problem ID, method profile, attribute-based and relational methodologies. The second part of the chapter discusses Data Mining models and practice in finance. It covers use of neural networks in portfolio management, design of interpretable trading rules and discovering money laundering schemes using decision rules and relational Data Mining methodology.

Boris Kovalerchuk, Evgenii Vityaev

Chapter 58. Data Mining for Intrusion Detection

Data Mining techniques have been successfully applied in many different fields, including marketing, manufacturing, fraud detection and network management. Over the past years there has been a lot of interest in security technologies such as intrusion detection, cryptography, authentication and firewalls. This chapter discusses the application of Data Mining techniques to computer security. Conclusions are drawn and directions for future research are suggested.

Anoop Singhal, Sushil Jajodia

Chapter 59. Data Mining for Software Testing

Software testing activities are usually planned by human experts, while test automation tools are limited to execution of pre-planned tests only. Evaluation of test outcomes is also associated with considerable effort by software testers, who may have imperfect knowledge of the requirements specification. Not surprisingly, this manual approach to software testing results in heavy losses to the world’s economy. As demonstrated in this chapter, Data Mining algorithms can be efficiently used for automated modeling of tested systems. Induced Data Mining models can be utilized for recovering system requirements, identifying equivalence classes in system inputs, designing a minimal set of regression tests, and evaluating the correctness of software outputs.

Mark Last

Chapter 60. Data Mining for CRM

Data Mining technology allows marketing organizations to better understand their customers and respond to their needs. This chapter describes how Data Mining can be combined with customer relationship management to help drive improved interactions with customers. An example showing how to use Data Mining to drive customer acquisition activities is presented.

Kurt Thearling

Chapter 61. Data Mining for Target Marketing

Targeting is the core of marketing management. It is concerned with offering the right product/service to the customer at the right time and using the proper channel. In this chapter we discuss how Data Mining modeling and analysis can support targeting applications. We focus on three types of targeting models: continuous-choice models, discrete-choice models and in-market timing models, discussing alternative modeling approaches for each application and the related decision making. We also discuss a range of pitfalls that one needs to be aware of when implementing a data mining solution for a targeting problem.

Nissan Levin, Jacob Zahavi

Software

Chapter 62. Weka

The Weka workbench is an organized collection of state-of-the-art machine learning algorithms and data preprocessing tools. The basic way of interacting with these methods is by invoking them from the command line. However, convenient interactive graphical user interfaces are provided for data exploration, for setting up large-scale experiments on distributed computing platforms, and for designing configurations for streamed data processing. These interfaces constitute an advanced environment for experimental data mining. The system is written in Java and distributed under the terms of the GNU General Public License.

Eibe Frank, Mark Hall, Geoffrey Holmes, Richard Kirkby, Bernhard Pfahringer, Ian H. Witten, Len Trigg
Chapter 63. Oracle Data Mining

Oracle has completed a major research and development effort to add native Data Mining and pattern recognition algorithms to the Oracle RDBMS. As a result, Oracle Data Mining (ODM) provides a comprehensive collection of Data Mining analytics as part of the Oracle database environment that supports the development, integration and deployment of Data Mining applications. This Data Mining infrastructure has a native SQL and PL/SQL API but can also be accessed from a Java API or the ODM user interface. ODM enables data analysts and developers to discover insights hidden in their data and create advanced Data Mining applications that extend the benefits of Data Mining to many users throughout an organization. ODM leverages the powerful and feature-rich Oracle RDBMS environment, including comprehensive capabilities for data storage, data preparation and processing, information retrieval, scalability, security, transaction control, parallelism, versioning, workflow, and reliability. In this article, we describe the functionality and algorithms behind ODM and the advantages of the "Data Mining in the database" paradigm. We conclude with two examples of the use of ODM: an SVM methodology for tumor classification and the integration of Naive Bayes predictive models in Oracle’s marketing business application (Oracle Marketing).

Tamayo P., C. Berger, M. Campos, J. Yarmus, B. Milenova, A. Mozes, M. Taft, M. Hornick, R. Krishnan, S. Thomas, M. Kelly, D. Mukhin, B. Haberstroh, S. Stephens, J. Myczkowski
Chapter 64. Building Data Mining Solutions With OLE DB for DM and XML for Analysis

A Data Mining component is included in Microsoft SQL Server 2000, one of the most popular DBMSs. This gives Data Mining technologies a push to move from a niche towards the mainstream. Apart from a few algorithms, the main contribution of SQL Server Data Mining is the implementation of OLE DB for Data Mining. OLE DB for Data Mining is an industry standard led by Microsoft and supported by a number of ISVs. It leverages two existing relational technologies: SQL and OLE DB. It defines a SQL-like language for data mining queries based on relational concepts. More recently, Microsoft, Hyperion, SAS and a few other BI vendors formed the XML for Analysis Council. XML for Analysis covers both OLAP and Data Mining. The goal is to allow consumer applications to query various BI packages from different platforms. This chapter gives an overview of OLE DB for Data Mining and XML for Analysis. It also shows how to build Data Mining applications using these APIs.

Zhaohui Tang, Jamie Maclennan, Pyungchul (Peter) Kim
Chapter 65. LERS—A Data Mining System

LERS (Learning from Examples based on Rough Sets) is a Data Mining system that induces rules from raw data sets. For rule induction LERS uses two approaches: machine learning and knowledge acquisition. In the former approach, induced rule sets are discriminant; in the latter, LERS attempts to induce all potential rules hidden in the data set. LERS accepts input data that are erroneous or inconsistent, and/or have numerical attributes and missing attribute values. The system has been used in medicine, nursing, global warming, environmental protection, natural language, data transmission, etc. LERS can process big data sets and frequently outperforms not only other rule induction systems but even human experts.

Jerzy W. Grzymala-Busse
Chapter 66. GainSmarts Data Mining System for Marketing

The winner of the KDD Cup in 1997 and 1998, GainSmarts is one of the leading Data Mining software packages. GainSmarts encompasses the entire range of the KDD process, including data import, exploratory data analysis, sampling, feature selection, modeling, knowledge evaluation, scoring, decision making and reporting. GainSmarts is most noted for its feature selection process, which employs a rule-based expert system to automatically select the most influential predictors from a much larger set of potential predictors. The modeling suite is particularly rich and includes a variety of predictive models (binary, multinomial, continuous and even survival analysis) as well as clustering and collaborative filtering models. The output reports are presented in both tabular and visual forms; some are also available in Excel, allowing the user to use Excel options to manipulate the results and conduct sensitivity analyses. Economic criteria are embedded in the decision making to drive decisions. GainSmarts was developed in SAS, with the CPU-intensive routines developed in C. It is a multi-lingual system currently available in English, Japanese and German. While developed with a marketing slant, GainSmarts’ generic nature makes it applicable to a variety of applications in diverse industries.

Nissan Levin, Jacob Zahavi
Chapter 67. WizSoft’s WizWhy
Abraham Meidan
Chapter 68. DataEngine

DataEngine is a development tool designed to facilitate intelligent analysis of data, modeling and control. Its various intelligent technologies allow users to create on-line and off-line knowledge discovery and decision support systems without having to write computer source code. How the tool interfaces with external data sources, integrates into user programs, and accepts user-defined functions is briefly described. Its editors and GUI facilitate analysis of results and report preparation.

Joseph Komem, Moti Schneider
Backmatter
Metadata
Title
Data Mining and Knowledge Discovery Handbook
Edited by
Oded Maimon
Lior Rokach
Copyright year
2005
Publisher
Springer US
Electronic ISBN
978-0-387-25465-4
Print ISBN
978-0-387-24435-8
DOI
https://doi.org/10.1007/b107408