
1998 | Book

Feature Selection for Knowledge Discovery and Data Mining

Authors: Huan Liu, Hiroshi Motoda

Publisher: Springer US

Book series: The International Series in Engineering and Computer Science


About this book

As computer power grows and data collection technologies advance, a plethora of data is generated in almost every field where computers are used. The computer-generated data should be analyzed by computers; without the aid of computing technologies, it is certain that huge amounts of the data collected will never be examined, let alone used to our advantage. Even with today's advanced computer technologies (e.g., machine learning and data mining systems), discovering knowledge from data can still be fiendishly hard due to the characteristics of computer-generated data. In its simplest form, raw data are represented as feature-values. The size of a dataset can be measured in two dimensions: the number of features (N) and the number of instances (P). Both N and P can be enormously large. This enormity may cause serious problems for many data mining systems. Feature selection is one of the long-existing methods that deal with these problems. Its objective is to select a minimal subset of features according to some reasonable criteria so that the original task can be achieved equally well, if not better. By choosing a minimal subset of features, irrelevant and redundant features are removed according to the criterion. When N is reduced, the data space shrinks and, in a sense, the data set becomes a better representative of the whole data population. If necessary, the reduction of N can also give rise to a reduction of P by eliminating duplicates.
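The last point, that reducing N can in turn reduce P, can be illustrated with a minimal sketch (the data and feature layout here are hypothetical, not taken from the book):

```python
# Hypothetical dataset: removing a redundant feature (reducing N)
# exposes duplicate instances, which in turn reduces P.
rows = [
    # (f1, f2, f2_copy) -- f2_copy is redundant: it always duplicates f2
    (0, 1, 1),
    (0, 1, 1),
    (1, 0, 0),
    (1, 0, 0),
]

# Drop the redundant third feature (N: 3 -> 2) ...
reduced = [(f1, f2) for f1, f2, _ in rows]

# ... then eliminate duplicate instances (P: 4 -> 2).
unique = sorted(set(reduced))
print(unique)  # [(0, 1), (1, 0)]
```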

Table of Contents

Frontmatter
1. Data Processing and Knowledge Discovery in Databases
Abstract
With advanced computer technologies and their omnipresent usage, data accumulates at a speed that outstrips the human capacity for data processing. To meet this growing challenge, the community of knowledge discovery from databases emerged not long ago. The key issue studied by the community is, in layman's terms, to make use of the data to our advantage. Or, why should we collect so much of it in the first place? In order to make the raw data useful, we need to represent it, process it, extract knowledge from it, and present and understand that knowledge for various applications. In this first chapter, we provide the computational model of our study and the representation of data, and introduce the field of knowledge discovery from databases, which evolves from many fields such as classification and clustering in statistics, pattern recognition, neural networks, machine learning, databases, exploratory data analysis, on-line analytical processing, optimization, high-performance and parallel computing, knowledge modeling, and data visualization. Ever-advancing data processing technology and the increasing demand to take advantage of stored data form a new challenge for data mining; one solution to this new challenge is feature selection, the core of this study.
Huan Liu, Hiroshi Motoda
2. Perspectives of Feature Selection
Abstract
From here on, we study feature selection for classification. By choosing this type of feature selection, we can focus on many common perspectives of feature selection, obtain a deep understanding of its basic issues, appreciate many different methods of feature selection, and later in the book move on to related topics. The problem of feature selection can be examined from many perspectives. The four major ones are: (1) how to search for the "best" features; (2) what should be used to determine the best features, i.e., what the evaluation criteria are; (3) how new features should be generated for selection — by adding or deleting one feature from the existing subset, or by changing a subset of features (that is, whether feature generation is conducted sequentially or in parallel); and (4) how applications determine feature selection. Applications have different requirements in terms of computational time, results, etc. For instance, the focus of machine learning (Dietterich, 1997) differs from that of data mining (Fayyad et al., 1996).
Huan Liu, Hiroshi Motoda
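Perspective (3), sequential feature generation, can be sketched as a greedy forward search. This is a minimal illustration, not the book's own algorithm; the `score` criterion below is a hypothetical stand-in for any of the evaluation measures discussed later:

```python
def forward_select(features, score):
    """Greedy sequential forward selection: add one feature at a time
    as long as the evaluation criterion improves."""
    selected, best = [], score([])
    remaining = list(features)
    while remaining:
        # Pick the single feature whose addition scores highest.
        candidate = max(remaining, key=lambda f: score(selected + [f]))
        cand_score = score(selected + [candidate])
        if cand_score <= best:
            break  # no single feature improves the criterion
        selected.append(candidate)
        remaining.remove(candidate)
        best = cand_score
    return selected

# Toy criterion: suppose only features "a" and "c" are relevant,
# with a small penalty per feature to favor a minimal subset.
relevant = {"a", "c"}
score = lambda subset: len(set(subset) & relevant) - 0.1 * len(subset)
print(forward_select(["a", "b", "c"], score))  # -> ['a', 'c']
```

A backward variant would start from the full set and greedily delete features; both are instances of the sequential generation option named in the abstract.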
3. Feature Selection Aspects
Abstract
With a unified model of feature selection, we are ready to discuss in detail different aspects of feature selection. The major aspects of feature selection are (1) search directions (feature subset generation), (2) search strategies, and (3) evaluation measures. The objective of this chapter is two-fold: (a) to study the various options for each aspect in a systematic and principled way and (b) to identify the essential and different characteristics of various feature selection systems.
Huan Liu, Hiroshi Motoda
4. Feature Selection Methods
Abstract
With the unified model, we were able to study the three major aspects in the previous chapter. Now we look at possible combinations of these aspects, each of which can be used to construct a feature selection method. The objective of this chapter is three-fold: (a) to categorize the existing methods in a framework defined by the three major aspects; (b) to discover what has not been done and what can be done; and (c) to pave the way towards a meta-system that links applications to specific methods.
Huan Liu, Hiroshi Motoda
5. Evaluation and Application
Abstract
No matter how great a tool is, if it is applied to the wrong problem, it may delay problem solving or, even worse, cause harm. Experts are needed to match tools with problems. They are not just domain experts who know a lot about the problems they face, but also experts who are familiar with feature selection methods. Even if it is not impossible to have such experts, they are rare to find; and the matching problem still exists regardless of whether we can find them or not. The second-best solution is to abstract both tools and problems. By relating tools to problems, hopefully, we can help solve the matching problem. Abstracting problems can be done by domain experts according to characteristics of the data; abstracting tools requires measures that can tell us how good each method is and under what circumstances it is good. In order to apply feature selection methods wisely, we first need to discuss their performance.
Huan Liu, Hiroshi Motoda
6. Feature Transformation and Dimensionality Reduction
Abstract
In the previous chapters, we focused on feature subset selection. We now discuss related and/or less developed topics with respect to feature transformation and dimensionality reduction. The first two sections are about feature transformation, introducing techniques from statistics, machine learning, and knowledge discovery. The third section discusses feature discretization, which is closely related to dimensionality reduction: if subset selection allows reduction in one dimension, feature discretization could enable reduction along two dimensions. In the fourth section, we go beyond the classification model and explore feature selection without class information. Data without class information are unsupervised (unlabeled) data. As we move from supervised (labeled) data to unsupervised data, we also step into a territory that is not as well explored as feature selection for classification. However, we foresee the rising need for unsupervised feature selection and expect more work to be carried out in the near future.
Huan Liu, Hiroshi Motoda
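Feature discretization, as discussed in the third section, can be sketched with simple equal-width binning (a standard technique; the function and data here are illustrative, not the book's own algorithm):

```python
def equal_width_discretize(values, k):
    """Discretize continuous values into k equal-width intervals,
    returning an interval index (0..k-1) for each value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # guard against a constant feature
    return [min(int((v - lo) / width), k - 1) for v in values]

# Hypothetical continuous feature discretized into 2 intervals.
temps = [0.0, 2.5, 4.9, 5.0, 7.5, 10.0]
print(equal_width_discretize(temps, 2))  # [0, 0, 0, 1, 1, 1]
```

Mapping many continuous values onto a few interval indices shrinks the value range of each feature, which, combined with subset selection, can collapse more instances into duplicates — reduction along both dimensions.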
7. Less is More
Abstract
Now we have almost arrived at the end of our journey through the various methods for feature manipulation. It is time to reflect on the underlying principle behind the ideas and motivations of so many methods. "Less is more" is the philosophy manifested throughout this book. We are not only facing the grim reality of ever-growing amounts of data but are also constrained by the limitations of our tools and representations. Nevertheless, we are determined to continue our pursuit of finding regularities in disordered data and discovering knowledge hidden in massive data. With more data accumulated, we hope to discover many more valuables in the data. "Less is more" says that in order to get more useful findings from the data, we first need to make the data less by removing its irrelevant parts.
Huan Liu, Hiroshi Motoda
Backmatter
Metadata
Title
Feature Selection for Knowledge Discovery and Data Mining
Authors
Huan Liu
Hiroshi Motoda
Copyright year
1998
Publisher
Springer US
Electronic ISBN
978-1-4615-5689-3
Print ISBN
978-1-4613-7604-0
DOI
https://doi.org/10.1007/978-1-4615-5689-3