
2004 | OriginalPaper | Book chapter

Cherry-Picking as a Robustness Tool

Written by: Leanna L. House, David Banks

Published in: Classification, Clustering, and Data Mining Applications

Publisher: Springer Berlin Heidelberg


When there are problems with data quality, it often happens that a reasonably large fraction is good data and expresses a clear statistical signal, while a smaller fraction is bad data that shows little signal. If it were possible to identify the subset of the data that collectively expresses a strong signal, then one would have a robust tool for uncovering structure in problematic datasets. This paper describes a search strategy for finding large subsets of data with strong signals. The methodology is illustrated for problems in regression. This work is part of a year-long program in statistical data mining that has been organized by SAMSI, the new National Science Foundation center for research at the interface of statistics and applied mathematics.

Recently, high dimensional datasets such as microarray gene data and point-of-sale data have become important. It is generally difficult to see the structure of data when the dimension of data is high. Therefore, many studies have proposed methods that reduce high dimensional data to lower dimensional data. Among these methods, projection pursuit was developed by Friedman and Tukey (1974) in order to search for an ‘interesting’ linear projection of multidimensional data. They defined the degree of ‘interestingness’ as the difference between the distribution of the projected data and the normal distribution. We call this measure a projection index.

However, projection indices that measure the difference from the normal distribution do not always reveal interesting structure, because interesting structure depends on the purpose of the analysis. Depending on the scientific situation that motivates the data analysis, ‘uninteresting’ structure is not always the normal distribution.

Relative projection pursuit (RPP) allows the user to predefine a reference data set that represents ‘uninteresting’ structure. The projection index for relative projection pursuit measures the distance between the distribution of the projected target data set and that of the projected reference data set. We show the effectiveness of RPP with numerical examples and actual data.
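
To make the idea of searching for a large, strongly fitting subset concrete in the regression setting, the following is a minimal sketch only. It is not the authors' algorithm, which the abstract does not spell out. It assumes a simple greedy strategy: start from several random minimal subsets, fit ordinary least squares, repeatedly grow the subset by keeping the points with the smallest absolute residuals, and return the size-`k` subset with the lowest residual sum of squares. The function name `cherry_pick` and the parameters `k`, `n_starts`, and `seed` are hypothetical.

```python
import numpy as np


def cherry_pick(X, y, k, n_starts=50, seed=0):
    """Greedy search for a size-k subset that a linear model fits well.

    Illustrative assumption, not the method of the chapter: each start
    grows a subset one point at a time by smallest absolute residual.
    Returns (indices of the chosen subset, fitted coefficients), where
    the first coefficient is the intercept.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Design matrix with an intercept column; X may be 1-D or 2-D.
    A = np.column_stack([np.ones(n), np.asarray(X, dtype=float)])
    p = A.shape[1]
    if not p <= k <= n:
        raise ValueError("k must lie between the number of coefficients and n")

    best_score, best_idx, best_beta = np.inf, None, None
    for _ in range(n_starts):
        # Minimal random starting subset: as many points as coefficients.
        idx = rng.choice(n, size=p, replace=False)
        while True:
            beta, *_ = np.linalg.lstsq(A[idx], y[idx], rcond=None)
            resid = np.abs(y - A @ beta)
            if len(idx) >= k:
                break
            # Grow by one: keep the points closest to the current fit.
            idx = np.argsort(resid)[: len(idx) + 1]
        # Score the final fit by its residual sum of squares on the subset.
        score = float(np.sum(resid[idx] ** 2))
        if score < best_score:
            best_score, best_idx, best_beta = score, np.sort(idx), beta
    return best_idx, best_beta
```

Calling `cherry_pick(x, y, k=70)` on 100 observations would return the indices of 70 points that a single line fits tightly, together with the fitted intercept and slope. Choosing `k` involves a trade-off: a larger subset uses more of the data but risks admitting bad observations.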

Metadata
Title
Cherry-Picking as a Robustness Tool
Written by
Leanna L. House
David Banks
Copyright year
2004
Publisher
Springer Berlin Heidelberg
DOI
https://doi.org/10.1007/978-3-642-17103-1_20