2001 | Book

Data Mining for Scientific and Engineering Applications

Edited by: Robert L. Grossman, Chandrika Kamath, Philip Kegelmeyer, Vipin Kumar, Raju R. Namburu

Publisher: Springer US

Book series: Massive Computing

Included in: Professional Book Archive

About this book

Advances in technology are making massive data sets common in many scientific disciplines, such as astronomy, medical imaging, bio-informatics, combinatorial chemistry, remote sensing, and physics. To find useful information in these data sets, scientists and engineers are turning to data mining techniques. This book is a collection of papers based on the first two in a series of workshops on mining scientific datasets. It illustrates the diversity of problems and application areas that can benefit from data mining, as well as the issues and challenges that differentiate scientific data mining from its commercial counterpart. While the focus of the book is on mining scientific data, the work is of broader interest as many of the techniques can be applied equally well to data arising in business and web applications.
Audience: This work would be an excellent text for students and researchers who are familiar with the basic principles of data mining and want to learn more about the application of data mining to their problem in science or engineering.

Table of contents

Frontmatter

Chapter 1. On Mining Scientific Datasets

Abstract

Data mining techniques have gained acceptance as a viable means of finding useful information in data. While the techniques can be applied to any kind of data, a brief survey of the work presented at recent conferences in data mining and knowledge discovery might lead one to believe that these techniques are being applied mainly to commercial data sets, to address problems such as customer relationship management, market basket analysis, credit card fraud, etc. Often overlooked is the fact that data mining techniques have long been applied to scientific datasets, with fields such as remote sensing, astronomy, biology, physics, and chemistry, providing a rich environment for the practice of these techniques. In this paper, I describe the various scientific and engineering areas in which data mining is playing an important role and discuss some of the issues that make scientific data mining different from its commercial counterpart. I show that the diversity of applications, the richness of the problems faced by practitioners, and the opportunity to borrow ideas from other domains, make scientific data mining an exciting and challenging field.

Chandrika Kamath

Chapter 2. Understanding High Dimensional and Large Data Sets: Some Mathematical Challenges and Opportunities

Abstract

Spectacular advances in sensor technology, data storage devices, and large-scale computing are producing huge data sets. These large and high-dimensional sets arise naturally in a variety of contexts such as the dynamics of the Internet, imaging for surveillance and diagnostics, and gene sequencing. The significant change in the scale and complexity embodied in these types of data, as well as the intricacies of the underlying phenomena being studied, present some new conceptual challenges. There has been considerable research activity dealing with the organization and analysis of such large data sets. But, by and large, these approaches have had only limited success towards the goal of understanding fully the inherent structures of these large data sets. There is a need, therefore, for new fundamental thinking about these problems and new mathematical approaches. In this paper we review a few such promising directions that draw extensively from fertile areas of harmonic analysis, discrete mathematics, stochastic analysis, and statistical methods.

Jagdish Chandra

Chapter 3. Data Mining at the Interface of Computer Science and Statistics

Abstract

This chapter is written for computer scientists, engineers, mathematicians, and scientists who wish to gain a better understanding of the role of statistical thinking in modern data mining. Data mining has attracted considerable attention both in the research and commercial arenas in recent years, involving the application of a variety of techniques from both computer science and statistics. The chapter discusses how computer scientists and statisticians approach data from different but complementary viewpoints and highlights the fundamental differences between statistical and computational views of data mining. In doing so we review the historical importance of statistical contributions to machine learning and data mining, including neural networks, graphical models, and flexible predictive modeling. The primary conclusion is that closer integration of computational methods with statistical thinking is likely to become increasingly important in data mining applications.

Padhraic Smyth

Chapter 4. Mining Large Image Collections

Abstract

NASA has been involved with remote exploration of the solar system for over forty years and, as a result, has accumulated a vast archive of images. Continued improvements in acquisition and storage technology are yielding new image sets with data volumes measured in terabytes. Within these large image collections there is a wealth of scientific information, but getting from the data to knowledge is a difficult problem both due to the size of the datasets involved and the difficulty of automatically interpreting image data. This chapter provides an overview of our efforts to develop algorithms for mining useful information from large image collections.

Michael C. Burl

Chapter 5. Mining Astronomical Databases

Abstract

The development of software tools and techniques for the efficient access and analysis of large astronomical databases poses some unique challenges. We briefly describe some of the problems astronomical data and datasets present and give an example from our own efforts to automate the classification of galaxies, and then discuss where “clustering” algorithms may be applicable.

Roberta M. Humphreys, Juan E. Cabanela, Jeffrey Kriessler

Chapter 6. Searching for Bent-Double Galaxies in the First Survey

Abstract

Data mining techniques are increasingly gaining popularity in various scientific domains as viable approaches to the analysis of massive data sets. In this chapter, we describe our experiences in applying data mining to a problem in astronomy, namely, the identification of radio-emitting galaxies with a bent-double morphology. Until recently, astronomers associated with the FIRST (Faint Images of the Radio Sky at Twenty-cm) survey identified these galaxies through a visual inspection of images. While this manual approach has been very subjective and tedious, it is also becoming increasingly infeasible as the survey has grown in size. Upon completion, FIRST will include almost a million galaxies, making the use of semi-automated analysis methods necessary. We describe the FIRST data set and the problem of identifying bent-double galaxies. We discuss our solution approach, focusing on the challenges we face in the application of data mining to a scientific data set. We explain why, in contrast with most commercial data mining applications, data preprocessing requires a considerable effort in scientific applications. Using decision tree classifiers, we describe the work we are doing in the detection of bent-double galaxies. Our results indicate that data mining techniques, steered by proper domain knowledge, can greatly enhance the manual exploration of massive data sets.

Chandrika Kamath, Erick Cantú-Paz, Imola K. Fodor, Nu Ai Tang

Chapter 7. A Dataspace Infrastructure for Astronomical Data

Abstract

This article describes an internet infrastructure for working with data called DataSpace. A distributed DataSpace application containing data from the 2MASS and DPOSS astronomical data sets is also described. DataSpace is designed so that client applications supporting the remote analysis and distributed mining of data are easy to build.

Robert Grossman, Emory Creel, Marco Mazzucco, Roy Williams

Chapter 8. Data Mining Applications in Bioinformatics

Abstract

This chapter describes opportunities for data mining in the emerging arena of bioinformatics applications. We outline the nature of research issues in bioinformatics and the motivating data management and analysis tasks. Descriptions of successful applications are given, along with an outline of the near-future potential and issues affecting the successful application of data mining.

Naren Ramakrishnan, Ananth Y. Grama

Chapter 9. Mining Residue Contacts in Proteins

Abstract

In this paper we develop data mining techniques to predict 3D contact potentials among protein residues (or amino acids) based on the hierarchical nucleation-propagation model of protein folding. We apply a hybrid approach, using a Hidden Markov Model to extract folding initiation sites, and then apply association mining to discover contact potentials. The new hybrid approach achieves accuracy results better than those reported previously.

Mohammed J. Zaki, Chris Bystroff

Chapter 10. KDD Services at the Goddard Earth Sciences Distributed Active Archive Center

Abstract

NASA’s Goddard Earth Sciences Distributed Active Archive Center (GES DAAC) processes, stores and distributes earth science data from a variety of remote sensing satellites. End users of the data range from instrument scientists to global change and climate researchers to federal agencies and foreign governments. Many of these users apply Knowledge Discovery from Databases (KDD) techniques to large volumes of data (on the order of a terabyte) received from the GES DAAC. However, rapid advances in computer power are enabling increases in data processing that are outpacing tape drive performance and network capacity. As a result, the proportion of data that can be distributed to users continues to decrease. As mitigation, we are migrating more knowledge extraction (e.g., data mining and data reduction) activities into the data center in order to reduce the data volume that needs to be distributed and to offer the users a more useful and manageable product. This migration of activities faces several technical and human-factor challenges. As data reduction and mining algorithms are often quite specific to the user’s research needs, the user’s algorithm must be integrated virtually unchanged into the archive environment. Also, the archive itself is busy with everyday data archive and distribution activities and cannot be dedicated to, or even impacted by, the mining activities. Therefore, we schedule KDD “campaigns”, during which we schedule a wholesale retrieval of specific data products, offering users the opportunity to extract information from the data being retrieved during the campaign.

Christopher Lynnes, Robert Mack

Chapter 11. Data Mining in Integrated Data Access and Data Analysis Systems

Abstract

The rapid increase in the volume of scientific data sets has resulted in distributed data information systems applicable to Earth system science. Such a system should help users locate data sets, provide preliminary research results quickly, and support data deliveries at users’ request. At George Mason University, we have been developing a data information system with both search and analysis components. In this system, three phases of data access are supported: phase one for meta-data search; phase two for on-line data analysis; and phase three for data ordering. For large volumes of data, searching on meta-data only will not be adequate. Scientists often need to search for data based on actual data values. This is a particular kind of data mining, which searches for data sets based on data content.

In this chapter, we first describe the system architecture. We then develop the concept of a data pyramid model and propose a histogram clustering technique for content-based searches. We use the model and the related technique to answer content-based queries approximately but efficiently. We will also describe our prototypes that integrate the content-based searches into a data information system.

Ruixin Yang, Menas Kafatos, Kwang-Su Yang, X. Sean Wang

Chapter 12. Spatial Data Mining for Classification, Visualisation and Interpretation with Artmap Neural Network

Abstract

Accurate global land cover information is required for many aspects of global change research. Remote sensing provides the only viable basis for the production of this information. This paper reports research undertaken as part of the MODIS effort to map the land cover of North America using the ARTMAP neural network. The main objective is to design a system called ART-VIP (ART for Visualisation and Image Processing) that integrates the ARTMAP neural network algorithm into a standard public domain image processing software, and to help users analyse and interpret the dynamics of the ARTMAP neural network with scientific visualisation tools. The provision of public domain software and methodologies facilitates the use of the ARTMAP neural network architecture for other land cover classification problems.

Weiguo Liu, Sucharita Gopal, Curtis Woodcock

Chapter 13. Real Time Feature Extraction for the Analysis of Turbulent Flows

Abstract

The study of fluid flow turbulence has been an active area of research for over 100 years, mainly because of its technological importance to a vast number of applications. In recent times with the advent of supercomputers and new experimental imaging techniques, terabyte scale data sets are being generated, and hence storage as well as analysis of this data has become a major issue. In this chapter we outline a new approach to tackling these data-sets which relies on selective data storage based on real-time feature extraction and utilizing data mining tools to aid in the discovery and analysis of the data. Visualization results are presented which highlight the type and number of spatially and temporally evolving coherent features that can be extracted from the data sets as well as other high level features.

I. Marusic, G. V. Candler, V. Interrante, P. K. Subbareddy, A. Moss

Chapter 14. Data Mining for Turbulent Flows

Abstract

Data mining techniques hold great promise for enabling the automatic analysis of large data sets generated by scientific simulation, and thus, may help engineers and scientists unravel the causal relationships in the underlying system. In this chapter, we propose several data modeling methods to incorporate spatial and temporal features of scientific simulation data and investigate some of them in the context of developing models for predicting burst events in turbulent flow. We use the classification rules algorithm C4.5rules and support-vector machines on the turbulent flow simulation data to develop predictive models for identifying upward or downward velocity movements of the flow close to the wall as a function of swirl strength in the nearby region.

Eui-Hong Han, George Karypis, Vipin Kumar

Chapter 15. EVITA — Efficient Visualization and Interrogation of Tera-Scale Data

Abstract

Large-scale computational simulations of physical phenomena produce data of unprecedented size (terabyte and petabyte range). Unfortunately, development of appropriate data management and visualization techniques has not kept pace with the growth in size and complexity of such datasets. To address these issues, we are developing a prototype, integrated system (EVITA) to facilitate exploration of tera-scale datasets. The cornerstone of the EVITA system is a representational scheme that allows ranked access to macroscopic features in the dataset. The data and grid are transformed using wavelet techniques while a feature-detection algorithm is used to identify and rank contextually significant features directly in the wavelet domain. The most significant parts of the dataset are thus available for detailed examination in a progressive fashion. The work presented here is similar in essence to much of the work in the traditional data-mining community. We first describe the basic system and follow with a discussion of ongoing work, focusing on efforts in multiscale feature detection and progressive access. Finally, we demonstrate the system for a two-dimensional vector field derived from an oceanographic dataset.

Raghu Machiraju, James E. Fowler, David Thompson, Bharat Soni, Will Schroeder

Chapter 16. Towards Ubiquitous Mining of Distributed Data

Abstract

The demand for understanding and exploring large quantities of data is growing fast in many domains. Scientific research is one among them. While the role of high performance computers in scientific data analysis is important, networks of workstations and so-called “thin” computing devices such as laptops, palmtops, and wearable computers are playing increasingly important roles in this domain. This chapter presents an overview of a collection of techniques that are designed for analyzing heterogeneous data distributed over a network of different computing and storage devices. The collective data mining approach presented here pays careful attention to the overhead of data communication in a heterogeneous network and offers the capability of ubiquitous mining from distributed data.

Hillol Kargupta, Krishnamoorthy Sivakumar, Weiyun Huang, Rajeev Ayyagari, Rong Chen, Byung-Hoon Park, Erik Johnson

Chapter 17. Decomposable Algorithms for Data Mining

Abstract

Most data mining algorithms have been designed and developed with the assumption that all the relevant data is resident at a single node of a computer network. In a networked environment the data relevant for a mining task may reside, in components, on a number of geographically distributed computer nodes. These databases cannot be easily moved to other network sites due to their size, ownership and security considerations. Mining algorithms, therefore, are needed that can decompose themselves to match the nature of data distribution across nodes and execute partial computations at local data sites and then compose these local summaries to construct global results. We present some mining algorithms having such dynamic decomposition capability.

Raj Bhatnagar

Chapter 18. HDDI™: Hierarchical Distributed Dynamic Indexing

Abstract

The explosive growth of digital repositories of information has been enabled by recent developments in communication and information technologies. The global Internet/World Wide Web exemplifies the rapid deployment of such technologies. Despite significant accomplishments in internetworking, however, scalable indexing and data-mining techniques for computational knowledge management lag behind the rapid growth of distributed collections. Hierarchical Distributed Dynamic Indexing (HDDI™) is an approach that dynamically creates a hierarchical index from distributed document collections. At each node of the hierarchical index, a knowledge base is created and subtopic regions of semantic locality in conceptual space are identified. This chapter introduces HDDI™, focusing on the model building techniques employed at each node of the hierarchy. A novel approach to information clustering based on the contextual transitivity of similarity between terms is introduced. We conclude with several example applications of HDDI™ in the textual data mining and information retrieval fields.

William M. Pottenger, Yong-Bin Kim, Daryl D. Meling

Chapter 19. Parallel Algorithms for Clustering High-Dimensional Large-Scale Datasets

Abstract

Clustering techniques for large scale and high dimensional data sets have found great interest in recent literature. Such data sets are found both in scientific and commercial applications. Clustering is the process of identifying dense regions in a sparse multi-dimensional data set. Several clustering techniques proposed earlier lack scalability either to a very large set of dimensions or to a large data set. Many of them require key user inputs, making them hard to use on real world data sets, or fail to represent the generated clusters in an intuitive way. We have designed and implemented pMAFIA, a density and grid based clustering algorithm wherein a multi-dimensional space is divided into finer grids and the dense regions found are merged together to identify the clusters. For large data sets with a large number of dimensions, fine division of the multi-dimensional space leads to an enormous amount of computation. We have introduced an adaptive grid framework which not only reduces the computation vastly by forming grids based on the data distribution, but also improves the quality of clustering. Clustering algorithms also need to explore clusters in a subspace of the total data space. We have implemented a new bottom up algorithm which explores all possible subspaces to identify the embedded clusters. Further, our framework requires no user input, making pMAFIA a completely unsupervised data mining algorithm. Finally, we have also introduced parallelism in the clustering process, which enables our data mining tool to scale up to massive data sets and a large set of dimensions. Data parallelism coupled with task parallelism has been shown to yield the best parallelization results on a diverse set of synthetic and real data sets.

Harsha Nagesh, Sanjay Goil, Alok Choudhary
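The grid-and-density idea summarized in the Chapter 19 abstract (divide the space into cells, keep the dense cells, merge neighbouring dense cells into clusters) can be illustrated with a minimal sketch. This is not pMAFIA: it assumes a uniform 2-D grid and two user-supplied parameters (`cell_size` and `min_points`, both hypothetical names), whereas the abstract describes adaptive grids, subspace search, parallelism, and no user input.

```python
from collections import Counter, deque

def grid_clusters(points, cell_size, min_points):
    """Group dense grid cells into clusters (uniform 2-D grid; illustrative only)."""
    # Count how many points fall into each grid cell.
    counts = Counter((int(x // cell_size), int(y // cell_size)) for x, y in points)
    # A cell is "dense" if it holds at least min_points points (assumed threshold rule).
    dense = {cell for cell, count in counts.items() if count >= min_points}

    clusters, seen = [], set()
    for cell in sorted(dense):
        if cell in seen:
            continue
        # Breadth-first search merges dense cells that share an edge.
        seen.add(cell)
        queue, component = deque([cell]), []
        while queue:
            cx, cy = queue.popleft()
            component.append((cx, cy))
            for neighbour in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                if neighbour in dense and neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        clusters.append(sorted(component))
    return clusters

# Made-up sample data: two dense regions and one isolated point.
sample_points = [(0.1, 0.1), (0.2, 0.3), (1.1, 0.2), (1.5, 0.4),
                 (5.2, 5.1), (5.5, 5.5), (9.0, 0.5)]
print(grid_clusters(sample_points, cell_size=1.0, min_points=2))
```

Each returned cluster is a list of grid-cell coordinates; the isolated point falls in a cell below the density threshold and is left out.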

Chapter 20. Efficient Clustering of Very Large Document Collections

Abstract

An invaluable portion of scientific data occurs naturally in text form. Given a large unlabeled document collection, it is often helpful to organize this collection into clusters of related documents. By using a vector space model, text data can be treated as high-dimensional but sparse numerical data vectors. It is a contemporary challenge to efficiently preprocess and cluster very large document collections. In this paper we present a time and memory efficient technique for the entire clustering process, including the creation of the vector space model. This efficiency is obtained by (i) a memory-efficient multi-threaded preprocessing scheme, and (ii) a fast clustering algorithm that fully exploits the sparsity of the data set. We show that this entire process takes time that is linear in the size of the document collection. Detailed experimental results are presented — a highlight of our results is that we are able to effectively cluster a collection of 113,716 NSF award abstracts in 23 minutes (including disk I/O costs) on a single workstation with modest memory consumption.

Inderjit S. Dhillon, James Fan, Yuqiang Guan

Chapter 21. A Scalable Hierarchical Algorithm for Unsupervised Clustering

Abstract

Top-down hierarchical clustering can be done in a scalable way. Here we describe a scalable unsupervised clustering algorithm designed for large datasets from a variety of applications. The method constructs a tree of nested clusters top-down, where each cluster in the tree is split according to the leading principal direction. We use a fast principal direction solver to achieve a fast overall method. The algorithm can be applied to any dataset whose entries can be embedded in a high dimensional Euclidean space, and takes full advantage of any sparsity present in the data. We show the performance of the method on text document data, in terms of both scalability and quality of clusters. We demonstrate the versatility of the method in different domains by showing results from text documents, human cancer gene expression data, and astrophysical data. For that last domain, we use an out of core variant of the underlying method which is capable of efficiently clustering large datasets using only a relatively small memory partition.

Daniel Boley
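The splitting step described in the Chapter 21 abstract, where a cluster is divided according to its leading principal direction, can be sketched as follows. This is a minimal illustration under stated assumptions, not the chapter's implementation: plain power iteration stands in for the "fast principal direction solver", and the cluster is split at the sign of each point's projection onto that direction after subtracting the mean, a rule the abstract does not spell out. The function names are hypothetical.

```python
import random

def leading_direction(points, iterations=200, seed=0):
    """Approximate the leading principal direction by power iteration."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - mean[j] for j in range(d)] for p in points]
    # Random start vector; assumed not orthogonal to the leading direction.
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    for _ in range(iterations):
        # Apply the (unnormalized) covariance matrix as C^T (C v).
        proj = [sum(row[j] * v[j] for j in range(d)) for row in centered]
        w = [sum(proj[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return mean, v

def split_cluster(points):
    """Split one cluster in two by the sign of the projection (assumed rule)."""
    mean, v = leading_direction(points)
    left, right = [], []
    for p in points:
        score = sum((p[j] - mean[j]) * v[j] for j in range(len(v)))
        (left if score <= 0 else right).append(p)
    return left, right

# Made-up sample data: two groups separated along the first coordinate.
sample = [(0, 0), (0, 1), (1, 0), (10, 0), (10, 1), (11, 0)]
left, right = split_cluster(sample)
print(left, right)
```

Applying `split_cluster` recursively to the resulting halves would produce a top-down tree of nested clusters; which half ends up as `left` depends on the sign of the computed direction.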

Chapter 22. High-Performance Singular Value Decomposition

Abstract

Singular value decomposition is, among other things, a dimensionality reduction technique. It is used in data mining as a way to improve similarity measurements and as a preprocessing step before automatic clustering. We present several parallel algorithms for computing the SVD of a large matrix. Given a matrix with n rows and m columns, p-fold speedup of the computation part is achieved. Communication overheads range from O(m²) to O(pm²), which is smaller than the communication overheads of techniques based on parallelizing the inner loop of Hestenes-style algorithms.

David B. Skillicorn, Xiaolan Yang

Chapter 23. Mining High-Dimensional Scientific Data Sets Using Singular Value Decomposition

Abstract

Clustering is an undirected knowledge discovery technique based on the partitioning of large sets of data objects into homogeneous groups. All objects contained in the same group have similar characteristics. Grouping multivariate data is a difficult data mining task when no domain knowledge on data structure is available. In this chapter we describe the use of a well known linear projection technique, called Singular Value Decomposition (SVD), to discover clusters. SVD is an optimal dimensionality reduction method that projects a multi-dimensional pattern space into a subspace that preserves the character of data. The plot of the 2- or 3-dimensional points gives the human user a guide to discover the presence of homogeneous groups (clusters) in the data set. A user, by inspecting the graphical 2- or 3-dimensional representations, may identify groups on the basis of space density and define threshold values to separate clusters. Experimental results on real scientific data sets assess the quality of the clustering obtained.

Ekaterina Maltseva, Clara Pizzuti, Domenico Talia

Chapter 24. Spatial Dependence in Data Mining

Abstract

Data sets that represent observational units that are located at different points on a map often exhibit spatial dependence. Symptoms of such spatial dependence include clustering of similar values by location (e.g., house prices and incomes are similar in a subdivision, wetlands are usually near other wetlands, pollution gradually dissipates away from the source). This chapter discusses alternative models and methods that can be used to estimate regression relationships for this type of sample data.

James P. LeSage, R. Kelley Pace

Chapter 25. SPARC: Spatial Association Rule-Based Classification

Abstract

Spatial classification classifies spatial objects based on the spatial and nonspatial features of these objects in a database. The classification results, taken as the models for the data, can be used for better understanding of the relationships among the objects in the database and for prediction of characteristics and features of new objects. Spatial classification is a challenging task due to the sparsity of spatial features, which leads to high dimensionality and also the “curse of dimensionality”. In this paper, we introduce an association-based spatial classification algorithm, called SPARC (SPatial Association Rule-based Classification), for efficient spatial classification in large geospatial databases. SPARC explores spatial association-based classification and integrates a few important techniques developed in spatial indexing and data mining to achieve high scalability when classifying a large number of spatial data objects. These techniques include micro-clustering, spatial join indexing, feature reduction by frequent pattern mining, and association-based classification. Our performance study shows that SPARC is efficient for classification of spatial objects in large databases.

Jiawei Han, Anthony K. H. Tung, Jing He

Chapter 26. What’s Spatial About Spatial Data Mining: Three Case Studies

Abstract

Spatial data mining is the process of discovering interesting and previously unknown, but potentially useful, patterns from large spatial datasets. Extracting interesting and useful patterns from spatial datasets is more difficult than extracting the corresponding patterns from traditional numeric and categorical data due to the complexity of spatial data types, spatial relationships, and spatial autocorrelation. A popular approach is to apply classical data mining techniques after transforming spatial components into non-spatial components via feature selection. An alternative is to explore new models, new objective functions, and new patterns which are more suitable for spatial data and their unique properties. This chapter investigates techniques in the literature to incorporate spatial components via feature selection, new models, new objective functions, and new patterns.

Shashi Shekhar, Yan Huang, Weili Wu, C. T. Lu, S. Chawla

Chapter 27. Predicting Failures in Event Sequences

Abstract

In this paper we develop new techniques for predicting failures and monitoring in categorical event sequences. New techniques are needed because failures are rare and previous data mining algorithms were overwhelmed by the staggering number of very frequent, but entirely unpredictive patterns that exist in such databases. This paper combines several techniques for pruning out unpredictive and redundant patterns, which reduce the size of the returned rule set by more than three orders of magnitude. As a concrete application, we present PlanMine, an algorithm to extract patterns of events that predict failures in databases of plan executions. PlanMine has also been fully integrated into two real-world planning systems. We experimentally evaluate the rules discovered by PlanMine, and show that they are extremely useful for understanding and improving plans, as well as for building monitors that raise alarms before failures happen.

Mohammed J. Zaki, Neal Lesh, Mitsunori Ogihara

Chapter 28. Efficient Algorithms for Mining Long Patterns in Scientific Data Sets

Abstract

In this paper we present an algorithm for mining long patterns in databases. The algorithm finds large itemsets by using depth first search on a lexicographic tree of itemsets. The focus of this paper is to develop CPU-efficient algorithms for finding frequent itemsets in the cases when the database contains patterns which are very wide. We refer to this algorithm as DepthProject, and it achieves up to two orders of magnitude speedup over the recently proposed MaxMiner algorithm for finding long patterns. These techniques may be quite useful for applications in areas such as computational biology in which the number of records is relatively small, but the itemsets are very long. This necessitates the discovery of patterns using algorithms which are especially tailored to the nature of such domains.

Ramesh C. Agarwal, Charu C. Aggarwal
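The core idea named in the Chapter 28 abstract, a depth first search over a lexicographic tree of itemsets, can be illustrated with a short sketch. It is a loose illustration only and does not reproduce DepthProject or its optimizations; the function name, the `min_count` parameter, and the sample transactions are assumptions made for the example.

```python
def frequent_itemsets(transactions, min_count):
    """Enumerate frequent itemsets by depth-first search in lexicographic order."""
    items = sorted({item for t in transactions for item in t})
    results = {}

    def extend(prefix, covering, candidates):
        # `covering` holds only the transactions that contain every item in
        # `prefix`, so each level of the search scans a smaller set.
        for i, item in enumerate(candidates):
            new_cover = [t for t in covering if item in t]
            if len(new_cover) >= min_count:
                itemset = prefix + (item,)
                results[itemset] = len(new_cover)
                # Only lexicographically later items may extend this node.
                extend(itemset, new_cover, candidates[i + 1:])

    extend((), [frozenset(t) for t in transactions], items)
    return results

# Made-up sample data.
sample_transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
for itemset, count in sorted(frequent_itemsets(sample_transactions, min_count=2).items()):
    print(itemset, count)
```

Because an infrequent itemset is never extended, whole subtrees of the lexicographic tree are skipped; here `("a", "b", "c")` occurs only once and is therefore not reported.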

Chapter 29. Probabilistic Estimation in Data Mining

Abstract

The goal of scientific inquiry is to uncover the principles that govern the world around us, and ultimately to express those principles in a mathematical form that reflects the empirical characteristics of observed data. In this regard, we have been exploring ways of modifying machine learning techniques so that the resulting predictive models likewise reflect the empirical characteristics of observed data. Following the principles of robust estimation, our methodology involves first examining the data to identify an appropriate family of statistical distributions for modeling the data, and then incorporating the corresponding maximum-likelihood estimation procedures into a decision tree algorithm. We have applied this methodology to insurance risk modeling and have obtained tree-based models superior to those obtained using conventional classification and regression tree algorithms.

Edwin P. D. Pednault, Chidanand Apte

Chapter 30. Classification Using Association Rules: Weaknesses and Enhancements

Abstract

Existing classification and rule learning algorithms in machine learning mainly use heuristic/greedy search to find a subset of regularities (e.g., a decision tree or a set of rules) in data for classification. In the past few years, extensive research was done in the database community on learning rules using exhaustive search under the name of association rule mining. The objective there is to find all rules in data that satisfy the user-specified minimum support and minimum confidence. Although the whole set of rules may not be used directly for accurate classification, effective and efficient classifiers have been built using the rules. This paper aims to improve such an exhaustive search based classification system CBA (Classification Based on Associations). The main strength of this system is that it is able to use the most accurate rules for classification. However, it also has weaknesses. This paper proposes two new techniques to deal with these weaknesses. This results in remarkably accurate classifiers. Experiments on a set of 34 benchmark datasets show that on average the new techniques reduce the error of CBA by 17% and are superior to CBA on 26 of the 34 datasets. They reduce the error of the decision tree classifier C4.5 by 19%, and improve performance on 29 datasets. Similar good results are also achieved against the existing classification systems, RIPPER, LB and a Naïve-Bayes classifier.

Bing Liu, Yiming Ma, Ching-Kian Wong
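The Chapter 30 abstract refers to rules that satisfy a user-specified minimum support and minimum confidence. As a small worked example of those two standard measures (not of CBA or the chapter's enhancements), the sketch below computes them for a single rule. The function name and the sample transactions are made up for illustration.

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    antecedent, consequent = frozenset(antecedent), frozenset(consequent)
    n_total = len(transactions)
    # Transactions that contain the antecedent.
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    # Transactions that contain both antecedent and consequent.
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / n_total
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

# Made-up sample data: items "a" and "b" plus a class label "yes" or "no".
labelled = [
    frozenset({"a", "b", "yes"}),
    frozenset({"a", "yes"}),
    frozenset({"a", "b", "no"}),
    frozenset({"b", "no"}),
]
support, confidence = rule_stats(labelled, {"a"}, {"yes"})
print(support, confidence)
```

Here the rule `{a} -> {yes}` holds in 2 of 4 transactions (support 0.5) and in 2 of the 3 transactions containing `a` (confidence 2/3); a miner would keep the rule only if both values meet the chosen thresholds.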

Title: Data Mining for Scientific and Engineering Applications
Edited by: Robert L. Grossman, Chandrika Kamath, Philip Kegelmeyer, Vipin Kumar, Raju R. Namburu
Publisher: Springer US
Electronic ISBN: 978-1-4615-1733-7
Print ISBN: 978-1-4020-0114-7
DOI: https://doi.org/10.1007/978-1-4615-1733-7
