2005 | Book

Foundations and Advances in Data Mining

Edited by: Professor Wesley Chu, Professor Tsau Young Lin

Publisher: Springer Berlin Heidelberg

Book series: Studies in Fuzziness and Soft Computing

About this book

With the growing use of information technology and recent advances in web systems, the amount of data available to users has increased exponentially. Thus, there is a critical need to understand the content of the data. As a result, data mining has become a popular research topic in recent years as a treatment for the "data rich and information poor" syndrome. This carefully edited volume presents a theoretical foundation as well as important new directions for data mining research. It brings together a set of well-respected data mining theoreticians and researchers with practical data mining experience. The presented theories will give data mining practitioners a scientific perspective on data mining and thus provide more insight into their problems, and the new data mining topics can be expected to stimulate further research in these important directions.

Table of contents

Frontmatter
The Mathematics of Learning: Dealing with Data *
Abstract
Learning is key to developing systems tailored to a broad range of data analysis and information extraction tasks. We outline the mathematical foundations of learning theory and describe one of its key algorithms.
T. Poggio, S. Smale
Logical Regression Analysis: From Mathematical Formulas to Linguistic Rules
Abstract
Data mining means the discovery of knowledge from (a large amount of) data, so data mining should provide not only predictions but also knowledge, such as rules, that is comprehensible to humans. Data mining techniques should therefore satisfy two requirements: accurate predictions and comprehensible rules.
H. Tsukimoto
A Feature/Attribute Theory for Association Mining and Constructing the Complete Feature Set
Abstract
A correct selection of features (attributes) is vital in data mining. To this end, the complete set of features is constructed. Here are some important results: (1) Isomorphic relational tables have isomorphic patterns. Such an isomorphism classifies relational tables into isomorphic classes. (2) A unique canonical model for each isomorphic class is constructed; the canonical model is the bitmap index or one of its variants. (3) All possible features (attributes) are generated in the canonical model. (4) Through the isomorphism theorem, all un-interpreted features of any table can be obtained.
T.Y. Lin
A New Theoretical Framework for K-Means-Type Clustering
Abstract
One of the fundamental clustering problems is to assign n points to k clusters based on the minimal sum-of-squares criterion (MSSC), a problem known to be NP-hard. In this paper, by using matrix arguments, we first model MSSC as a so-called 0-1 semidefinite program (SDP). The classical K-means algorithm can be interpreted as a special heuristic for the underlying 0-1 SDP. Moreover, the 0-1 SDP model can be further approximated by relaxed, polynomially solvable linear and semidefinite programs. This opens new avenues for solving MSSC. The 0-1 SDP model can be applied not only to MSSC but also to other clustering scenarios. In particular, we show that the recently proposed normalized k-cut and spectral clustering can also be embedded into the 0-1 SDP model in various kernel spaces.
J. Peng, Y. Xia
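As a rough illustration of the K-means heuristic mentioned in the abstract above, the following Python sketch alternates between assigning points to their nearest center and recomputing centers, which never increases the sum-of-squares objective. It does not implement the chapter's 0-1 SDP model; the function names `kmeans` and `sum_of_squares`, the random initialization, and the iteration cap are assumptions made for this example.

```python
import random


def sum_of_squares(points, centers, labels):
    """MSSC objective: total squared distance from each point to its assigned center."""
    return sum(
        sum((p - c) ** 2 for p, c in zip(point, centers[label]))
        for point, label in zip(points, labels)
    )


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd-style K-means on a list of equal-length numeric tuples (illustrative sketch)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: start from k distinct input points
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centers[j])))
            for point in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers are stable too
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return centers, labels


if __name__ == "__main__":
    toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    toy_centers, toy_labels = kmeans(toy_points, k=2)
    print(toy_labels, sum_of_squares(toy_points, toy_centers, toy_labels))
```

Because the procedure only finds a local minimum, the result can depend on the initial centers; that limitation is what motivates exact or relaxed formulations such as the one the abstract describes.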
Clustering Via Decision Tree Construction
Abstract
Clustering is an exploratory data analysis task. It aims to find the intrinsic structure of data by organizing data objects into similarity groups or clusters. It is often called unsupervised learning because no class labels denoting an a priori partition of the objects are given. This is in contrast with supervised learning (e.g., classification), for which the data objects are already labeled with known classes. Past research in clustering has produced many algorithms. However, these algorithms have some shortcomings. In this paper, we propose a novel clustering technique, which is based on a supervised learning technique called decision tree construction. The new technique is able to overcome many of these shortcomings. The key idea is to use a decision tree to partition the data space into cluster (or dense) regions and empty (or sparse) regions (which produce outliers and anomalies). We achieve this by introducing virtual data points into the space and then applying a modified decision tree algorithm for the purpose. The technique is able to find “natural” clusters in large high-dimensional spaces efficiently. It is suitable for clustering in the full-dimensional space as well as in subspaces. It also provides easily comprehensible descriptions of the resulting clusters. Experiments on both synthetic data and real-life data show that the technique is effective and also scales well for large high-dimensional datasets.
B. Liu, Y. Xia, P.S. Yu
Incremental Mining on Association Rules
Abstract
The discovery of association rules has been known to be useful in selective marketing, decision analysis, and business management. An important application area of mining association rules is market basket analysis, which studies the buying behaviors of customers by searching for sets of items that are frequently purchased together. With the increasing use of record-based databases to which data is continuously added, recent important applications have called for incremental mining. In dynamic transaction databases, new transactions are appended and obsolete transactions are discarded as time advances. Several research works have developed feasible algorithms for deriving precise association rules efficiently and effectively in such dynamic databases. On the other hand, approaches that generate approximations from data streams have received a significant amount of research attention recently. For each scheme, this chapter explores previously proposed algorithms, with examples that illustrate their concepts and techniques.
W.-G. Teng, M.-S. Chen
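The dynamic-database setting described in the abstract above, where new transactions are appended and obsolete ones are discarded, can be sketched with a fixed-size window whose itemset counts are updated in place instead of being recomputed from scratch. This is a minimal sketch and not one of the algorithms surveyed in the chapter; the class name `SlidingWindowCounter`, the window policy, and the cap on itemset size are all assumptions made for this example.

```python
from collections import Counter, deque
from itertools import combinations


class SlidingWindowCounter:
    """Keeps exact support counts for small itemsets over the most recent transactions."""

    def __init__(self, window_size, max_itemset_size=2):
        self.window = deque()
        self.window_size = window_size
        self.max_itemset_size = max_itemset_size  # hypothetical cap to bound enumeration
        self.counts = Counter()

    def _itemsets(self, transaction):
        # Enumerate every itemset of size 1..max_itemset_size as a sorted tuple.
        items = sorted(set(transaction))
        for size in range(1, self.max_itemset_size + 1):
            yield from combinations(items, size)

    def append(self, transaction):
        # Count the new transaction, then discard the oldest one if the window overflows.
        self.window.append(transaction)
        self.counts.update(self._itemsets(transaction))
        if len(self.window) > self.window_size:
            expired = self.window.popleft()
            self.counts.subtract(self._itemsets(expired))

    def frequent(self, min_support):
        # Itemsets whose count in the current window reaches min_support.
        return {itemset: n for itemset, n in self.counts.items() if n >= min_support}


if __name__ == "__main__":
    counter = SlidingWindowCounter(window_size=2)
    for basket in (["bread", "milk"], ["bread", "beer"], ["milk", "beer"]):
        counter.append(basket)
    print(counter.frequent(min_support=2))
```

Enumerating all small itemsets per transaction is only practical for short transactions; it is used here purely to show the add-and-expire bookkeeping.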
Mining Association Rules from Tabular Data Guided by Maximal Frequent Itemsets
Abstract
We propose the use of maximal frequent itemsets (MFIs) to derive association rules from tabular datasets. We first present an efficient method to derive MFIs directly from tabular data using the information from previous searches, known as tail information. Then we utilize the tabular format to derive MFIs, which can reduce the search space and the time needed for support counting. Tabular data allows us to use a spreadsheet as a user interface. The spreadsheet functions enable users to conveniently search and sort rules. To effectively present large numbers of rules, we organize rules into hierarchical trees from general to specific on the spreadsheet. Experimental results reveal that our proposed method of using tail information to generate MFIs yields significant improvements over conventional methods. Using inverted indices to compute supports for itemsets is faster than the hash tree counting method. We have applied the proposed technique to a set of tabular data that was collected from surgery outcomes and that contains a large number of dependent attributes. With this technique we were able to derive rules that assist physicians in their clinical decisions.
Q. Zou, Y. Chen, W.W. Chu, X. Lu
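The abstract above mentions computing itemset supports with inverted indices. A minimal sketch of that idea for tabular data follows, assuming each row is a dictionary and each item is an (attribute, value) pair; the function names and the toy rows are hypothetical and are not taken from the chapter.

```python
from collections import defaultdict


def build_inverted_index(rows):
    """Map each (attribute, value) item to the set of row ids that contain it."""
    index = defaultdict(set)
    for row_id, row in enumerate(rows):
        for attribute, value in row.items():
            index[(attribute, value)].add(row_id)
    return index


def support(index, itemset):
    """Number of rows containing every item in a non-empty itemset."""
    if not itemset:
        raise ValueError("itemset must contain at least one item")
    # Intersect the row-id sets; an item never seen contributes an empty set.
    row_id_sets = [index.get(item, set()) for item in itemset]
    return len(set.intersection(*row_id_sets))


if __name__ == "__main__":
    # Toy table invented for illustration only.
    toy_rows = [
        {"smoker": "yes", "age": "old", "outcome": "poor"},
        {"smoker": "yes", "age": "young", "outcome": "poor"},
        {"smoker": "no", "age": "old", "outcome": "good"},
        {"smoker": "no", "age": "young", "outcome": "good"},
    ]
    toy_index = build_inverted_index(toy_rows)
    print(support(toy_index, [("smoker", "yes"), ("outcome", "poor")]))
```

With the index in place, the support of any candidate itemset is a set intersection over row ids, so the table does not need to be rescanned for each candidate.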
Sequential Pattern Mining by Pattern-Growth: Principles and Extensions*
Abstract
Sequential pattern mining is an important data mining problem with broad applications. However, it is also a challenging problem since the mining may have to generate or examine a combinatorially explosive number of intermediate subsequences. Recent studies have developed two major classes of sequential pattern mining methods: (1) a candidate generation-and-test approach, represented by (i) GSP [30], a horizontal format-based sequential pattern mining method, and (ii) SPADE [36], a vertical format-based method; and (2) a sequential pattern growth method, represented by PrefixSpan [26] and its further extensions, such as CloSpan for mining closed sequential patterns [35].
J. Han, J. Pei, X. Yan
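To make the pattern-growth idea in the abstract above more concrete, here is a simplified sketch in the spirit of PrefixSpan. It assumes each sequence is a flat list (or string) of single items rather than a sequence of itemsets, so it is a reduced illustration and not the published algorithm; the function name `prefixspan` and its parameters are chosen for this example.

```python
def prefixspan(sequences, min_support):
    """Return {pattern tuple: support} for frequent subsequences of single items."""
    results = {}

    def grow(prefix, projected):
        # projected holds (sequence index, start position) pairs: the suffix of each
        # sequence that remains after the first match of the current prefix.
        counts = {}
        for seq_id, start in projected:
            seen = set()
            for item in sequences[seq_id][start:]:
                if item not in seen:  # count each item once per sequence
                    seen.add(item)
                    counts[item] = counts.get(item, 0) + 1
        for item, count in sorted(counts.items()):
            if count < min_support:
                continue
            pattern = prefix + (item,)
            results[pattern] = count
            # Project onto the suffix after the first occurrence of item.
            new_projected = []
            for seq_id, start in projected:
                seq = sequences[seq_id]
                for pos in range(start, len(seq)):
                    if seq[pos] == item:
                        new_projected.append((seq_id, pos + 1))
                        break
            grow(pattern, new_projected)

    grow((), [(i, 0) for i in range(len(sequences))])
    return results


if __name__ == "__main__":
    toy_sequences = ["abc", "acb", "ab"]  # toy data invented for illustration
    for pattern, count in sorted(prefixspan(toy_sequences, min_support=2).items()):
        print(pattern, count)
```

Each recursive call only scans the projected suffixes, so no candidate subsequences are generated and then tested against the full database, which is the contrast with the generation-and-test approach that the abstract draws.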
Web Page Classification*
Abstract
This chapter describes systems that automatically classify web pages into meaningful categories. It first defines two types of web page classification: subject-based and genre-based classification. It then describes the state-of-the-art techniques and subsystems used to build automatic web page classification systems, including web page representations, dimensionality reduction, web page classifiers, and the evaluation of web page classifiers. Such systems are essential tools for Web mining and for the future of the Semantic Web.
B. Choi, Z. Yao
Web Mining – Concepts, Applications and Research Directions
Abstract
From its very beginning, the potential of extracting valuable knowledge from the Web has been quite evident. Web mining, i.e. the application of data mining techniques to extract knowledge from Web content, structure, and usage, is the collection of technologies to fulfill this potential. Interest in Web mining has grown rapidly in its short history, both in the research and practitioner communities. This paper provides a brief overview of the accomplishments of the field, both in terms of technologies and applications, and outlines key future research directions.
T. Srivastava, P. Desikan, V. Kumar
Privacy-Preserving Data Mining
Abstract
The growth of data mining has raised concerns among privacy advocates. Some of this concern is based on a misunderstanding of what data mining does. The previous chapters have shown how data mining concentrates on the extraction of rules, patterns, and other such summary knowledge from large data sets. This would not seem to inherently violate privacy, which is generally concerned with the release of individual data values rather than summaries.
C. Clifton, M. Kantarcıoğlu, J. Vaidya
Metadata
Title
Foundations and Advances in Data Mining
Edited by
Professor Wesley Chu
Professor Tsau Young Lin
Copyright year
2005
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-32393-8
Print ISBN
978-3-540-25057-9
DOI
https://doi.org/10.1007/b104039