
1998 | Book

Research and Development in Knowledge Discovery and Data Mining

Second Pacific-Asia Conference, PAKDD-98 Melbourne, Australia, April 15–17, 1998 Proceedings

Edited by: Xindong Wu, Ramamohanarao Kotagiri, Kevin B. Korb

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


About this book

This book constitutes the refereed proceedings of the Second Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD-98, held in Melbourne, Australia, in April 1998. The book presents 30 revised full papers selected from a total of 110 submissions; also included are 20 poster presentations. The papers contribute new results to all current aspects of knowledge discovery and data mining, at the level of research as well as of systems development. Among the areas covered are machine learning, information systems, the Internet, statistics, knowledge acquisition, data visualization, software reengineering, and knowledge-based systems.

Table of Contents

Frontmatter
Knowledge acquisition for goal prediction in a multi-user adventure game

We present an approach to goal recognition which uses a Dynamic Belief Network to represent domain features needed to identify users' goals and plans. Different network structures have been developed, and their conditional probability distributions have been automatically acquired from training data. These networks show a high degree of accuracy in predicting users' goals. Our approach allows the use of incomplete, sparse and noisy data during both training and testing. We then apply simple learning techniques to learn significant actions in the domain. This speeds up the performance of the most promising dynamic belief networks without loss in predictive accuracy.

D. W. Albrecht, A. E. Nicholson, I. Zukerman
Hybrid data mining systems: The next generation

The promise of Hybrid Systems as the next generation of Data Mining systems is investigated. This work is motivated by the obvious limitations of paradigms for Data Mining that are being used in isolation. We present a classification of hybrid systems based on the level of interaction between the component paradigms and present example objectives for the development of such systems. We highlight possible hybrid solutions by discussing various hybrid systems that may be developed to enhance the Nearest Neighbour algorithm. These include statistical measures to enhance distance measures, a loose coupling with Neural Networks or alternatively a tight coupling with genetic algorithms, to discover attribute weights. We also establish enhancements to the k-NN that make it appropriate for use as a paradigm for addressing regression data mining goals. We provide results obtained using these systems, comparing them with more traditional paradigms used to solve regression goals within Data Mining.
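The k-NN enhancements mentioned above can be illustrated with a minimal sketch: a distance function with per-attribute weights (which, as the abstract suggests, might be discovered by a genetic algorithm or derived from statistical measures) and a mean-of-neighbours prediction for regression goals. All names and data here are invented for illustration, not taken from the paper.

```python
import math

def weighted_knn_predict(train, query, weights, k=3):
    """Predict a numeric target as the mean of the k nearest
    neighbours under an attribute-weighted Euclidean distance."""
    def dist(features):
        return math.sqrt(sum(w * (a - b) ** 2
                             for w, a, b in zip(weights, features, query)))
    neighbours = sorted(train, key=lambda row: dist(row[0]))[:k]
    return sum(target for _, target in neighbours) / k

# Toy data: (features, target).  A weight of 0 switches an attribute off,
# which is how learned weights can act as soft feature selection.
train = [((1.0, 0.0), 1.0), ((2.0, 9.0), 2.0),
         ((1.5, 0.1), 1.5), ((8.0, 0.2), 8.0)]
print(weighted_knn_predict(train, (1.2, 5.0), weights=(1.0, 0.0), k=2))
```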

Sarabjot S. Anand, John G. Hughes
Discovering case knowledge using data mining

The use of Data Mining in removing current bottlenecks within Case-based Reasoning (CBR) systems is investigated along with the possible role of CBR in providing a knowledge management back-end to current Data Mining systems. In particular, this paper discusses the use of Data Mining in two aspects of the MZ system [ANAN97a], namely, the acquisition of cases and discovery of adaptation knowledge. We discuss, in detail, the approach taken to discover cases and outline the methodology to discover adaptation knowledge. For case discovery, a Kohonen network is used to identify initial clusters within the database. These clusters are then analysed using C4.5 and non-unique clusters are grouped to form concepts. A regression tree induction algorithm is then used to ensure that the concepts are rich in information required to predict the dependent variable in the data set. Cases are then chosen from each of the identified concepts as well as outliers in the database. Initial results obtained in the acquisition of cases are presented and analysed. They indicate that the proposed approach achieves a high reduction in the size of the case base.

S. S. Anand, D. Patterson, J. G. Hughes, D. A. Bell
Discovery of association rules over ordinal data: A new and faster algorithm and its application to basket analysis

This paper argues that quantitative information like prices, amounts bought, and time can give valuable insights into consumer behavior. While Boolean association rules discard any quantitative information, existing algorithms for quantitative association rules can hardly be used for basket analysis. They either lack performance, are restricted to the two-dimensional case, or make questionable assumptions about the data. We propose a new and faster algorithm Q2 for the discovery of multi-dimensional association rules over ordinal data, which is based on ideas presented in [SA96]. Our new algorithm Q2 does not search for quantitative association rules from the very beginning. Instead Q2 prunes out a lot of candidates by first computing the frequent Boolean itemsets. After that, the frequent quantitative itemsets are found in a single pass over the data. In addition, a new absolute measure for the interestingness of quantitative association rules is introduced. It is based on the view that quantitative association rules have to be interpreted on the background of their Boolean generalizations. We experimentally compare the new algorithm against the previous approach, obtaining performance improvements of more than an order of magnitude on supermarket data. A rather astonishing result of this paper is that an additional run through the transactions does pay off when searching for quantitative association rules.
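The two-phase idea, pruning with a Boolean pass before touching quantities, can be sketched as follows. This is a minimal Apriori-style illustration of the Boolean phase only, not the Q2 algorithm itself; the baskets and item names are invented.

```python
from itertools import combinations

def frequent_boolean_itemsets(baskets, minsup):
    """Apriori-style frequent itemsets, ignoring quantities.
    Each basket maps item -> quantity; the Boolean pass only
    asks whether an item is present at all."""
    items = {i for b in baskets for i in b}
    freq, k = {}, 1
    current = [frozenset([i]) for i in sorted(items)]
    while current:
        counts = {c: sum(1 for b in baskets if c <= b.keys())
                  for c in current}
        level = {c: n for c, n in counts.items() if n >= minsup}
        freq.update(level)
        k += 1
        # Candidate generation plus the Apriori subset pruning step.
        cand = {a | b for a in level for b in level if len(a | b) == k}
        current = [c for c in cand
                   if all(frozenset(s) in level
                          for s in combinations(c, k - 1))]
    return freq

baskets = [{"beer": 2, "chips": 1}, {"beer": 6},
           {"beer": 1, "chips": 3}, {"milk": 1}]
print(frequent_boolean_itemsets(baskets, minsup=2))
```

Only itemsets surviving this pass would need quantitative refinement in the single subsequent pass the abstract describes.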

Oliver Büchter, Rüdiger Wirth
Effect of data skewness in parallel mining of association rules

An efficient parallel algorithm FPM (Fast Parallel Mining) for mining association rules on a shared-nothing parallel system is proposed. It adopts the count distribution approach and incorporates two powerful candidate pruning techniques, i.e., distributed pruning and global pruning. It has a simple communication scheme which performs only one round of message exchange in each iteration. We found that the two pruning techniques are very sensitive to data skewness, which describes the degree of non-uniformity of the itemset distribution among the database partitions. Distributed pruning is very effective when data skewness is high. Global pruning is more effective than distributed pruning even in the mild data skewness case. We have implemented the algorithm on an IBM SP2 parallel machine. The performance studies confirm our observation on the relationship between the effectiveness of the two pruning techniques and data skewness. They also show that FPM consistently outperforms CD (Count Distribution), a parallel version of the popular Apriori algorithm [2, 3]. Furthermore, FPM exhibits good speedup, scaleup and sizeup behaviour.

David W. Cheung, Yongqiao Xiao
Trend directed learning: A case study

Misleading results caused by low-quality data are a well-known problem in knowledge discovery in databases. Several techniques have been introduced to deal with the problem, including inexact learning strategies such as rough-set-based approaches and probabilistic approaches. This paper presents an approach for detecting trends using contribution functions. A trend-directed method for the discovery of knowledge structures from low-quality databases is described. The experimental results show that trend-directed methods are superior to other learning strategies, particularly when learning is performed on low-quality databases.

Honghua Dai
Interestingness of discovered association rules in terms of neighborhood-based unexpectedness

One of the central problems in knowledge discovery is the development of good measures of interestingness of discovered patterns. With such measures, a user needs to manually examine only the more interesting rules, instead of each of a large number of mined rules. Previous proposals of such measures include rule templates, minimal rule cover, actionability, and unexpectedness in the statistical sense or against user beliefs. In this paper we will introduce neighborhood-based interestingness by considering unexpectedness in terms of neighborhood-based parameters. We first present some novel notions of distance between rules and of neighborhoods of rules. The neighborhood-based interestingness of a rule is then defined in terms of the pattern of the fluctuation of confidences or the density of mined rules in some of its neighborhoods. Such interestingness can also be defined for sets of rules (e.g. plateaus and ridges) when their neighborhoods have certain properties. We can rank the interesting rules by combining some neighborhood-based characteristics, the support and confidence of the rules, and users' feedback. We discuss how to implement the proposed ideas and compare our work with related ones. We also give a few expected tendencies of changes due to rule structures, which should be taken into account when considering unexpectedness. We concentrate on association rules and briefly discuss generalization to other types of rules.
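One way a distance between rules could be formalized, stated here purely as an illustrative assumption and not as the authors' definitions, is as a weighted symmetric difference of antecedents and consequents; a neighborhood of a rule is then everything within some radius.

```python
def rule_distance(rule_a, rule_b, w_lhs=1.0, w_rhs=1.0):
    """Illustrative syntactic distance between rules X -> Y,
    each given as (antecedent set, consequent set)."""
    (xa, ya), (xb, yb) = rule_a, rule_b
    return (w_lhs * len(set(xa) ^ set(xb))
            + w_rhs * len(set(ya) ^ set(yb)))

def neighbourhood(rule, mined_rules, radius):
    """All other mined rules within `radius` of `rule`; interestingness
    could then look at confidence fluctuation or rule density here."""
    return [r for r in mined_rules
            if 0 < rule_distance(rule, r) <= radius]

rules = [({"a"}, {"b"}), ({"a", "c"}, {"b"}), ({"d"}, {"e"})]
print(neighbourhood(({"a"}, {"b"}), rules, radius=1))
```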

Guozhu Dong, Jinyan Li
Point estimation using the Kullback-Leibler loss function and MML
David L. Dowe, Rohan A. Baxter, Jonathan J. Oliver, Chris S. Wallace
Single factor analysis in MML mixture modelling

Mixture modelling concerns the unsupervised discovery of clusters within data. Most current clustering algorithms assume that variables within classes are uncorrelated. We present a method for producing and evaluating models which account for inter-attribute correlation within classes with a single Gaussian linear factor. The method used is Minimum Message Length (MML), an invariant, information-theoretic Bayesian hypothesis evaluation criterion. Our work extends and unifies that of Wallace and Boulton (1968) and Wallace and Freeman (1992), concerned respectively with MML mixture modelling and MML single factor analysis. Results on simulated data are comparable to those of Wallace and Freeman (1992), outperforming Maximum Likelihood. We include an application of mixture modelling with single factors on spectral data from the Infrared Astronomical Satellite. Our model shows fewer unnecessary classes than that produced by AutoClass (Goebel et al. 1989) due to the use of factors in modelling correlation.

Russell T. Edwards, David L. Dowe
Discovering associations in spatial data — An efficient medoid based approach

Spatial data mining is the discovery of novel and interesting relationships and characteristics that may exist implicitly in spatial databases. The identification of clusters coupled with Geographical Information System provides a means of information generalization. A variety of clustering approaches exists. A non-hierarchical method in data mining applications is the medoid approach. Many heuristics have been developed for this approach. This paper carefully analyses the complexity of hill-climbing heuristics for medoid based spatial clustering. Improvements to recently suggested heuristics like CLARANS are identified. We propose a novel idea, the stopping early of the heuristic search, and demonstrate that this provides large savings in computational time while the quality of the partition remains unaffected.

Vladimir Estivill-Castro, Alan T. Murray
Data mining using dynamically constructed recurrent fuzzy neural networks

Approaches to data mining proposed so far are mainly symbolic decision tree and numerical feedforward neural network methods. While decision trees give, in many cases, lower accuracy compared to feedforward neural networks, the latter show black-box behaviour, long training times, and difficulty in incorporating available knowledge. We propose to use an incrementally-generated recurrent fuzzy neural network which has the following advantages over the feedforward neural network approach: the ability to incorporate existing domain knowledge as well as to establish relationships from scratch, and shorter training time. The recurrent structure of the proposed method is able to account for temporal data changes, in contrast to both the feedforward neural network and decision tree approaches. It can be viewed as a gray box which incorporates the best features of both symbolic and numerical methods. The effectiveness of the proposed approach is demonstrated by experimental results on a set of standard data mining problems.

Yakov Frayman, Lipo Wang
CCAIIA: Clustering categorical attributes into interesting association rules

We investigate the problem of mining interesting association rules over a pair of categorical attributes at any level of data granularity. We do this by integrating the rule discovery process with a form of clustering. This allows associations between groups of items to be formed, where the grouping of items is based on maximising the “interestingness” of the associations discovered. Previous work on mining generalised associations assumes either a distance metric on the attribute values or a taxonomy over the items mined. These methods use the metric/taxonomy to limit the space of possible associations that can be found. We develop a measure of the interestingness of association rules based on support and the dependency between the item sets and use this measure to guide the search. We apply the method to a data set and observe the extraction of “interesting” associations. This method could allow interesting and unexpected associations to be discovered as the search space is not being limited by user defined hierarchies.

Brett Gray, M. E. Orlowska
Selective materialization: An efficient method for spatial data cube construction

On-line analytical processing (OLAP) has gained popularity in the database industry. With a huge amount of data stored in spatial databases and the introduction of spatial components to many relational or object-relational databases, it is important to study methods for spatial data warehousing and on-line analytical processing of spatial data. In this paper, we study methods for spatial OLAP, by integration of nonspatial on-line analytical processing (OLAP) methods with spatial database implementation techniques. A spatial data warehouse model, which consists of both spatial and nonspatial dimensions and measures, is proposed. Methods for computation of spatial data cubes and analytical processing on such spatial data cubes are studied, with several strategies proposed, including approximation and partial materialization of the spatial objects resulting from spatial OLAP operations. Some techniques for selective materialization of the spatial computation results are worked out, and the performance study has demonstrated the effectiveness of these techniques.

Jiawei Han, Nebojsa Stefanovic, Krzysztof Koperski
Mining market basket data using share measures and characterized itemsets

We propose the share-confidence framework for knowledge discovery from databases which addresses the problem of mining itemsets from market basket data. Our goal is two-fold: (1) to present new itemset measures which are practical and useful alternatives to the commonly used support measure; (2) to not only discover the buying patterns of customers, but also to discover customer profiles by partitioning customers into distinct classes. We present a new algorithm for classifying itemsets based upon characteristic attributes extracted from census or lifestyle data. Our algorithm combines the Apriori algorithm for discovering association rules between items in large databases, and the AOG algorithm for attribute-oriented generalization in large databases. We suggest how characterized itemsets can be generalized according to concept hierarchies associated with the characteristic attributes. Finally, we present experimental results that demonstrate the utility of the share-confidence framework.
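As an illustration of a share-style measure (a plausible sketch, not necessarily the paper's exact definition): the fraction of the total quantity sold that is contributed by an itemset's items in transactions containing the whole itemset. Unlike support, this accounts for how much of each item is bought.

```python
def itemset_share(transactions, itemset):
    """Share of `itemset`: quantity contributed by its items, within
    transactions containing the whole itemset, over total quantity.
    Transactions map item -> quantity purchased."""
    itemset = set(itemset)
    total = sum(q for t in transactions for q in t.values())
    contrib = sum(t[i]
                  for t in transactions if itemset <= t.keys()
                  for i in itemset)
    return contrib / total

# Invented toy transactions.
tx = [{"bread": 2, "jam": 1}, {"bread": 4}, {"jam": 3, "milk": 2}]
print(itemset_share(tx, {"bread", "jam"}))
```

Note that the support of {bread, jam} here is 1/3 of transactions, while its share weighs in the quantities actually bought.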

Robert J. Hilderman, Colin L. Carter, Howard J. Hamilton, Nick Cercone
Automatic visualization method for visual data mining

Data mining has drawn a great deal of attention as a technique for informing effective business strategies. Data mining derives effective patterns and rules from a large amount of data to support the decision making process. Although various approaches have been attempted in several research fields, we focus on a visual data mining support system from the viewpoint of harnessing the perceptual and cognitive capabilities of the human user. In visual data mining support, defining the visualization model is important for obtaining effective patterns. The main point of the definition is how to find useful data attributes as the visualization target. This paper proposes a visual data mining support system based on two key concepts: an automatic target-attribute selection method using a decision tree or a correlation-coefficient matrix, and automatic scene-definition generation methods. Application examples are presented and evaluated.

Yuichi Iizuka, Hisako Shiohara, Tetsuya Iizuka, Seiji Isobe
Rough-set inspired approach to knowledge discovery in business databases

We present an approach to knowledge discovery in databases that is based on the idea of the attribute space partition. An inspiration for this approach was the methodology of rough set theory. Two systems, ProbRough and TRANCE, which are representative of this approach, are capable of inducing decision rules from databases with a practically unlimited number of objects. The beam search strategy of ProbRough is guided by the global cost criterion and leads to inducing rough classifiers which are mainly intended for making decisions concerning new unseen objects. In the case of TRANCE, the exhaustive search strategy in a space of user pre-specified models is guided by a criterion expressed in terms of local properties of the model. The relationships between values of attributes and decisions, detected by both systems, are presented in the form of rules that are easily understood by humans. The presented approach to knowledge discovery is illustrated on two real-world examples from database marketing. The rules induced by ProbRough and TRANCE provided much useful information on customer behavior patterns and on the phenomenon of customer retention.

W. Kowalczyk, Z. Piasta
Representative association rules

Discovering association rules between items in a large database is an important database mining problem. The number of association rules may be huge. In this paper, we define a cover operator that logically derives a set of association rules from a given association rule. Representative association rules are defined as a least set of rules that covers all association rules satisfying certain user specified constraints. A user may be provided with a set of representative association rules instead of the whole set of association rules. The association rules which are not representative ones may be generated on demand by means of the cover operator. In this paper, we present an algorithm for computing representative association rules.
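A sketch of such a cover operator, using one common formalization stated here as an assumption: from X → Y one can derive every rule X∪Z → V with Z and V disjoint subsets of Y and V non-empty, since each derived rule holds with at least the support and confidence of the original.

```python
from itertools import combinations

def cover(lhs, rhs):
    """All rules derivable from lhs -> rhs under the assumed cover
    operator: (lhs ∪ Z) -> V, with Z, V disjoint subsets of rhs
    and V non-empty.  Rules are (frozenset, frozenset) pairs."""
    rhs = set(rhs)
    derived = set()
    for k in range(len(rhs) + 1):
        for z in combinations(sorted(rhs), k):
            rest = rhs - set(z)
            for m in range(1, len(rest) + 1):
                for v in combinations(sorted(rest), m):
                    derived.add((frozenset(lhs) | set(z), frozenset(v)))
    return derived

for l, r in sorted(cover({"a"}, {"b", "c"}),
                   key=lambda p: (sorted(p[0]), sorted(p[1]))):
    print(sorted(l), "->", sorted(r))
```

For a → bc this yields five rules (a → b, a → c, a → bc, ab → c, ac → b), showing how one representative rule stands in for several mined ones.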

Marzena Kryszkiewicz
Identifying relevant databases for multidatabase mining

Various tools and systems for knowledge discovery and data mining have been developed and are available for applications. However, when we are immersed in heaps of databases, an immediate question facing practitioners is where we should start mining. In this paper, breaking away from the conventional data mining assumption that many databases be joined into one, we argue that the first step for multidatabase mining is to identify databases that are most likely relevant to an application; without doing so, the mining process can be lengthy, aimless and ineffective. A relevance measure is thus proposed to identify relevant databases for mining tasks whose objective is to find patterns or regularities concerning certain attributes. An efficient implementation for identifying relevant databases is described. Experiments are conducted to validate the measure's performance and to show its promising applications.

Huan Liu, Hongjun Lu, Jun Yao
Minimum message length segmentation

The segmentation problem arises in many applications in data mining, A.I. and statistics, including segmenting time series, decision tree algorithms and image processing. In this paper, we consider a range of criteria which may be applied to determine whether some data should be segmented into two or more regions. We develop an information-theoretic criterion (MML) for the segmentation of univariate data with Gaussian errors. We perform simulations comparing segmentation methods (MML, AIC, MDL and BIC) and conclude that the MML criterion is the preferred criterion. We then apply the segmentation method to financial time series data.
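The one-versus-two-segment decision can be sketched with a simplified surrogate: score each model by Gaussian negative log-likelihood plus a BIC-flavoured penalty per extra segment. This stands in for, and is not, the MML formula developed in the paper; the data are invented.

```python
import math

def nll(xs):
    """Gaussian negative log-likelihood of a segment (ML estimates)."""
    n = len(xs)
    mu = sum(xs) / n
    var = max(sum((x - mu) ** 2 for x in xs) / n, 1e-12)
    return 0.5 * n * (math.log(2 * math.pi * var) + 1)

def best_split(xs, penalty):
    """Return (cost, cut): accept a two-segment model only if its
    total NLL plus the per-segment penalty beats the unsegmented
    cost; cut is None when one segment wins."""
    best = (nll(xs), None)
    for i in range(2, len(xs) - 1):   # keep >= 2 points per segment
        cost = nll(xs[:i]) + nll(xs[i:]) + penalty
        if cost < best[0]:
            best = (cost, i)
    return best

data = [0.0, 0.1, -0.1, 0.05, 5.0, 5.1, 4.9, 5.05]
cost, cut = best_split(data, penalty=2 * math.log(len(data)))
print(cut)
```

On this data the criterion places the cut between the two level regimes; with a large enough penalty the unsegmented model would win instead.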

Jonathan J. Oliver, Rohan A. Baxter, Chris S. Wallace
Bayesian classification trees with overlapping leaves applied to credit-scoring

We develop a Bayesian procedure for classification with trees by switching between different model structures. For classification trees with overlap we use a Markov chain Monte Carlo procedure to produce an ensemble of trees which allow the assessment of prediction uncertainty and the value of new information. The approach is applied to a large credit scoring application.

Gerhard Paass, Jörg Kindermann
Contextual text representation for unsupervised knowledge discovery in texts

This paper studies the role of lexical contextual relations for the problem of unsupervised knowledge discovery in full texts. Narrative texts have inherent structure dictated by the language usage that generated them. We suggest that the relative distance of terms within a text gives sufficient information about its structure and its relevant content. Furthermore, this structure can be used to discover implicit knowledge embedded in the text, therefore serving as a good candidate to represent effectively the text content for knowledge elicitation tasks. We qualitatively demonstrate that a useful text structure and content can be systematically extracted by collocational lexical analysis without the need to encode any supplemental sources of knowledge. We present an algorithm that systematically extracts the most relevant facts in the texts and labels them by their overall theme, dictated by local contextual information. It exploits domain independent lexical frequencies and mutual information measures to find the relevant contextual units in the texts. We report results from experiments on a real-world textual database of psychiatric evaluation reports.

Patrick Perrin, Fred Petry
Treatment of missing values for association rules

Agrawal et al. [2] have proposed a fast algorithm to explore very large transaction databases with association rules [1]. In many real world applications data are managed in relational databases, where missing values are often inevitable. We will show that, in this case, association rule algorithms give bad results because they have been developed for transaction databases, where the problem of missing values practically does not exist. In this paper, we propose a new approach to mine association rules in relational databases containing missing values. The main idea is to cut a database into several valid databases (vdb); for a rule, a vdb must not have any missing values. We redefine the support and confidence of rules for vdbs. These definitions are fully compatible with the classical ones.

Arnaud Ragel, Bruno Crémilleux
Mining regression rules and regression trees

We propose a new type of regression rules to represent the conditional functional relationship between a response variable and p numeric-valued explanatory variables, conditioning on values of a set of categorical variables. Regression rules are ideal for representing relationships existing in mixtures of categorical and numeric data. A set of regression rules can also be presented in the form of a tree graph, called the regression tree, to assist understanding, interpreting, and applying these rules. We also introduce a process for mining regression rules from data stored in a relational database. This process uses the concept of multivariate and multidimensional OLAP to minimize operations for source data retrieval, and uses homogeneity tests to reduce the size of the search space. Thus, it can be used to support mining regression rules in an efficient manner in the context of very large databases.

Buh-Yun Sher, Shin-Chung Shao, Wen-Shyong Hsieh
Mining algorithms for sequential patterns in parallel: Hash-based approach

In this paper, we study the problem of mining sequential patterns in a large database of customer transactions. Since finding sequential patterns requires handling a large amount of customer transaction data and multiple passes over the database, it is expected that parallel algorithms help to improve the performance significantly. We consider parallel algorithms for mining sequential patterns in a shared-nothing environment. Three parallel algorithms (Non-Partitioned Sequential Pattern Mining (NPSPM), Simply Partitioned Sequential Pattern Mining (SPSPM) and Hash Partitioned Sequential Pattern Mining (HPSPM)) are proposed. In NPSPM, the candidate sequences are simply copied among all the nodes, which can lead to memory overflow for large databases. The remaining two algorithms partition the candidate sequences over the nodes, which can efficiently exploit the total system memory as the number of nodes is increased. If the candidates are partitioned simply, customer transaction data has to be broadcast to all nodes. HPSPM partitions the candidate sequences among the nodes using a hash function, which eliminates the broadcasting of customer transaction data and reduces the comparison workload. We describe the implementation of these algorithms on a shared-nothing IBM SP2 parallel computer and present performance evaluation results. Among the three algorithms, HPSPM attains the best performance.
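The hash-partitioning idea can be sketched as follows (illustrative only; a stable hash is used so that every node independently agrees on each candidate's owner, which is what removes the need to broadcast transaction data):

```python
import hashlib

def node_of(candidate, num_nodes):
    """Stable hash of a candidate sequence (a tuple of item names),
    so all nodes compute the same owner for the same candidate."""
    digest = hashlib.md5("|".join(candidate).encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_nodes

def assign_candidates(candidates, num_nodes):
    """HPSPM-style placement: each candidate sequence lives on exactly
    one node, so no node needs the whole candidate set in memory."""
    partitions = [[] for _ in range(num_nodes)]
    for cand in candidates:
        partitions[node_of(cand, num_nodes)].append(cand)
    return partitions

candidates = [("a", "b"), ("a", "c"), ("b", "c"), ("a", "b", "c")]
for node, part in enumerate(assign_candidates(candidates, 3)):
    print(node, part)
```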

Takahiko Shintani, Masaru Kitsuregawa
Wavelet transform in similarity paradigm

Searching for similarity in time series is finding ever broader applications in data mining. However, due to the very broad spectrum of data involved, no single notion of similarity is suitable to serve all applications. We present a powerful framework based on wavelet decomposition, which allows designing and implementing a variety of criteria for the evaluation of similarity between time series. As an example, two main classes of similarity measures are considered. One is the global, statistical similarity, which uses the wavelet-transform-derived Hurst exponent to classify time series according to their global scaling properties. The second measure estimates similarity locally using the scale-position bifurcation representation.

Zbigniew R. Struzik, Arno Siebes
Improved rule discovery performance on uncertainty

In this paper we describe an improved version of a novel rule induction algorithm, namely ILA. We first outline the basic algorithm and then show how it is enhanced with a new evaluation metric that handles uncertainty in a given data set. In addition to providing faster induction than the original, our main contribution is a new metric that allows users to express their preferences through a penalty factor. We use this penalty factor to tackle the over-fitting bias that is inherent in a great many inductive algorithms. We compare the improved algorithm, ILA-2, to a variety of induction algorithms, including ID3, OC1, C4.5, CN2, and ILA. Our preliminary experiments show that the algorithm is comparable to well-known algorithms such as CN2 and C4.5 in terms of accuracy and size.

Mehmet R. Tolun, Hayri Sever, Mahmut Uludağ
Feature mining and mapping of collinear data

Collinear data such as spectra and time-varying signals are very high-dimensional and are characterized by having highly correlated, context-dependent localized structures. Feature mining involves extracting the important local features whilst, at the same time, retaining as much information as possible and facilitating the automated analysis and interpretation of the data. We present a novel wavelet-based feature mining approach which extracts the optimal features for a particular application. An automated search is performed for the wavelet which optimizes specified multivariate modeling criteria. In this paper we consider mapping analysis as the multivariate model and show how wavelets are able to elucidate the underlying group structure in the data.

O. de Vel, D. Coomans, S. Patrick
Knowledge discovery in discretionary legal domains

Significant obstacles must be overcome if knowledge discovery techniques are to be applied in the legal domain. In this paper we argue that in order to use knowledge discovery in the legal domain it is essential to use domain expertise, and important that an abundance of commonplace cases is available. Even with appropriate data, data mining techniques in law must deal with contradictory cases and use statistical techniques in order to define error and estimate performance. We illustrate these points by describing our own error heuristic and the method we use for dealing with contradictions in the training of neural networks in the domain of property proceedings in Australian Family Law. In law, an explanation for a decision reached is often more important than the decision. We advocate the use of a theory of argumentation developed by the British philosopher Stephen Toulmin to provide explanations to support the outcomes predicted by our knowledge discovery system Split Up. We also discuss the use of genetic algorithms to minimise the number of features our knowledge discovery system must use.

John Zeleznikow, Andrew Stranieri
Scaling up the rule generation of C4.5

C4.5 is the most well-known inductive learning algorithm. It can be used to build decision trees as well as production rules. Production rules are a very common formalism for representing and using knowledge in many real-world domains. C4.5 generates production rules from raw trees. It has been shown that the set of production rules is usually both simpler and more accurate than the decision tree from which the ruleset was formed. This research shows that generating production rules from pruned trees usually results in significantly simpler rulesets than generating rules from raw trees. This reduction in complexity is achieved without reducing prediction accuracies. Furthermore, the new approach requires significantly less induction time than generating rules from raw trees. This paper uses experiments in a wide variety of natural domains to illustrate these points. It also shows that the new method scales up better than the old one in terms of ruleset size, number of rules, and learning time as the training set size increases. This is an important characteristic for learning algorithms used for data mining.

Zijian Zheng
Data mining based on the generalization distribution table and rough sets

This paper introduces a new approach for mining if-then rules in databases with uncertainty and incompleteness. This approach is based on the combination of Generalization Distribution Table (GDT) and the rough set methodology. The GDT provides a probabilistic basis for evaluating the strength of a rule. It is used to find the rules with larger strengths from possible rules. Furthermore, the rough set methodology is used to find minimal relative reducts from the set of rules with larger strengths. The strength of a rule represents the uncertainty of the rule, which is influenced by both unseen instances and noises. By using our approach, a minimal set of rules with larger strengths can be acquired from databases with noisy, incomplete data. We have applied this approach to discover rules from some real databases.

Ning Zhong, Juzhen Dong, Setsuo Ohsuga
Constructing personalized information agents

The explosion of the World Wide Web as a global information network brings a number of related challenges for information agents. In this poster, we propose an architecture for constructing a personal information agent, which consists of a Web-searching assistant and an interest-domain manager responsible for constructing the search domains for the user. Within this infrastructure, other information agents can be integrated into the environment to discover new information.

Chia-Hui Chang, Chang-Chi Hsu
Towards real time discovery from distributed information sources
V. Cho, B. Wüthrich
Constructing conceptual scales in formal concept analysis

Formal concept analysis (FCA)

Richard Cole, Peter Eklund, Don Walker
The hunter and the hunted — Modelling the relationship between web pages and search engines

The major objective of a World Wide Web (WWW) search engine is to uncover WWW pages relevant to a search. The original purpose of a page might have been to inform, but since the advent of search engines there has been another imperative: to catch the engines' attention and be rated as highly as possible. In turn, the search engines will have to become ever more clever at filtering out gratuitous promotion. This race is an ongoing evolutionary process.

David L. Dowe, Lloyd Allison, Glen Pringle
An efficient global discretization method
K. M. Ho, P. D. Scott
Learning user preferences on the WEB

We present a new tool called INDWEB, based on Inductive Logic Programming, that can learn concepts characterizing the pages a user or a group of users finds interesting, with respect to a set of criteria on the pages as well as on the users themselves.

François Jacquenet, Patrice Brenot
Using rough sets for knowledge discovery in the development of a decision support system for issuing smog alerts
Ilona Jagielska
Empirical results on data dimensionality reduction using the divided self-organizing map
Takamasa Koshizen, Hiroaki Ogawa, John Fulcher
Mining association rules with linguistic cloud models
Deyi Li, Kaichang Di, Deren Li, Xuemei Shi
A data mining approach for query refinement

We propose a data mining approach for query refinement using Association Rules (ARs) among keywords extracted from a document database. We address two issues in this paper. First, instead of minimum support and minimum confidence, which are of little use for screening documents, we use maximum support and maximum confidence. Second, to further reduce the number of rules, we introduce two related concepts: “stem rule” and “coverage”. The effectiveness of using ARs to screen documents is reported as well.

Ye Liu, Hanxiong Chen, Jeffrey Xu Yu, Nobuo Ohbo
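The maximum-support/maximum-confidence idea can be sketched on toy data (the document contents and threshold values below are invented for this sketch): keeping only rules that stay *below* the caps singles out rare, discriminating keyword combinations, which are better at narrowing a query than the frequent combinations classical minimum thresholds favour.

```python
from itertools import permutations

# Toy keyword sets per document; contents invented for illustration.
docs = [
    {"data", "mining", "rules"},
    {"data", "mining", "query"},
    {"data", "query", "refinement"},
    {"mining", "rules"},
    {"data", "mining"},
]

def keyword_rules(docs, max_support, max_confidence):
    """One-keyword -> one-keyword rules, kept only if BOTH support and
    confidence stay at or below the caps (the reverse of the usual minimums)."""
    n = len(docs)
    vocab = set().union(*docs)
    kept = []
    for a, b in permutations(vocab, 2):
        together = sum(1 for d in docs if a in d and b in d)
        alone = sum(1 for d in docs if a in d)
        if together == 0:
            continue
        support = together / n
        confidence = together / alone
        if support <= max_support and confidence <= max_confidence:
            kept.append((a, b, support, confidence))
    return kept

for a, b, s, c in keyword_rules(docs, max_support=0.4, max_confidence=0.5):
    print(f"{a} -> {b}  support={s:.2f} confidence={c:.2f}")
```

On this data the ubiquitous pair "data"/"mining" is filtered out by the support cap, while rarer, more discriminating rules such as "data" -> "rules" survive.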
CFMD: A conflict-free multivariate discretization algorithm
Extended abstract
Ying Lu, Huan Liu, Chew Lim Tan
Characteristic rule induction algorithm for data mining
Akira Maeda, Hideyuki Maki, Hiroyuki Akimori
Data-mining massive time series astronomical data sets — A case study
Michael K. Ng, Zhexue Huang, Markus Hegland
Multiple databases, partial reasoning, and knowledge discovery

We consider multiple sources of information, the sets of sentences they provide, and the resulting multiple partial theories. The set of all theories is equipped with an information ordering and forms a lattice. We relate this framework to knowledge discovery.

Chris Nowak
Design recovery with data mining techniques
Carlos Montes de Oca, Doris L. Carver
The CLARET algorithm
Adrian R. Pearce, Terry Caelli
LRTree: A hybrid technique for classifying myocardial infarction data containing unknown attribute values
Christine L. Tsien, Hamish S. F. Fraser, Isaac S. Kohane
Modelling decision tables from data

On most datasets, induction algorithms can generate very accurate classifiers. Sometimes, however, these classifiers are very hard for humans to understand. In this paper we therefore investigate how the extracted knowledge can be presented to the user by means of decision tables. Decision tables are very easy to understand, and they provide interesting facilities for checking the extracted knowledge for consistency and completeness. We demonstrate how a consistent and complete decision table can be modelled starting from raw data. Because the modelled decision tables are sufficiently small, they allow easy consultation of the represented knowledge.

G. Wets, J. Vanthienen, H. Timmermans
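The consistency and completeness checks that make decision tables attractive can be sketched directly (the condition attributes, value domains, and actions below are invented for illustration): a table is complete when every combination of condition values is covered by some row, and consistent when no combination maps to two different actions.

```python
from itertools import product

# Condition attributes and their value domains (invented for this sketch).
domains = {"age": ("young", "old"), "income": ("low", "high")}

# A decision table as rows of (condition values, action).
table = [
    (("young", "low"), "reject"),
    (("young", "high"), "accept"),
    (("old", "low"), "reject"),
    (("old", "high"), "accept"),
]

def check_table(domains, rows):
    """Return (complete, consistent) for a decision table.

    Complete: every combination of condition values appears in some row.
    Consistent: no combination of condition values maps to two actions."""
    combos = set(product(*domains.values()))
    seen = {}
    consistent = True
    for conds, action in rows:
        if conds in seen and seen[conds] != action:
            consistent = False
        seen[conds] = action
    complete = combos <= set(seen)
    return complete, consistent

print(check_table(domains, table))  # (True, True)
```

Dropping a row makes the table incomplete; adding a second row for an existing condition combination with a different action makes it inconsistent, which is exactly what such checks are meant to flag.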
A classification and relationship extraction scheme for relational databases based on fuzzy logic
M. Vazirgiannis
Mining association rules for estimation and prediction
Takashi Washio, Hiroki Matsuura, Hiroshi Motoda
Rule generalization by condition combination

This paper introduces a condition-combination approach to generating rules from a decision table. We first describe the concepts of negative condition and condition combination on which our work is based, then the properties that are important for designing our algorithms. The algorithms are analyzed with respect to their time complexity and their dependence on attribute ordering.

Hanhong Xue, Qingsheng Cai
Backmatter
Metadata
Title
Research and Development in Knowledge Discovery and Data Mining
Edited by
Xindong Wu
Ramamohanarao Kotagiri
Kevin B. Korb
Copyright year
1998
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-69768-8
Print ISBN
978-3-540-64383-8
DOI
https://doi.org/10.1007/3-540-64383-4