
1997 | Book

Principles of Data Mining and Knowledge Discovery

First European Symposium, PKDD '97 Trondheim, Norway, June 24–27, 1997 Proceedings

Edited by: Jan Komorowski, Jan Zytkow

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science

About this book

This book constitutes the refereed proceedings of the First European Symposium on Principles of Data Mining and Knowledge Discovery, PKDD '97, held in Trondheim, Norway, in June 1997.
The volume presents a total of 38 revised full papers together with abstracts of one invited talk and four tutorials. Among the topics covered are data and knowledge representation, statistical and probabilistic methods, logic-based approaches, man-machine interaction aspects, AI contributions, high performance computing support, machine learning, automated scientific discovery, quality assessment, and applications.

Table of Contents

Frontmatter
Knowledge discovery — A control theory perspective

Knowledge discovery in databases is a label for an activity performed in a wide variety of application domains within the science and business communities, as well as for pleasure; see Fayyad, Haussler & Stolorz (1996) and Imielinski & Mannila (1996). The activity uses a large and heterogeneous data-set as a basis for synthesizing new and relevant knowledge. The knowledge is new because hidden relationships within the data are explicated and/or data is combined with prior knowledge to elucidate a given problem. The term relevant is used to emphasize that knowledge discovery is a goal-driven process in which knowledge is constructed to facilitate the solution of a problem.

Knowledge discovery may be viewed as a process containing many tasks, as discussed in Fayyad, Piatetsky-Shapiro & Smyth (1996). Some of these tasks are well understood, while others depend on human judgement in an implicit manner. Further, the process is characterized by heavy iteration between the tasks. This is very similar to many creative engineering processes, e.g. the development of dynamic models as discussed in the industrial field study by Foss, Lohmann & Marquardt (1997). In this reference, mechanistic (or first-principles-based) models are emphasized and the tasks involved in model development are defined by:

1. Initial data collection and problem formulation. The initial data are collected and some more or less precise formulation of the modeling problem is developed.
2. Tools selection. The software tools to support modeling and allow simulation are selected.
3. Conceptual modeling. The system to be modeled, e.g. a chemical reactor, a power generator, or a marine vessel, is first abstracted. The essential compartments and the dominant phenomena occurring are identified and documented for later reuse.
4. Model representation. A representation of the system model is generated. Often, equations are used; however, a graphical block diagram (or any other formalism) may alternatively be used, depending on the modeling tools selected above.
5. Implementation. The model representation is implemented using the means provided by the modeling system of the software employed. These may range from general programming languages to equation-based modeling languages or graphical block-oriented interfaces.
6. Verification. The model implementation is verified to really capture the intent of the modeler. No simulations for the actual problem to be solved are carried out for this purpose.
7. Initialization. Reasonable initial values are provided or computed, and the numerical solution process is debugged.
8. Validation. The results of the simulation are validated against some reference, ideally against experimental data.
9. Documentation. The modeling process, the model, and the simulation results during validation and application of the model are documented.
10. Model application. The model is used in some model-based process engineering problem solving task.

For other model types, like neural network models where data-driven knowledge is utilized, the modeling process will be somewhat different. Some of the tasks, like the conceptual modeling phase, will vanish. In this paper both black-box (or data-driven) models (Ljung 1987), (Ljung 1991), and mechanistic models (Marquardt 1996) will be discussed. Typical application areas for dynamic models are control, prediction, planning, and fault detection and diagnosis. A major deficiency of today's methods is the lack of ability to utilize a wide variety of knowledge.
As an example, a black-box model structure has very limited ability to utilize first-principles knowledge of a problem. This has provided a basis for developing different hybrid schemes. Two hybrid schemes will highlight the discussion. First, it will be shown how a mechanistic model can be combined with a black-box model to represent a pH neutralization system efficiently (Johansen & Foss 1993). Second, the combination of continuous and discrete control inputs is considered, utilizing a two-tank example as a case. Different approaches to handle this heterogeneous case are considered (Slupphaug, Vada & Foss 1997). The hybrid approach may be viewed as a means to integrate different types of knowledge, i.e. being able to utilize a heterogeneous knowledge base to derive a model. Standard practice today is that methods and software can treat large homogeneous data-sets. A typical example of a homogeneous data-set is time-series data from some system, e.g. temperature, pressure and composition measurements over some time frame provided by the instrumentation and control system of a chemical reactor. If textual information of a qualitative nature is provided by plant personnel, the data becomes heterogeneous. The above discussion will form the basis for analyzing the interaction between knowledge discovery, and modeling and identification of dynamic models. In particular, we will be interested in identifying how concepts from knowledge discovery can enrich the state of the art within control, prediction, planning and fault detection and diagnosis of dynamic systems.

Bjarne A. Foss
Modelling customer retention with Rough Data Models

Banks, as many other companies, try to develop a long-term relationship with their clients. When a client decides to move to another bank, it usually implies some financial losses. Therefore, banks are very interested in identifying the mechanisms behind such decisions and in determining which clients are about to leave the given bank. One way of getting such insight is to analyse historical data that describe customer behaviour in the past. In this paper we present a methodology and some results of an analysis of a large data set provided by a big mutual fund investment company. Our approach, based on the concept of the Rough Data Model [7], resulted in the identification of key factors that influence customer retention. Moreover, a number of rules that characterise various groups of clients have been generated. Our results have been highly appreciated by the company and led to specific actions aimed at increasing customer retention.

Wojciech Kowalczyk, Frank Slisser
Share based measures for itemsets

We introduce the measures share, coincidence and dominance as alternatives to the standard itemset methodology measure of support. An itemset is a group of items bought together in a transaction. The support of an itemset is the ratio of transactions containing the itemset to the total number of transactions. The share of an itemset is the ratio of the count of items purchased together to the total count of items in all transactions. The coincidence of an itemset is the ratio of the count of items in that itemset to the total of those same items in the database. The dominance of an item in an itemset specifies the extent to which that item dominates the total of all items in the itemset. Share based measures have the advantage over support of reflecting accurately how many units are being moved by a business. The share measure can be extended to quantify the financial impact of an itemset on the business.

Colin L. Carter, Howard J. Hamilton, Nick Cercone
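As an illustration of the measures defined in the abstract above, here is a minimal sketch (not taken from the paper; the function names and toy transactions are illustrative only) that computes support and share for an itemset; coincidence and dominance follow the same counting pattern.

```python
# Support: fraction of transactions containing every item of the itemset.
# Share: count of the itemset's items purchased together, divided by the
# total count of all items in all transactions.
def support(itemset, transactions):
    hits = sum(1 for t in transactions if all(i in t for i in itemset))
    return hits / len(transactions)

def share(itemset, transactions):
    together = sum(sum(t[i] for i in itemset)
                   for t in transactions if all(i in t for i in itemset))
    total = sum(sum(t.values()) for t in transactions)
    return together / total

# Toy data: each transaction maps an item to the quantity purchased.
transactions = [
    {"bread": 2, "milk": 1},
    {"bread": 1, "milk": 3, "eggs": 6},
    {"eggs": 12},
]
print(support({"bread", "milk"}, transactions))  # 2/3
print(share({"bread", "milk"}, transactions))    # 7/25
```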
Parallel knowledge discovery using domain generalization graphs

Multi-Attribute Generalization is an algorithm for attribute-oriented induction in relational databases using domain generalization graphs. Each node in a domain generalization graph represents a different way of summarizing the domain values associated with an attribute. When generalizing a set of attributes, we show how a serial implementation of the algorithm generates all possible combinations of nodes from the domain generalization graphs associated with the attributes, resulting in the presentation of all possible generalized relations for the set. We then show how the inherent parallelism in domain generalization graphs is exploited by a parallel implementation of the algorithm. Significant speedups were obtained using our approach when large discovery tasks were partitioned across multiple processors. The results of our work enable a database analyst to quickly and efficiently analyze the contents of a relational database from many different perspectives.

Robert J. Hilderman, Howard J. Hamilton, Robert J. Kowalchuk, Nick Cercone
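To make the combination step concrete, the following is a small sketch under our own assumptions (the toy graphs and the partitioning helper are hypothetical, not the authors' implementation): every choice of one node per attribute from the domain generalization graphs yields one candidate generalized relation, and the candidates can be dealt out across processors.

```python
from itertools import product

# Hypothetical domain generalization graphs: each attribute maps to the list
# of nodes (ways of summarizing its domain values).
dggs = {
    "city": ["city", "province", "country", "ANY"],
    "date": ["day", "month", "year", "ANY"],
}

# Serial view: every combination of one node per attribute is one candidate
# generalized relation to be computed and presented.
combinations = list(product(*dggs.values()))
print(len(combinations))  # 16 candidates

# One simple way to exploit the inherent parallelism: partition the
# candidates over n workers, each generalizing its chunk independently.
def partition(items, n):
    return [items[i::n] for i in range(n)]

chunks = partition(combinations, 4)
print([len(c) for c in chunks])  # [4, 4, 4, 4]
```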
Rough set theory and rule induction techniques for discovery of attribute dependencies in medical information systems

Problems connected with applications of the rough set theory to identifying the most important attributes and with the induction of decision rules from a medical data set are discussed in this paper. The medical data set concerns patients with multiple injuries. The direct use of the original rough set model leads to finding too many possibilities of reducing the input data. To overcome this difficulty, a new approach integrating rough set theory, rule induction and statistical techniques is introduced. First, the Chi-square test is additionally performed in order to reject non-significant attributes. Then, starting from the remaining attributes, we try to construct definitions of new attributes that improve the finally discovered decision rules. The results have shown that the proposed approach integrating all methods has given better results than those obtained by applying the original rough set method.

Jerzy Stefanowski, Krzysztof Slowiński
Logical calculi for knowledge discovery in databases

Observational calculi were defined in relation to the GUHA method of mechanising hypothesis formation. Formulae of observational calculi correspond to statistical hypothesis tests and various further assertions verified in the process of data analysis. An example of the application of the GUHA procedure PC-ASSOC is described in the paper. Logical relations among formulae of observational calculi are discussed and some important results concerning deduction rules are shown. Possibilities of applying the logical properties of formulae corresponding to hypothesis tests in the field of KDD are suggested.

Jan Rauch
Extraction of experts' decision process from clinical databases using rough set model

One of the most important problems with rule induction methods is that they cannot extract rules that plausibly represent experts' decision processes. On the one hand, rule induction methods induce probabilistic rules whose description length is too short compared with the experts' rules. On the other hand, construction of Bayesian networks generates rules that are too lengthy. In this paper, the characteristics of experts' rules are closely examined and a new approach to extracting plausible rules is introduced, which consists of the following three procedures. First, the characterization of decision attributes (given classes) is extracted from databases and the classes are classified into several groups with respect to the characterization. Then, two kinds of sub-rules, characterization rules for each group and discrimination rules for each class in the group, are induced. Finally, those two parts are integrated into one rule for each decision attribute. The proposed method is evaluated on medical databases; the experimental results show that the induced rules correctly represent experts' decision processes.

Shusaku Tsumoto
Discovering of health risks and case-based forecasting of epidemics in a health surveillance system

In this paper we present the methodology and the architecture of an early warning system which fulfils the following tasks: (1) discovery of health risks, (2) forecasting of the temporal and spatial spread of epidemics, and (3) estimation of the consequences of an epidemic w.r.t. the personnel load and costs of the public health service. To cope with these three tasks, methods from knowledge discovery and data mining, case-based reasoning, and statistics are applied.

M. Bull, G. Kundt, L. Gierl
An algorithm for multi-relational discovery of subgroups

We consider the problem of finding statistically unusual subgroups in a multi-relation database, and extend previous work on single-relation subgroup discovery. We give a precise definition of the multi-relation subgroup discovery task, propose a specific form of declarative bias based on foreign links as a means of specifying the hypothesis space, and show how propositional evaluation functions can be adapted to the multi-relation setting. We then describe an algorithm for this problem setting that uses optimistic estimate and minimal support pruning, an optimal refinement operator, and sampling to ensure efficiency, and that can easily be parallelized.

Stefan Wrobel
Finding similar time series

Similarity of objects is one of the crucial concepts in several applications, including data mining. For complex objects, similarity is nontrivial to define. In this paper we present an intuitive model for measuring the similarity between two time series. The model takes into account outliers, different scaling functions, and variable sampling rates. Using methods from computational geometry, we show that this notion of similarity can be computed in polynomial time. Using statistical approximation techniques, the algorithms can be speeded up considerably. We give preliminary experimental results that show the naturalness of the notion.

Gautam Das, Dimitrios Gunopulos, Heikki Mannila
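The paper's similarity model is geometric; as a rough, hedged illustration of the same idea (tolerating outliers and differences in scale), here is a simple longest-common-subsequence style score of our own, not the authors' algorithm:

```python
# Normalize each series to [0, 1] so a linear scaling difference disappears,
# then count how many points can be matched in order within tolerance eps;
# unmatched points (e.g. outliers) are simply skipped.
def normalize(s):
    lo, hi = min(s), max(s)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in s]

def lcss_similarity(a, b, eps=0.1):
    a, b = normalize(a), normalize(b)
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if abs(a[i - 1] - b[j - 1]) <= eps:      # points match within tolerance
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:                                     # skip a point on either side
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[n][m] / min(n, m)                       # 1.0 means fully similar

print(lcss_similarity([1, 2, 3, 40, 5, 6], [10, 20, 30, 50, 60]))
```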
Exploration of document collections with self-organizing maps: A novel approach to similarity representation

Classification is one of the central issues in any system dealing with text data. The need for effective approaches has increased dramatically nowadays due to the advent of massive digital libraries containing free-form documents. What we are looking for are powerful methods for the exploration of such libraries whereby the detection of similarities between the various text documents is the overall goal. In other words, methods that may be used to gain insight into the inherent structure of the various items contained in a text archive are needed. In this paper we demonstrate the applicability of self-organizing maps, a neural network model adhering to the unsupervised learning paradigm, to the task of text document clustering. In order to improve the representation of the result, we present an extension to the basic learning rule that captures the movement of the various weight vectors in a two-dimensional output space for convenient visual inspection. The result of the extended training algorithm allows intuitive analysis of the similarities inherent in the input data and, most importantly, intuitive recognition of cluster boundaries.

Dieter Merkl
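For readers unfamiliar with the basic (unextended) learning rule, the following is a minimal sketch on toy random term vectors; it is an assumption-laden illustration of a standard self-organizing map update, not the extended algorithm described in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.random((20, 50))        # 20 documents over a 50-term vocabulary (toy data)
grid_h, grid_w = 6, 6
weights = rng.random((grid_h, grid_w, docs.shape[1]))
coords = np.dstack(np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij"))

steps = 1000
for t in range(steps):
    x = docs[t % len(docs)]
    lr = 0.5 * (1 - t / steps)                            # decaying learning rate
    sigma = max(0.5, 3.0 * (1 - t / steps))               # shrinking neighbourhood
    dists = np.linalg.norm(weights - x, axis=2)
    bmu = np.unravel_index(dists.argmin(), dists.shape)   # best-matching unit
    grid_dist = np.linalg.norm(coords - np.array(bmu), axis=2)
    h = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))      # neighbourhood kernel
    weights += lr * h[..., None] * (x - weights)          # move units toward x

# Each document is then assigned to its best-matching unit; nearby units on
# the 2-D grid come to represent similar documents, which is what makes the
# map useful for visual inspection of cluster structure.
```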
Pattern based browsing in document collections

We present Document Explorer, a data mining system searching for patterns in document collections. These patterns provide knowledge on the application domain that is represented by the collection. A pattern can also be seen as a query that retrieves a set of documents. Thus the data mining tools can be used to identify interesting queries which can be used to browse the collection. The main pattern types the system can search for are frequent sets of concepts, association rules, concept distributions, and concept graphs. To enable the user to specify some explicit bias, the system provides several types of constraints for searching the vast implicit spaces of patterns that exist in the collection. The patterns which have been verified as interesting are structured and presented in a visual user interface, allowing the user to operate on the results to refine and redirect search tasks or to access the associated documents. The system offers preprocessing tools to construct or refine a knowledge base of domain concepts and to create an internal representation of the document collection that will be used by all subsequent data mining operations. In this paper, we give an overview of the Document Explorer system. We summarize our methodical approaches and solutions for the special requirements of this document mining area.

Ronen Feldman, Willi Klösgen, Yaniv Ben-Yehuda, Gil Kedar, Vladimir Reznikov
Induction of fuzzy characteristic rules

In this paper we present a method for the induction of fuzzy characteristic rules from a fuzzy subset. We use a two-step method where in the first step we search for candidate predicates which will make up the characteristic rules. For this purpose, typical values are used to isolate characteristic intervals for attributes; these intervals will make up the candidate predicates. In the second step we use the candidate predicates to induce the fuzzy characteristic rules.

Dan Rasmussen, Ronald R. Yager
Regression-based classification methods and their comparison with decision tree algorithms

Classification learning can be considered as a regression problem with a dependent variable consisting of 0s and 1s. By reducing classification to the problem of finding numerical dependencies, we gain an opportunity to utilize the powerful regression methods implemented in the PolyAnalyst data mining system. The resulting regression functions can be considered as fuzzy membership indicators for a recognized class. In order to obtain classifying rules, the optimum threshold values which minimize the number of misclassified cases can be found for these functions. We show that this approach allows one to solve the over-fit problem satisfactorily and provides results that are at least not worse than the results obtained by the most popular decision tree algorithms.

Mikhail V. Kiselev, Sergei M. Ananyan, Sergei B. Arseniev
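As a hedged illustration of the idea (not the PolyAnalyst machinery, and with synthetic data), this sketch fits a least-squares regression to a 0/1 target and then scans for the threshold with the fewest misclassified training cases:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(float)

# Least-squares fit of the 0/1 target (intercept column added by hand).
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
scores = A @ coef                       # acts as a fuzzy membership indicator

# Keep the threshold that minimizes the number of misclassified cases.
best_t, best_err = None, len(y) + 1
for t in np.unique(scores):
    err = int(np.sum((scores >= t).astype(float) != y))
    if err < best_err:
        best_t, best_err = t, err
print(best_t, best_err / len(y))        # chosen threshold and training error rate
```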
Attribute discovery and rough sets

Most knowledge discovery methods assume that the original representation space is adequate, that is, that the initial attributes are sufficiently relevant to the problem at hand. In real-world applications, discovery of new attributes and selection of relevant attributes are applied frequently in data pre-processing. In the paper we discuss a rough set based approach to attribute discovery. We consider the discovery of adequate attributes for structural objects. We present two algorithms for extracting new attributes.

Jaroslaw Stepaniuk
Generation of rules from incomplete information systems

In the paper we define certain and possible rules in an incomplete information system as those that are certain/possible in every completion of the initial system. A careful examination of the dependencies between an incomplete system and its completions allows us to state that it is feasible to generate all certain rules and an important class of possible rules directly from the incomplete information system. The space complexity of the proposed method of rule generation is linear in the number of objects in the initial system.

Marzena Kryszkiewicz
Knowledge discovery from software engineering data: Rough set analysis and its interaction with goal-oriented measurement

Knowledge discovery from software engineering measurement data is an essential prerequisite for software quality improvement. Software measurement is increasingly recognized as a device to better understand, monitor, predict, and improve software development and maintenance projects. The paper describes the interaction between goal-oriented measurement and rough set based analysis in the context of experimental software engineering. The experience gained is based on the application of rough set analysis to practical problems of software engineering.

Günther Ruhe, Fraunhofer Gesellschaft
Efficient multisplitting on numerical data

Numerical data poses a problem to symbolic learning methods, since numerical value ranges inherently need to be partitioned into intervals for representation and handling. An evaluation function is used to approximate the goodness of different partition candidates. Most existing methods for multisplitting on numerical attributes are based on heuristics because of the apparent efficiency advantages. We characterize a class of well-behaved cumulative evaluation functions for which efficient discovery of the optimal multisplit is possible by dynamic programming. A single pass through the data suffices to evaluate multisplits of all arities. This class contains many important attribute evaluation functions familiar from symbolic machine learning research. Our empirical experiments convey that there are no significant differences in efficiency between the method that produces optimal partitions and those that are based on heuristics. Moreover, we demonstrate that optimal multisplitting can be beneficial in decision tree learning, in contrast to the much applied binarization of numerical attributes or heuristic multisplitting.

Tapio Elomaa, Juho Rousu
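To show what an optimal multisplit found by dynamic programming can look like in the simplest setting, here is a small sketch of our own (weighted class entropy as the cumulative evaluation function, a fixed arity k, and naive interval costs; the paper's single-pass algorithm is more refined):

```python
import math
from collections import Counter

def weighted_entropy(labels):
    n = len(labels)
    counts = Counter(labels)
    return n * -sum((c / n) * math.log2(c / n) for c in counts.values())

def best_multisplit(values, labels, k):
    """values must be sorted by the numeric attribute; returns (cost, cut indices)."""
    n = len(values)
    # cost[i][j]: impurity of examples i..j-1 treated as a single interval
    cost = [[weighted_entropy(labels[i:j]) for j in range(n + 1)] for i in range(n + 1)]
    INF = float("inf")
    # dp[m][j]: best impurity of splitting the first j examples into m intervals
    dp = [[INF] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    dp[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(1, n + 1):
            for i in range(m - 1, j):
                cand = dp[m - 1][i] + cost[i][j]
                if cand < dp[m][j]:
                    dp[m][j], cut[m][j] = cand, i
    cuts, j = [], n                      # recover the cut positions
    for m in range(k, 0, -1):
        i = cut[m][j]
        if i:
            cuts.append(i)
        j = i
    return dp[k][n], sorted(cuts)

values = [1, 2, 3, 10, 11, 12, 20, 21]
labels = ["a", "a", "a", "b", "b", "b", "a", "a"]
print(best_multisplit(values, labels, 3))   # (0.0, [3, 6]): three pure intervals
```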
SNOUT: An intelligent assistant for exploratory data analysis

In this paper we present an account of the main features of Snout, an intelligent assistant for exploratory data analysis (EDA) of social science survey data that incorporates a range of data mining techniques. EDA has much in common with existing data mining techniques: its main objective is to help an investigator reach an understanding of the important relationships in a data set rather than simply develop predictive models for selected variables. Brief descriptions of a number of novel techniques developed for use in Snout are presented. These include heuristic variable level inference and classification, automatic category formation, the use of similarity trees to identify groups of related variables, interactive decision tree construction, and model selection using a genetic algorithm.

P. D. Scott, A. P. M. Coxon, M. H. Hobbs, R. J. Williams
Exploratory analysis of biochemical processes using hybrid modeling methods

A heuristic, process-data-based procedure has been developed that allows a universal, time-efficient exploratory analysis of biochemical processes and predictive information extraction from large data sets. It uses artificial neural networks in combination with mass balance equations to represent unknown relationships between process variables. An efficient algorithm for training this hybrid system has been developed. The result of the procedure is a numerical representation of the important process relationships that immediately allows one to determine improved set points and/or profiles for the manipulated variables with respect to process performance. For illustration, the procedure is applied to the extraction of the important relationships in a fed-batch bacterial cultivation process.

Rimvydas Simutis
Using signature files for querying time-series data

This paper describes our work on a new automatic indexing technique for large one-dimensional (1D) or time-series data. The principal idea of the proposed time-series data indexing method is to encode the shape of the time-series into an alphabet of characters and then to treat them as text. As far as we know, this is a novel approach to 1D data indexing. In this paper we report our results in using the proposed indexing method for retrieval of real-life time-series data by its content.

Henrik André-Jönsson, Dushan Z. Badal
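A minimal sketch of the kind of shape-to-alphabet encoding the abstract describes, under our own assumptions (the symbol set, thresholds, and function name are illustrative, not the paper's encoding):

```python
# Encode the sign and magnitude of consecutive differences as characters,
# so the resulting strings can be indexed and searched as ordinary text.
def encode_shape(series, flat=0.1):
    symbols = []
    for prev, cur in zip(series, series[1:]):
        d = cur - prev
        if abs(d) <= flat:
            symbols.append("s")                             # stable
        elif d > 0:
            symbols.append("U" if d > 3 * flat else "u")    # sharp / gentle rise
        else:
            symbols.append("D" if d < -3 * flat else "d")   # sharp / gentle fall
    return "".join(symbols)

print(encode_shape([1.0, 1.05, 1.5, 2.9, 2.8, 1.2]))  # "sUUsD"
```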
A new and versatile method for association generation

Current algorithms for finding associations among the attributes describing data in a database have a number of shortcomings:

1. Applications that require associations with very small support have prohibitively large running times.
2. They assume a static database. Some applications require generating associations in real-time from a dynamic database, where transactions are constantly being added and deleted. There are no existing algorithms to accommodate such applications.
3. They can only find associations of the type where a conjunction of attributes implies a conjunction of different attributes. It turns out that there are many cases where a conjunction of attributes implies another conjunction only provided the exclusion of certain attributes. To our knowledge, there is no current algorithm that can generate such excluding associations.

We present a novel method for association generation that answers all three of the above desiderata. Our method is inherently different from all existing algorithms, and especially suitable to textual databases with binary attributes. At the heart of our algorithm lies the use of subword trees for quick indexing into the required database statistics. We tested our algorithm on the Reuters-22173 database with satisfactory results.

Amihood Amir, Ronen Feldman, Reuven Kashi
Bivariate decision trees

Decision tree methods constitute an important and much used technique for classification problems. When such trees are used in a data mining and knowledge discovery context, ease of interpretation of the resulting trees is an important requirement to be met. Decision trees with tests based on a single variable, as produced by methods such as ID3, C4.5, etc., often require a large number of tests to achieve an acceptable accuracy. This makes interpretation of these trees, which is an important reason for their use, disputable. Recently, a number of methods for constructing decision trees with multivariate tests have been presented. Multivariate decision trees are often smaller and more accurate than univariate trees; however, the use of linear combinations of the variables may result in trees that are hard to interpret. In this paper we consider trees with tests based on combinations of at most two variables. We show that bivariate decision trees are an interesting alternative to both uni- and multivariate trees, especially with regard to ease of interpretation.

Jan C. Bioch, Onno van der Meer, Rob Potharst
Towards process-oriented tool support for knowledge discovery in databases

Knowledge Discovery in Databases (KDD) is currently a hot topic in industry and academia. Although KDD is now widely accepted as a complex process of many different phases, the focus of research behind most emerging products is on underlying algorithms and modelling techniques. The main bottleneck for KDD applications is not the lack of techniques. The challenge is to exploit and combine existing algorithms effectively, and to help the user during all phases of the KDD process. In this paper, we describe the project Citrus which addresses these practically relevant issues. Starting from a commercially available system, we develop a scalable, extensible tool inherently based on the view of KDD as an interactive and iterative process. We sketch the main components of this system, namely an information manager for effective retrieval of data and results, an execution server for efficient execution, and a process support interface for guiding the user through the process.

Rüdiger Wirth, Colin Shearer, Udo Grimmer, Thomas Reinartz, Jörg Schlösser, Christoph Breitner, Robert Engels, Guido Lindner
A connectionist approach to structural similarity determination as a basis of clustering, classification and feature detection

Many algorithms in machine learning, knowledge discovery, pattern recognition and classification are based on the estimation of the similarity or the distance between the analysed objects. Objects with higher structural complexity often cannot be described by feature vectors without losing important structural information. These objects can adequately be represented in the language of logic or by labeled graphs. The similarity of such descriptions is difficult to define and to compute. In this paper, a connectionist approach for the determination of the similarity of arbitrary labeled graphs is introduced. Using an example from organic chemistry, the application of the approach within one distance-based and one generalisation-based classification algorithm is demonstrated. The generalisation-based algorithm forms clusters or subclasses of similar examples of the same class and extracts the parts of the objects which determine the class of the object. The algorithms perform very satisfactorily in comparison with recent logical and feature vector approaches. Moreover, being able to handle structural data directly, the algorithms need only a subset of the given features of the objects for classification.

Kristina Schädler, Fritz Wysotzki
Searching for relational patterns in data

We consider several basic classes of tolerance relations among objects. These (global) relations are defined from some predefined similarity measures on values of attributes. A tolerance relation in a given class of tolerance relations is optimal with respect to a given decision table A if it contains only pairs of objects with the same decision and the number of such pairs contained in the relation is maximal among all relations from the class. We present a method for (sub-)optimal tolerance relation learning from data (decision table). The presented method is based on the rough set approach. We show that for some basic families of tolerance relations this problem can be transformed into a relative geometrical problem in a real affine space. Hence geometrical computations become useful tools for solving the problem of global tolerance relation construction. The complexity of the considered problems can be evaluated by the complexity of the corresponding geometrical problems. We propose some efficient heuristics searching for an approximation of optimal tolerance relations in the considered families of tolerance relations. The global tolerance relations can be treated as patterns in the Cartesian product of the object set. We show how to apply these relational patterns (global tolerance relations) in the clustering and classification of objects.

Sinh Hoa Nguyen, Andrzej Skowron
Finding spatial clusters

The special challenge in analysing geographical data comes from the spatial distribution of the objects. We are interested here in finding out whether a given property is randomly distributed or concentrated somewhere. More exactly: consider a two-dimensional region subdivided into non-overlapping fields, e.g. a state divided into counties, and assume that some fields are marked for having a distinguishing property. Do the marked fields exhibit some spatial clustering? Two tests feasible in data mining situations are proposed here, based on the number of fields in clusters (defined by means of triplets, i.e. essentially three marked fields with a common boundary point) and on the number of edges of marked fields shared by another marked field. For regular settings such as honeycombs (sets of hexagons) some theoretical results are reported. In addition, simulations have been performed on honeycombs as well as on real subdivisions of a region, and the tests have been applied to real data.

Friedrich Gebhardt
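As a concrete reading of the second test statistic, here is a tiny sketch of our own (the neighbour map and field names are hypothetical) that counts the edges of marked fields shared with another marked field:

```python
neighbours = {                # e.g. counties and the counties they border
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["A", "B", "D"],
    "D": ["B", "C"],
}
marked = {"A", "B", "C"}      # fields having the distinguishing property

shared_edges = sum(
    1 for f in marked for g in neighbours[f] if g in marked
) // 2                        # each shared edge is seen from both sides
print(shared_edges)           # 3: A-B, A-C, B-C
```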
Interactive interpretation of hierarchical clustering

Automatic clustering methods are part of data mining methods. They aim at building clusters of items so that similar items fall into the same cluster while dissimilar ones fall into separate clusters. A particular class of clustering methods are hierarchical ones where recursive clusters are formed to grow a tree representing an approximation of similarities between items. We propose a new interactive interface to help the user to interpret the result of such a clustering process, according to the item characteristics. The prototype has been applied successfully to a special case of items providing nice graphical representations (electric load curves) but can also be used with other types of curves or with more standard items.

Eric Boudaillier, Georges Hébrail
The principle of transformation between efficiency and effectiveness: Towards a fair evaluation of the cost-effectiveness of KDD techniques

Most of the KDD literature focuses on analyzing the effectiveness of KDD techniques, in the sense, e.g., of reducing the classification error rate in the case of classification tasks. Efficiency issues are usually considered of secondary importance, if considered at all. In contrast, we focus on the cost-effectiveness of KDD techniques, i.e. on the trade-off between effectiveness (reduction of error rate) and efficiency (reduction of processing time). In particular, we show that a gain in efficiency can be transformed into a gain in effectiveness, and that this principle can be used to evaluate the cost-effectiveness of KDD systems in a fair manner. We discuss the application of this general principle to evaluate the cost-effectiveness of two general kinds of KDD techniques, namely classification algorithms and attribute selection algorithms.

Alex A. Freitas
Recognizing reliability of discovered knowledge

When using discovered knowledge for decision making (e.g. classification in the case of machine learning), the question of reliability becomes very important. Unlike the global view of algorithms (evaluation of overall accuracy on some testing data) or multistrategy learning (voting of several classifiers), we propose a “local” evaluation for each example using one classifier. The basic idea is to learn to classify the correct decisions made by the classifier. This is done by creating a new class attribute “match” and by running the learning algorithm on the same input attributes. We call this (second) step verification. Some first preliminary experimental results of this method used with C4.5 and CN4 are reported. These results show that: (1) if the classification accuracy is very high, it makes no sense to perform the verification step (since the verification step will create only the majority rule), (2) in multiple-class and/or noisy domains the verification accuracy can be significantly higher than the classification accuracy.

Petr Berka
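A minimal sketch of the verification idea with a stand-in learner (scikit-learn decision trees on the Iris data instead of C4.5/CN4; the depth limit is only there so that the first classifier makes some training errors):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 1: classification.
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)

# Step 2: verification -- a new class attribute "match" records whether the
# classifier was right, and a second learner predicts it from the same inputs.
match = (clf.predict(X_train) == y_train).astype(int)
verifier = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, match)

reliable = verifier.predict(X_test)     # per-example reliability estimate
print("classification accuracy:", clf.score(X_test, y_test))
print("fraction judged reliable:", reliable.mean())
```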
Clustering techniques in biological sequence analysis

In biological sequence analysis many DNA and RNA sequences discovered in laboratory experiments are not properly identified. Here the focus is on using clustering algorithms to provide a structure to the data. The approach is inter-disciplinary using domain knowledge to identify such sequences. The enormous volume and high dimensionality of unidentified biological sequence data presents a challenge. Nonetheless useful and interesting results have been obtained, both directly and indirectly, by applying clustering to the data.

A. M. Manning, J. A. Keane, A. Brass, C. A. Goble
TOAS intelligence mining; analysis of natural language processing and computational linguistics

The Technology Opportunities Analysis System (TOAS), being developed under a Defense Advanced Research Projects Agency (DARPA) project, enables mining of text files using bibliometrics. TOAS, a software system, extracts useful information from literature abstract files, which have identified fields that repeat in each abstract record of specific databases, such as Engineering Index (ENGI), INSPEC, Business Index, U.S. Patents, and the National Technical Information Service (NTIS) Research Reports. TOAS applies various technologies, including natural language processing (NLP), computational linguistics (CL), fuzzy analysis, latent semantic indexing, and principal components analysis (PCA). This software system combines simple operations (i.e., listing, counting, list comparisons and sorting of the field results of consolidated records retrieved by search terms) with complex matrix manipulations, statistical inference and artificial intelligence approaches to reveal patterns and provide insights from large amounts of information, primarily related to technology-oriented management issues. The authors apply the TOAS tool to its own root technologies, NLP and computational linguistics, two apparently synonymous terms. These terms, however, when used in a literature search of the same abstract databases, ENGI and INSPEC, provide distinctly different search results, with only 10% to 25% overlap in the retrieved abstract records. This paper introduces TOAS, summarizes analyses comparing NLP and CL, and then discusses the underlying development implications.

Robert J. Watts, Alan L. Porter, Scott Cunningham, Donghua Zhu
Algorithms for constructing of decision trees

Decision trees are widely used in different applications for problem solving and for knowledge representation. In the paper, algorithms for decision tree construction with bounds on complexity and precision are considered. In these algorithms, different measures for the time complexity of decision trees and different measures for the uncertainty of decision tables are used. New results about the precision of polynomial approximate algorithms for covering problem solving [1, 2] show that some of the considered algorithms for decision tree construction are, apparently, close to unimprovable.

Mikhail Moshkov
Mining in the phrasal frontier

Data mining methods have been applied to a wide variety of domains. Surprisingly enough, only a few examples of data mining in text are available. However, considering the amount of existing document collections, text mining would be most useful. Traditionally, texts have been analysed using various information retrieval related methods and natural language processing. In this paper, we present our first experiments in applying general methods of data mining to discovering phrases and co-occurring terms. We also describe the text mining process developed. Our results show that data mining methods — with appropriate preprocessing — can be used in text processing, and that by shifting the focus the process can be used to obtain results for various purposes.

Helena Ahonen, Oskari Heinonen, Mika Klemettinen, A. Inkeri Verkamo
Mining time series using rough sets — A case study

This article attempts to deal with the problem of time within the framework of rough sets. The rough set theory has emphasized the reduction of information necessary to acquire desired knowledge. This is particularly important when we are dealing with time. The farther back we trace our dependencies, the more attributes will become independent of our current decisions. We formalize approaches to reasoning with time series where the sequence of events is important, and introduce formalisms to deduce decision rules with real-time constraints.

Anders Torvill Bjorvand
Neural networks design: Rough set approach to continuous data

The construction of a neural network based on decision rules generated by rough set methods is given. Analogies between neural network semantics and the rough set approach to continuous data are pointed out. The general framework provides a starting point for applications such as tuning decision rules by the neural network learning process.

Nguyen Hung Son, Marcin S. Szczuka, Dominik Ślezak
On meta levels of an organized society of KDD agents

Modeling of the KDD (Knowledge Discovery in Databases) process constitutes an important and new research area of KDD, namely the meta levels of the KDD process, including formal specification of the process, its planning, scheduling, controlling, management, evolution, and reuse. The key issue is how to increase both the autonomy and the versatility of a KDD system. Our methodology is to create an organized society of KDD agents. This means (1) developing many kinds of KDD agents for different discovery tasks; (2) using the KDD agents in multiple learning phases in a distributed cooperative mode; (3) managing the society of KDD agents by multiple meta-control levels. Based on this methodology, a multi-strategy and cooperative KDD system, which can be imagined as a softbot and is named GLS (Global Learning Scheme), is being developed by us. This paper focuses on the meta-control levels for increasing both the autonomy and the versatility of the KDD system.

Ning Zhong, Setsuo Ohsuga, Chunnian Liu, Yoshitsugu Kakemoto, Xiaodong Zhang
Using neural network to extract knowledge from database

Classification, which involves finding rules that partition a given data set into disjoint groups, is one class of data mining problems. Approaches proposed so far for mining classification rules for large databases are mainly decision tree approaches based on symbolic learning methods. In this paper, we use an artificial neural network to mine classification rules. We present a novel approach, called LBSB, composed of two phases, to extract rules from an artificial neural network and discover knowledge in databases. Some experiments have demonstrated that our method generates rules of better performance than the decision tree approach in noisy conditions.

Zhou Yuanhui, Lu Yuchang, Shi Chunyi
Induction of strong feature subsets

The problem of feature subset selection can be defined as the selection of a relevant subset of features which allows a learning algorithm to induce small high-accuracy concepts. To achieve this goal, two main approaches have been developed: the first one is algorithm-independent (the filter approach) and considers only the data, while the second one takes into account both the data and a given learning algorithm (the wrapper approach). Recent work has studied the interest of rough set theory, and more particularly its notions of reducts and core, for the problem of feature subset selection. Different methods were proposed to select features using both the core and the reduct concepts, whereas other research shows that a useful feature subset does not necessarily contain all features in the core. In this paper, we underline the fact that rough set theory is concerned with deterministic analysis of attribute dependencies, which are at the basis of the two notions of reduct and core. We extend the notion of dependency, which allows us to find both deterministic and non-deterministic dependencies. A new notion of strong reducts is then introduced and leads to the definition of strong feature subsets (SFS). The interest of SFS is illustrated by the improvement of the accuracy of our learning system, called Alpha, on real-world datasets.

Mohamed Quafafou, Moussa Boussouf
Rough sets for data mining and knowledge discovery

The problem of handling imperfect and approximate knowledge has recently been recognized as a crucial issue in solving several complex real-life problems, also in the case of data mining and knowledge discovery. Among the many approaches to imperfect knowledge, such as the fuzzy sets of Zadeh (1965), various forms of neural networks, and evidence theory, to name a few, the theory of information systems and rough sets introduced by Z. Pawlak in 1982 has lately gained substantial attention. Rough sets are attractive from the computational point of view because the underlying concept of an information system is basically a relation in a database. Rough sets also have a rather intuitive common-sense interpretation (i.e. belief and plausibility) while having at the same time a sound mathematical definition (i.e. lower and upper bounds). Several problems concerning data analysis presented in the form of an information system can be solved using the rough set theory. This methodology has been successfully applied in medical data analysis, finance, voice recognition, image processing, process modelling and identification, conflict resolution, etc. It has many advantages, such as: efficient algorithms for finding hidden patterns in data, finding minimal sets of data (data reduction), evaluation of the significance of data, generation of relevant sets of decision rules or features from data, straightforward interpretation of results presented as rules, and others. This tutorial introduces basic rough sets for data and continues with advanced methods for synthesizing approximate classification and decision rules; it then gives methods for dealing with large data sets. The participants will be given an opportunity to gain hands-on experience using Rosetta, our PC-based tool-kit for data analysis with rough sets.

Jan Komorowski, Lech Polkowski, Andrzej Skowron
Techniques and applications of KDD

Increasing amounts of data are collected in application domains as diverse as finance, market data analysis, astronomy, manufacturing, and education. Large datasets can lead to knowledge essential to domain understanding and decision-making, but knowledge discovery requires intelligent and largely automatic methods and systems supporting the analysts. The rapidly emerging field of Knowledge Discovery in Databases (KDD) provides these tools. It also assimilates methods of machine learning and discovery, statistics, databases, and visualization. In this basic tutorial, we explain what KDD is and what its basic objectives and methods are. We introduce the knowledge discovery process, which includes data pre-processing, selection, transformation, exploration for patterns, and qualification of patterns as knowledge. We present the most important data mining methods, the conditions for their successful application, and their advantages in different application areas. Many methods and many forms of knowledge can be integrated in a modular, interactive and iterative KDD system. We discuss search efficiency, knowledge evaluation and refinement, the role of background knowledge, and languages designed to form different elements of knowledge. Finally, we outline several key challenges.

Willi Klösgen, Jan Zytkow
A tutorial introduction to high performance data mining

The goal of the tutorial is to provide researchers, practitioners, and advanced students with an introduction to techniques from high performance computing and high performance data management which are applicable to data mining, and to illustrate their use on a variety of practical data mining problems. A fundamental problem in data mining is to develop data mining algorithms and systems which scale as the amount of data grows, as the dimension grows, and as the complexity of the data grows. Recently, 1) parallel and distributed variants of some standard data mining algorithms have been developed and 2) data mining systems have begun to develop higher performance access to databases and data warehouses. These and related topics will be covered in the tutorial. Finally, we will cover several case studies involving mining large data sets, from 10 to 500 gigabytes in size.

Robert Grossman
Data mining in the telecommunications industry

The Data Mining Group at BT Laboratories carries out data mining projects for other parts of the company, provides consultancy on data mining, and carries out research into new data mining techniques. The first part of the tutorial describes the application of data mining in three areas particularly relevant to the telecommunications business: Network Performance, Fraud Investigation, and Market Analysis. A number of case studies will be presented. The second part draws general conclusions about the process of Knowledge Discovery in the telecommunications industry. Practitioners' tips for each stage will be given, concentrating on the consultancy and customer-facing aspects of the process. Although the presentation is aimed at a wide audience, a familiarity with basic Statistics and Machine Learning techniques is assumed.

Leo Carbonara, Huw Roberts, Blaise Egan
Backmatter
Metadata
Title
Principles of Data Mining and Knowledge Discovery
Edited by
Jan Komorowski
Jan Zytkow
Copyright year
1997
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-69236-2
Print ISBN
978-3-540-63223-8
DOI
https://doi.org/10.1007/3-540-63223-9