
2002 | Book

Progress in Discovery Science

Final Report of the Japanese Discovery Science Project

Edited by: Setsuo Arikawa, Ayumi Shinohara

Publisher: Springer Berlin Heidelberg

Book Series: Lecture Notes in Computer Science


Table of Contents

Frontmatter
Searching for Mutual Exclusion Algorithms Using BDDs

The impact of verification technologies would be much greater if they could not only verify existing information systems but also synthesize or discover new ones. In our previous study, we tried to discover new algorithms that satisfy a given specification by first defining a space of algorithms and then checking each algorithm in the space against the specification using an automatic verifier, i.e., a model checker. Needless to say, the most serious problem of this approach is search space explosion. In this paper, we describe case studies in which we employed symbolic model checking using BDDs and searched for synchronization algorithms. By employing symbolic model checking, we could speed up the enumeration and verification of algorithms. We also discuss the use of approximation for reducing the search space.

Koichi Takahashi, Masami Hagiya
Reducing Search Space in Solving Higher-Order Equations

We describe the results of our investigation of equational problem solving in a higher-order setting. The main problem is identified as that of reducing the search space of higher-order lazy narrowing calculi, namely how to reduce the search space without losing the completeness of the calculi. We present a higher-order calculus HOLN0 as a system of inference rules and discuss various refinements that reduce the search space by eliminating some sources of nondeterminism inherent in the calculus.

Tetsuo Ida, Mircea Marin, Taro Suzuki
The Structure of Scientific Discovery: From a Philosophical Point of View

The term “discovery” can be defined as “the act of becoming aware of something previously existing but unknown.” There are many kinds of discoveries, from geographical to mathematical ones. Roughly speaking, we can classify discoveries into two types: factual discovery and conceptual discovery. In the historical development of scientific inquiry, the former typically occurs during periods of “normal science” in Kuhnian terms. The latter emerges at the moment of a “scientific revolution,” i.e., the time of a paradigm shift. Scientific discoveries require imagination as well as reasoning. Based on Roman Jakobson’s study of aphasic disturbances from the viewpoint of the use of figurative expressions, we would like to distinguish the metonymical imagination from the metaphorical one. While the former is related to discoveries of facts and laws which complete an unfinished theory, the latter is concerned with conceptual discoveries which change the viewpoint from explicit facts to implicit unknown relations. Considering these points, it is useful to examine Poincaré’s description of his own experience of mathematical discovery. He emphasized the working of “the subliminal ego” and “the aesthetic sensibility” in his discovery of Fuchsian functions. These activities can be compared to the function of metaphorical imagination. However, it is not easy to formulate the unconscious process of “aspect-dawning” in the form of explicit rules or algorithms. This suggests the difficulty of realizing scientific discovery by computers.

Keiichi Noé
Ideal Concepts, Intuitions, and Mathematical Knowledge Acquisitions in Husserl and Hilbert
A Preliminary Report*

We analyze Husserl’s and Hilbert’s studies on the role of “ideal concepts” and of the notion of “mathematical intuition” in the context of mathematical knowledge acquisition. We note a striking similarity between Husserl’s analysis in 1901 (presented at Hilbert’s seminar) and Hilbert’s analysis in the 1920s of the problem of justifying ideal concepts. We also analyze the similarities and differences between Husserl’s and Hilbert’s standpoints on mathematical objectivities and on mathematical intuitions. In the course of this analysis we compare these with the standpoints of Gödel and of some members of the Hilbert school. We also propose a view of mathematical knowledge acquisition along a “dynamic” interpretation of the Husserlian philosophy of mathematics, which provides a view different from the traditional realist, nominalist, constructivist, conventionalist, and empiricist views.

Mitsuhiro Okada
Theory of Judgments and Derivations
Dedicated to the Memory of Professor Katuzi Ono

We propose a computational and logical framework NF (Natural Framework) which is suitable for presenting mathematics formally. Our framework is extendable since it is open-ended both computationally and logically in the sense of Martin-Löf’s theory of types. NF is developed in three steps. Firstly, we introduce a theory of expressions and schemata which provides a universe for representing mathematical objects, in particular judgments and derivations as well as other usual mathematical objects. Secondly, we develop a theory of judgments within the syntactic universe of expressions. Finally, we introduce the notions of derivation and derivation game and show that we can develop mathematics as derivation games by regarding mathematics as an open-ended process of defining new concepts and deriving new judgments. Our theory is inspired by Martin-Löf’s theory of expressions and Edinburgh LF, but is conceptually much simpler. Our theory is also influenced by Gentzen’s natural deduction systems.

Masahiko Sato
Efficient Data Mining from Large Text Databases

In this paper, we consider the problem of discovering a simple class of combinatorial patterns from a large collection of unstructured text data. As a framework of data mining, we adopt optimized pattern discovery, in which a mining algorithm discovers the best patterns that optimize a given statistical measure within a class of hypothesis patterns on a given data set. We present efficient algorithms for the classes of proximity word-association patterns and report experiments on keyword discovery from Web data.

Hiroki Arimura, Hiroshi Sakamoto, Setsuo Arikawa
A Computational Model for Children’s Language Acquisition Using Inductive Logic Programming

This paper describes our research on developing a computational model of children’s word acquisition using inductive logic programming. We incorporate cognitive biases developed recently to explain the efficiency of children’s language acquisition. We also design a co-evolution mechanism for acquiring concept definitions for words and developing a concept hierarchy. The concept hierarchy plays an important role in defining contexts for later word-learning processes. A context-switching mechanism is used to select a relevant set of attributes for learning a word depending on the category to which it belongs. Conversely, while definitions for words are being acquired, the concept hierarchy is developed. We developed an experimental language acquisition system called WISDOM (Word Induction System for Deriving Object Model) and conducted virtual experiments, or simulations, on the acquisition of words in two different categories. The experiments show the feasibility of our approach.

Ikuo Kobayashi, Koichi Furukawa, Tomonobu Ozaki, Mutsumi Imai
Some Criterions for Selecting the Best Data Abstractions

This paper presents and summarizes some criteria for selecting the best data abstraction for relations in relational databases. A data abstraction can be understood as a grouping of attribute values in which their individual aspects are forgotten, so that the values are abstracted together to some more abstract value. Consequently, a relation after the abstraction is more compact, and data miners can work on it efficiently. It is, however, a major problem that when an important aspect of the data values is neglected in the abstraction, the quality of the extracted knowledge becomes worse. So the central issue is to present a criterion under which only an adequate data abstraction is selected, so as to keep the important information and to reduce the sizes of relations at the same time. From this viewpoint, we present three criteria and test them on a task of classifying tuples in a relation given several target classes. All the criteria are derived from a notion of similarity among class distributions, and are formalized based on standard information theory. We also summarize our experimental results for the classification task, and discuss future work.

Makoto Haraguchi, Yoshimitsu Kudoh
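The abstract leaves the three criteria unspecified; as an illustrative sketch of the shared idea (similarity of class distributions, measured information-theoretically), one might merge two attribute values when the class distributions they induce are close. The Jensen-Shannon divergence, the threshold, and the toy distributions below are assumptions for illustration, not the paper's criteria:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits (assumes q[i] > 0 wherever p[i] > 0)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: a symmetric, bounded dissimilarity of class distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Class distributions P(class | attribute value) for three attribute values.
dist = {
    "red":     [0.9, 0.1],
    "crimson": [0.85, 0.15],
    "blue":    [0.2, 0.8],
}

# Group values whose pairwise divergence falls below a threshold: a candidate abstraction.
threshold = 0.05
print(js_divergence(dist["red"], dist["crimson"]) < threshold)  # True: abstract together
print(js_divergence(dist["red"], dist["blue"]) < threshold)     # False: keep apart
```

Merging "red" and "crimson" barely changes the class distribution, so little classification-relevant information is lost; merging "red" and "blue" would discard a lot.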
Discovery of Chances Underlying Real Data

Chance discovery means noticing and explaining the significance of a chance, especially when the chance is rare and its significance is unnoticed. Chance discovery is essential for various real requirements in human life. The paper presents the significance of chance discovery and the methods achieved so far. A fundamental discussion of how to realize chance discovery extracts the keys to its progress: communication, imagination, and data mining. As an approach to chance discovery, visualized data mining methods are shown as tools aiding chance discoveries.

Yukio Ohsawa
Towards the Integration of Inductive and Nonmonotonic Logic Programming

Commonsense reasoning and machine learning are two important topics in AI. These techniques are realized in logic programming as nonmonotonic logic programming (NMLP) and inductive logic programming (ILP), respectively. NMLP and ILP have seemingly different motivations and goals, but they have much in common in the problems behind them. This article overviews the author’s recent research results on realizing induction from nonmonotonic logic programs.

Chiaki Sakama
EM Learning for Symbolic-Statistical Models in Statistical Abduction

We first review a logical-statistical framework called statistical abduction and identify its three computational tasks, one of which is the learning of parameters from observations by ML (maximum likelihood) estimation. Traditionally, in the presence of missing values, the EM algorithm has been used for ML estimation. We report that the graphical EM algorithm, a new EM algorithm developed for statistical abduction, achieves the same time complexity as specialized EM algorithms developed in individual disciplines, such as the Inside-Outside algorithm for PCFGs (probabilistic context-free grammars). Furthermore, learning experiments using two corpora revealed that it can outperform the Inside-Outside algorithm by orders of magnitude. We then look specifically into a family of extensions of PCFGs that incorporate context sensitivity into PCFGs. Experiments show that they are learnable by the graphical EM algorithm using at most twice as much time as plain PCFGs, even though these extensions have higher time complexity.

Taisuke Sato
Refutable/Inductive Learning from Neighbor Examples and Its Application to Decision Trees over Patterns

The paper develops the theory of refutable/inductive learning as a foundation of discovery science from examples. We consider refutable/inductive language learning from positive examples, some of which may be incorrect. The error or incorrectness we consider is described uniformly in terms of a distance over strings. We define the k-neighbor closure of a language L as the collection of strings each of which is at most k distant from some string in L. In the ordinary learning paradigm, a target language is assumed to belong to a hypothesis space without any guarantee. In this paper, we allow an inference machine to infer a neighbor closure instead of the original language as an admissible approximation. We formalize this kind of learning and give some sufficient conditions on a hypothesis space. As an application to concrete problems, we deal with languages defined by decision trees over patterns. The problem of learning decision trees over patterns has been studied from the viewpoint of knowledge discovery for genome information processing in the framework of PAC learning from both positive and negative examples. We investigate their learnability in the limit from neighbor examples as well as their refutable learnability from complete examples, i.e., from both positive and negative examples. Furthermore, we present some procedures which play an important role in designing efficient learning algorithms for decision trees over regular patterns.

Masako Sato, Yasuhito Mukouchi, Mikiharu Terada
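The k-neighbor closure is concrete enough to sketch directly. Below is a minimal illustration over a finite string sample, taking the Levenshtein edit distance as an assumed instance of the distance over strings (the paper does not fix one here):

```python
def edit_distance(s, t):
    """Levenshtein distance between strings s and t (insert/delete/substitute, unit cost)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,               # delete cs
                           cur[j - 1] + 1,            # insert ct
                           prev[j - 1] + (cs != ct))) # substitute or match
        prev = cur
    return prev[-1]

def in_k_neighbor_closure(w, language, k):
    """True iff w is at most k distant from some string of the language (finite sample here)."""
    return any(edit_distance(w, x) <= k for x in language)

L = {"abba", "abab"}
print(in_k_neighbor_closure("abba", L, 0))  # exact member
print(in_k_neighbor_closure("abca", L, 1))  # one substitution away from "abba"
print(in_k_neighbor_closure("xyz", L, 1))   # more than one edit from every string
```

An inference machine that outputs a k-neighbor closure thus tolerates examples corrupted by up to k such edits.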
Constructing a Critical Casebase to Represent a Lattice-Based Relation

This paper gives a general framework for analyzing case-based reasoning to represent a lattice-based relation. This is a generalization of our previous work on analyzing case-based reasoning over a boolean domain [Satoh98],[Satoh00a] and a tree-structured domain [Satoh00b]. In these works, we use a set-inclusion-based similarity which generalizes a similarity measure proposed in the legal domain [Ashley90],[Ashley94]. We show the representability of a lattice-based relation, an approximation method for constructing a minimal casebase to represent a relation, and a complexity analysis of the method.

Ken Satoh
On Dimension Reduction Mappings for Approximate Retrieval of Multi-dimensional Data

Approximate retrieval of multi-dimensional data, such as documents, digital images, and audio clips, is a method to get objects within some dissimilarity of a given object. We assume a metric space containing the objects, where distance is used to measure dissimilarity. In Euclidean metric spaces, approximate retrieval is easily and efficiently realized by the spatial indexing/access method R-tree. First, we consider objects in a discrete L1 (Manhattan distance) metric space and present an embedding method into Euclidean space for them. Then, we propose a projection mapping H-Map to reduce the dimensionality of multi-dimensional data, which can be applied to any metric space, such as an L1 or L∞ metric space, as well as Euclidean space. H-Map does not require coordinates of data, unlike the K-L transformation. H-Map has an advantage in using spatial indexing such as R-trees because it is a continuous mapping from a metric space to an L∞ metric space, where a hyper-sphere is a hyper-cube in the usual sense. Finally, we show that the distance function itself, which is simpler than H-Map, can be used as a dimension reduction mapping for any metric space.

Takeshi Shinohara, Hiroki Ishizaka
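The closing observation, that the distance function itself can serve as a dimension reduction mapping, can be sketched as follows: map each object to its vector of distances to a few pivot objects. By the triangle inequality, the L∞ distance between images never exceeds the original distance, so range queries can be filtered without false dismissals. The pivots and the L1 base metric below are illustrative choices, not the paper's H-Map:

```python
def l1(x, y):
    """Manhattan (L1) distance between coordinate vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))

def embed(obj, pivots, dist):
    """Map an object to the vector of its distances to pivot objects."""
    return tuple(dist(obj, p) for p in pivots)

def linf(u, v):
    """L-infinity (max) distance between vectors."""
    return max(abs(a - b) for a, b in zip(u, v))

# Triangle inequality: |d(x,p) - d(y,p)| <= d(x,y) for every pivot p,
# hence linf(embed(x), embed(y)) <= d(x,y): a contractive mapping usable for filtering.
pivots = [(0, 0), (10, 0)]
x, y = (3, 4), (6, 1)
dx, dy = embed(x, pivots, l1), embed(y, pivots, l1)
print(dx, dy)                    # distance vectors of x and y
print(linf(dx, dy) <= l1(x, y))  # always True
```

Because the image lives in an L∞ space, a range query becomes an axis-aligned box, which fits spatial indexes such as R-trees, matching the advantage the abstract attributes to H-Map.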
Rule Discovery from fMRI Brain Images by Logical Regression Analysis

This paper presents rule discovery from fMRI brain images. The discovery algorithm is Logical Regression Analysis, which consists of two steps. The first step is regression analysis. The second step is rule extraction from the regression formula obtained in the first step. In this paper, we use nonparametric regression analysis, since sufficient data are not available for rule discovery from fMRI brain images. The algorithm was applied to several experimental tasks such as finger tapping and calculation. This paper reports the calculation experiment, which rediscovered well-known facts and discovered new facts.

Hiroshi Tsukimoto, Mitsuru Kakimoto, Chie Morita, Yoshiaki Kikuchi
A Theory of Hypothesis Finding in Clausal Logic

Hypothesis finding constitutes a basic technique for fields of inference related to Discovery Science, like inductive inference and abductive inference. In this paper we explain that various hypothesis finding methods in clausal logic can be put on one general ground by using the combination of the upward refinement and residue hypotheses. Their combination also gives a natural extension of the relative subsumption relation. We also explain that definite bottom clauses, a special type of residue hypotheses, can be found by extending SLD-resolution.

Akihiro Yamamoto, Bertram Fronhöfer
Efficient Data Mining by Active Learning

An important issue in data mining and knowledge discovery is the issue of data scalability. We propose an approach to this problem by applying active learning as a method for data selection. In particular, we propose and evaluate a selective sampling method that belongs to the general category of ‘uncertainty sampling,’ by adopting and extending the ‘query by bagging’ method, proposed earlier by the authors as a query learning method. We empirically evaluate the effectiveness of the proposed method by comparing its performance against Breiman’s Ivotes, a representative sampling method for scaling up inductive algorithms. Our results show that the performance of the proposed method compares favorably against that of Ivotes, both in terms of the predictive accuracy achieved using a fixed amount of computation time, and the final accuracy achieved. This is found to be especially the case when the data size approaches a million, a typical data size encountered in real world data mining applications. We have also examined the effect of noise in the data and found that the advantage of the proposed method becomes more pronounced for larger noise levels.

Hiroshi Mamitsuka, Naoki Abe
Data Compression Method Combining Properties of PPM and CTW

Universal compression and learning have been interacting with each other. This paper combines two compression schemes, PPM (Prediction by Partial Match) and CTW (Context Tree Weighting), into a scheme which can predict multi-alphabet probabilities to attain a better compression ratio.

Takumi Okazaki, Kunihiko Sadakane, Hiroshi Imai
Discovery of Definition Patterns by Compressing Dictionary Sentences

This paper proposes an automatic method to discover definition patterns from an ordinary dictionary. There are frequent patterns used to describe words and concepts in an ordinary dictionary. Each definition pattern gives a set of similar words and can be used as a template to clarify distinctions among them. To discover these definition patterns, we convert definition sentences into tree structures and compress them using the MDL principle. An experiment on a Japanese children’s dictionary is reported, showing the effectiveness of our method.

Masatoshi Tsuchiya, Sadao Kurohashi, Satoshi Sato
On-Line Algorithm to Predict Nearly as Well as the Best Pruning of a Decision Tree

We review the underlying mechanisms of multiplicative weight-update prediction algorithms, which combine experts’ predictions to obtain a prediction that is almost as good as the best expert’s. Looking into these mechanisms, we show how such an algorithm with the experts arranged on one layer can be naturally generalized to one with the experts placed on the nodes of trees. Consequently, we give an on-line prediction algorithm that, when given a decision tree, produces predictions not much worse than the predictions made by the best pruning of the given decision tree.

Akira Maruoka, Eiji Takimoto
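The one-layer expert setting can be sketched with the classic multiplicative weight-update rule: predict by weighted vote, then multiplicatively shrink the weights of experts that erred. The learning rate and the toy prediction sequence below are assumptions for illustration, not the paper's tree-structured algorithm:

```python
import math

def hedge_predict(expert_preds, outcomes, eta=0.5):
    """Multiplicative weight update over binary experts; returns the learner's mistake count."""
    n = len(expert_preds[0])
    w = [1.0] * n
    mistakes = 0
    for preds, y in zip(expert_preds, outcomes):
        # Predict by weighted majority vote of the experts.
        vote_for_one = sum(wi for wi, p in zip(w, preds) if p == 1)
        guess = 1 if vote_for_one >= sum(w) / 2 else 0
        mistakes += (guess != y)
        # Shrink the weights of experts that erred by the factor exp(-eta).
        w = [wi * math.exp(-eta) if p != y else wi for wi, p in zip(w, preds)]
    return mistakes

# Expert 0 is always right; experts 1 and 2 are noisy.
rounds = [((1, 0, 1), 1), ((0, 1, 0), 0), ((1, 1, 0), 1), ((0, 0, 1), 0)]
preds = [r[0] for r in rounds]
ys = [r[1] for r in rounds]
print(hedge_predict(preds, ys))  # 0: the reliable expert keeps the vote correct
```

The standard analysis bounds the learner's mistakes in terms of the best single expert's mistakes plus a logarithmic term in the number of experts; the paper generalizes the experts from one layer to tree nodes.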
Finding Best Patterns Practically

Finding a pattern which separates two sets is a critical task in discovery. Given two sets of strings, consider the problem of finding a subsequence that is common to one set but never appears in the other. The problem is known to be NP-complete. An episode pattern is a generalization of a subsequence pattern in which the length of the substring containing the subsequence is bounded. We generalize these problems to optimization problems and give practical algorithms to solve them exactly. Our algorithms utilize pruning heuristics based on combinatorial properties of strings, and efficient data structures which recognize subsequence and episode patterns.

Ayumi Shinohara, Masayuki Takeda, Setsuo Arikawa, Masahiro Hirao, Hiromasa Hoshino, Shunsuke Inenaga
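The two pattern classes can be checked with short procedures: a subsequence pattern ignores gaps entirely, while an episode pattern additionally bounds the window containing the occurrence. A minimal sketch of the matching predicates (not the paper's optimized algorithms, which add pruning and specialized data structures):

```python
def is_subsequence(p, s):
    """True iff pattern p occurs in s as a (not necessarily contiguous) subsequence."""
    it = iter(s)
    return all(c in it for c in p)  # each `in` consumes the iterator up to the match

def is_episode(p, s, window):
    """True iff p occurs as a subsequence inside some substring of s of length <= window."""
    for start in range(len(s)):
        i, pos = 0, start
        while pos < min(len(s), start + window) and i < len(p):
            if s[pos] == p[i]:
                i += 1
            pos += 1
        if i == len(p):
            return True
    return False

print(is_subsequence("abc", "axbycz"))  # True
print(is_episode("abc", "axbycz", 6))   # True: fits in a window of length 6
print(is_episode("abc", "axbycz", 4))   # False: the occurrence spans 5 characters
```

A separating pattern for two string sets is then one for which the predicate is true on (most of) one set and false on the other; the optimization version scores patterns by how well they separate.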
Classification of Object Sequences Using Syntactical Structure

In ordinary classification of a sequence of objects, the objects are assumed to be independently drawn from identical information sources, and each object is classified independently. This assumption often degrades the accuracy of classification. In this paper, we consider a method to classify objects in a sequence by taking account of the context of the sequence. We define this problem as component classification and present a dynamic programming algorithm in which a hidden Markov model describes the probability distribution of the object sequences. We show the effectiveness of component classification experimentally, using musical structure analysis.

Atsuhiro Takasu
Top-Down Decision Tree Boosting and Its Applications

Top-down algorithms such as C4.5 and CART for constructing decision trees are known to perform boosting, with the procedure of choosing classification rules at internal nodes regarded as the base learner. In this work, by introducing a notion of pseudo-entropy functions for measuring the loss of hypotheses, we give a new insight into this boosting scheme from an information-theoretic viewpoint: whenever the base learner produces hypotheses with non-zero mutual information, the top-down algorithm reduces the conditional entropy (uncertainty) about the target function as the tree grows. Although its theoretical performance guarantee is worse than that of other popular boosting algorithms such as AdaBoost, the top-down algorithm can naturally treat multiclass classification problems. Furthermore, we propose a base learner LIN that produces linear classification functions, and carry out experiments to examine the performance of the top-down algorithm with LIN as the base learner. The results show that the algorithm can sometimes perform as well as or better than AdaBoost.

Eiji Takimoto, Akira Maruoka
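The information-theoretic claim, that any split with non-zero mutual information reduces the conditional entropy of the target, is easy to check numerically. A sketch with plain Shannon entropy standing in for the paper's more general pseudo-entropy functions; the labels are made up for illustration:

```python
import math

def entropy(labels):
    """Shannon entropy (bits) of a multiset of class labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def conditional_entropy(split):
    """Entropy of the target after a split: size-weighted average over the branches."""
    n = sum(len(part) for part in split)
    return sum(len(part) / n * entropy(part) for part in split)

labels = [1, 1, 1, 0, 0, 0, 0, 1]          # node before splitting: entropy 1.0
before = entropy(labels)
after = conditional_entropy([[1, 1, 1, 0],  # left branch
                             [0, 0, 0, 1]]) # right branch
print(before, after)
print(before > after)  # the informative split strictly reduces uncertainty
```

The difference `before - after` is exactly the mutual information between the split and the class label, which is the quantity the base learner is asked to make non-zero.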
Extraction of Primitive Motion and Discovery of Association Rules from Human Motion Data

In the past several years, more and more digital multimedia data in the form of image, video, and audio have been captured and archived. This new kind of resource is exciting, yet the sheer volume of data makes any retrieval task overwhelming and its efficient usage impossible. To deal with this deficiency, a tagging method is required so that the content of multimedia data can be browsed almost instantly. In this paper, we focus on tagging human motion data. The motion data have the following feature: movements of some body parts have influence on other body parts. We call this dependency a motion association rule. Thus, the task of tagging motion data is equal to the task of expressing motion by using motion association rules. Association rules consist of symbols, which uniquely represent basic patterns. We call these basic patterns primitive motions. Primitive motions are extracted from the motion data by segmentation and clustering processes. Finally, we discuss some experiments to discover association rules from multi-streams of motion data.

Kuniaki Uehara, Mitsuomi Shimada
Algorithmic Aspects of Boosting

We discuss algorithmic aspects of boosting techniques, such as Majority Vote Boosting [Fre95], AdaBoost [FS97], and MadaBoost [DW00a]. Considering a situation where we are given a huge amount of examples and asked to find some rule for explaining these example data, we show some reasonable algorithmic approaches for dealing with such a huge dataset by boosting techniques. Through this example, we explain how to use and how to implement “adaptivity” for scaling-up existing algorithms.

Osamu Watanabe
Automatic Detection of Geomagnetic Jerks by Applying a Statistical Time Series Model to Geomagnetic Monthly Means

A geomagnetic jerk is defined as a sudden change in the trend of the time derivative of geomagnetic secular variation. A statistical time series model is applied to monthly means of the geomagnetic eastward component obtained at 124 geomagnetic observatories to detect geomagnetic jerks objectively. The trend component in the model is expressed by a second-order spline function with variable knots. The optimum parameter values of the model, including the positions of knots, are estimated by the maximum likelihood method, and the optimum number of parameters is determined based on the Akaike Information Criterion. The geomagnetic jerks are detected objectively and automatically by regarding the determined positions of knots as the occurrence epochs. This analysis reveals that the geomagnetic jerk in 1991 is a local phenomenon, while the 1969 and 1978 jerks are confirmed to be global phenomena.

Hiromichi Nagao, Tomoyuki Higuchi, Toshihiko Iyemori, Tohru Araki
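The AIC-based selection step can be illustrated on a toy trend problem: fit models of increasing complexity and keep the one minimizing AIC. The constant-versus-line comparison below is an assumed simplification of the paper's spline-with-variable-knots setting, and the data are made up:

```python
import math

def aic(rss, n, k):
    """AIC for a Gaussian model, up to an additive constant: n*ln(RSS/n) + 2*(k+1).
    k model parameters plus one for the estimated noise variance."""
    return n * math.log(rss / n) + 2 * (k + 1)

def fit_constant(y):
    """Least-squares constant fit; returns (residual sum of squares, #parameters)."""
    m = sum(y) / len(y)
    return sum((v - m) ** 2 for v in y), 1

def fit_line(y):
    """Least-squares straight-line fit over x = 0..n-1; returns (RSS, #parameters)."""
    n = len(y)
    xs = list(range(n))
    xbar, ybar = sum(xs) / n, sum(y) / n
    b = sum((x - xbar) * (v - ybar) for x, v in zip(xs, y)) / sum((x - xbar) ** 2 for x in xs)
    a = ybar - b * xbar
    return sum((v - (a + b * x)) ** 2 for x, v in zip(xs, y)), 2

# A clearly trending series: AIC should prefer the linear model despite its extra parameter.
y = [0.1, 1.0, 2.1, 2.9, 4.2, 5.0, 5.9, 7.1]
rss_c, k_c = fit_constant(y)
rss_l, k_l = fit_line(y)
print(aic(rss_l, len(y), k_l) < aic(rss_c, len(y), k_c))  # True
```

In the paper's setting the candidate models differ in the number and positions of spline knots rather than polynomial degree, but the selection principle, penalized likelihood via AIC, is the same.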
Application of Multivariate Maxwellian Mixture Model to Plasma Velocity Distribution

Recent space plasma observations have provided us with three-dimensional velocity distributions having multiple peaks. We propose a method for analyzing such velocity distributions via a multivariate Maxwellian mixture model where each component of the model represents each of the multiple peaks. The parameters of the model are determined through the Expectation-Maximization (EM) algorithm. For the automatic judgment of the preferable number of components in the mixture model, we introduce a method of examining the number of extrema of a resulting mixture model. We show applications of our method to velocity distributions observed in the Earth’s magnetotail.

Genta Ueno, Nagatomo Nakamura, Tomoyuki Higuchi, Takashi Tsuchiya, Shinobu Machida, Tohru Araki
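The EM fitting of a mixture model can be sketched in one dimension, with Gaussian components standing in for the multivariate Maxwellians. Initialization, iteration count, and the synthetic two-peak data are illustrative assumptions:

```python
import math, random

def gauss(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(data, iters=50):
    """EM for a two-component 1-D Gaussian mixture; returns (means, variances, weights)."""
    mu = [min(data), max(data)]  # crude initialization at the extremes
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            w = [pi[j] * gauss(x, mu[j], var[j]) for j in range(2)]
            s = sum(w)
            resp.append([wi / s for wi in w])
        # M-step: re-estimate weights, means, and variances from the responsibilities.
        for j in range(2):
            nj = sum(r[j] for r in resp)
            pi[j] = nj / len(data)
            mu[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var[j] = max(sum(r[j] * (x - mu[j]) ** 2 for r, x in zip(resp, data)) / nj, 1e-6)
    return mu, var, pi

random.seed(0)
data = [random.gauss(-3, 1) for _ in range(200)] + [random.gauss(4, 1) for _ in range(200)]
mu, var, pi = em_two_gaussians(data)
print(sorted(round(m) for m in mu))  # recovers component means near the two peaks
```

The paper's additional step, choosing the number of components by counting the extrema of the fitted mixture, would sit on top of this loop, refitting with different component counts.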
Inductive Thermodynamics from Time Series Data Analysis

We propose an inductive thermodynamics based on time series data analysis. Using modern techniques developed in Statistical Science and Artificial Intelligence, we construct a mathematical model from time series data. We introduce an effective potential for a steady-state distribution and construct thermodynamics following recent work by Sekimoto and Sasa. We apply our idea to neutron noise data from a test nuclear reactor. We interpret the slow transformation of the control bar as work. The response to the transformation appears as excess heat production, in accordance with the second law.

Hiroshi H. Hasegawa, Takashi Washio, Yukari Ishimiya
Mining of Topographic Feature from Heterogeneous Imagery and Its Application to Lunar Craters

In this study, a crater detection system for a large-scale image database is proposed. The original images are grouped according to spatial frequency patterns, and both optimized parameter sets and noise reduction techniques are used to identify candidate craters. False candidates are excluded using a self-organizing map (SOM) approach. The results show that, although an accurate classification is achievable using the proposed technique, future improvements in the system’s detection process are needed.

Rie Honda, Yuichi Iijima, Osamu Konishi
Application of Neural Network Technique to Combustion Spray Dynamics Analysis

This paper presents results from several analytical and empirical combustion process investigations using data mining tools and techniques. An artificial neural network was used to analyze the performance of data in phase Doppler anemometry (PDA) and particle image velocimetry (PIV) which can measure droplet size and velocity in combustion spray. The dataset used for the analysis was obtained from measurements in a practical combustion burner. The preliminary results are discussed, and improvements to the neural network architecture are suggested. The inclusion of additional input variables and modified data pre-processing improved the results of the classification process, providing a higher level of accuracy and narrower ranges of classified droplet sizes.

Yuji Ikeda, Dariusz Mazurkiewicz
Computational Analysis of Plasma Waves and Particles in the Auroral Region Observed by Scientific Satellite

In this paper we introduce new computational techniques for extracting the attributes and/or characteristics of plasma waves and particles from enormous scientific databases. These techniques enable us to represent the characteristics of complicated phenomena using a few key parameters, and to analyze large amounts of data in a systematic way. These techniques are applied to the observational data on ion heating/acceleration phenomena in the auroral region obtained by the Akebono satellite. Finally, we evaluate our algorithms through correlation analyses between waves and particles and demonstrate their usefulness.

Yoshiya Kasahara, Ryotaro Niitsu, Toru Sato
A Flexible Modeling of Global Plasma Profile Deduced from Wave Data

As the plasma profile around the Earth varies with local time, season, and solar activity, it is important to have a means of determining it at a fine time resolution. In this paper we propose a method to determine the global plasma profile from satellite observation data. We adopt a stochastic model to represent the distribution of plasma. In the model, the parameters are determined by ABIC (Akaike’s Bayesian Information Criterion) deduced from observed wave data. The validity of our method was evaluated using simulated data, and it was found that the given distributions are successfully reconstructed by smoothing the observation data appropriately.

Yoshitaka Goto, Yoshiya Kasahara, Toru Sato
Extraction of Signal from High Dimensional Time Series: Analysis of Ocean Bottom Seismograph Data

A signal extraction method is developed based on prior knowledge of the propagation of seismic signals. To explore underground velocity structure based on OBS (Ocean Bottom Seismograph) data, it is necessary to detect reflection and refraction waves in data contaminated with a relatively large direct wave and its multiples. In this paper, we consider methods based on time series and spatial-temporal decompositions of the data. In the spatial-temporal decomposition, the difference in travel time (moveout) corresponding to the underground layer structure is utilized. The proposed methods are exemplified with real OBS data.

Genshiro Kitagawa, Tetsuo Takanami, Asako Kuwano, Yoshio Murai, Hideki Shimamura
Foundations of Designing Computational Knowledge Discovery Processes

We propose a new paradigm for computational knowledge discovery, called VOX (View Oriented eXploration). Recent research has revealed that actual discoveries cannot be achieved using only component technologies such as machine learning theory or data mining algorithms. By examining how the computer can assist actual discovery tasks, we developed a solution to this problem. Our aim is to construct a principle of computational knowledge discovery which will be used for building actual applications or discovery systems, and for accelerating such entire processes. VOX is a mathematical abstraction of knowledge discovery processes, and provides a unified description method for these processes. We present the advantages obtained by using VOX, and through an actual computational experiment we show the usefulness of this new paradigm. We also designed a programming language based on this concept, called VML (View Modeling Language), defined as an extension of the functional language ML. Finally, we present future plans and directions for this research.

Yoshinori Tamada, Hideo Bannai, Osamu Maruyama, Satoru Miyano
Computing Optimal Hypotheses Efficiently for Boosting

This paper sheds light on a strong connection between AdaBoost and several optimization algorithms for data mining. AdaBoost has been the subject of much interest as an effective methodology for classification tasks. AdaBoost repeatedly generates one hypothesis in each round, and finally it is able to make a highly accurate prediction by taking a weighted majority vote over the resulting hypotheses. Freund and Schapire have remarked that the use of simple hypotheses, such as single-test decision trees instead of huge trees, would be promising for achieving high accuracy and avoiding overfitting to the training data. One major drawback of this approach, however, is that the accuracies of simple individual hypotheses may not always be high, hence demanding a way of computing more accurate (or the most accurate) simple hypotheses efficiently. In this paper, we consider several classes of simple but expressive hypotheses, such as ranges and regions for numeric attributes, subsets of categorical values, and conjunctions of Boolean tests. For each class, we develop an efficient algorithm for choosing the optimal hypothesis.

Shinichi Morishita
Discovering Polynomials to Fit Multivariate Data Having Numeric and Nominal Variables

This paper proposes an improved version of a method for discovering polynomials to fit multivariate data containing numeric and nominal variables. Each polynomial is accompanied by a nominal condition stating when to apply it; such a nominally conditioned polynomial is called a rule. A set of such rules can be regarded as a single numeric function, and such a function can be approximated and learned by three-layer neural networks. The method selects the best among networks trained with different numbers of hidden units by a newly introduced double layer of cross-validation, and restores the final rules from the best one. Experiments on two data sets show that the proposed method works well in discovering very succinct and interesting rules, even from data containing irrelevant variables and a small amount of noise.
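The notion of a nominally conditioned polynomial can be illustrated independently of the paper's neural-network machinery: group the samples by their nominal condition and fit one polynomial (here just a degree-1 closed-form least-squares line) per group. The data layout and function names are illustrative assumptions, not the paper's method.

```python
def fit_line(points):
    """Ordinary least squares for y = a + b*x, via the closed form."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b

def fit_rules(data):
    """data: iterable of (nominal_condition, x, y) triples.
    Returns one fitted polynomial per nominal condition -- i.e. a set of
    nominally conditioned rules forming a single piecewise function."""
    groups = {}
    for cond, x, y in data:
        groups.setdefault(cond, []).append((x, y))
    return {cond: fit_line(pts) for cond, pts in groups.items()}
```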

Ryohei Nakano, Kazumi Saito
Finding of Signal and Image by Integer-Type Haar Lifting Wavelet Transform

This paper describes a new method for finding, in a time series or a reference image, portions that have the same features as target signals or images. The method uses an integer-type Haar lifting wavelet transform, whose free parameters are learned from training signals or images. The advantages of this method are that it can find portions sharing the targets' features, and that it achieves robust extraction thanks to the round-off arithmetic in the trained transform. In simulations, we show how well the method finds geomagnetic sudden commencements in time series of geomagnetic horizontal components, and extracts facial images from a snapshot.
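For readers unfamiliar with the transform family, here is one level of the standard integer-to-integer Haar transform written as lifting steps. This is the classical unparameterized form; the paper's version carries free parameters that are learned from training data, which are omitted here.

```python
def haar_lift(signal):
    """One level of integer Haar lifting: d is the detail (difference),
    s the approximation.  Exactly invertible in integer arithmetic."""
    evens, odds = signal[0::2], signal[1::2]
    d = [o - e for e, o in zip(evens, odds)]         # predict step
    s = [e + (di >> 1) for e, di in zip(evens, d)]   # update step (floor halving)
    return s, d

def haar_unlift(s, d):
    """Invert haar_lift exactly, recovering the original integer signal."""
    evens = [si - (di >> 1) for si, di in zip(s, d)]
    odds = [e + di for e, di in zip(evens, d)]
    out = []
    for e, o in zip(evens, odds):
        out += [e, o]
    return out
```

The floor-rounding (`>> 1`) in the update step is what the abstract refers to as round-off arithmetic; it keeps all coefficients integral while preserving perfect reconstruction.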

Koichi Niijima, Shigeru Takano
In Pursuit of Interesting Patterns with Undirected Discovery of Exception Rules

This paper reports our progress on interesting pattern discovery in the discovery science project. We first introduce undirected discovery of exception rules, in which a pattern represents a pair of an exception rule and its corresponding strong rule. Then, we explain scheduled discovery, exception rule discovery guided by a meta-pattern, and data mining contests as our contribution to the project. These can be classified as pattern search, pattern representation, and scheme justification from the viewpoint of research topics in interesting pattern discovery.

Einoshin Suzuki
Mining from Literary Texts: Pattern Discovery and Similarity Computation

This paper surveys our recent studies of text mining from literary works, especially classical Japanese poems, Waka. We present methods for finding characteristic patterns in anthologies of Waka poems, as well as those for finding similar poem pairs. Our aim is to obtain good results that are of interest to Waka researchers, not just to develop efficient algorithms. We report successful results in finding patterns and similar poem pairs, some of which led to new discoveries.

Masayuki Takeda, Tomoko Fukuda, Ichirō Nanri
Second Difference Method Reinforced by Grouping: A New Tool for Assistance in Assignment of Complex Molecular Spectra

This report describes the search for a new method to assist in the assignment of complex molecular spectra lacking the obviously recognizable regular patterns on which traditional methods of spectral assignment have depended. We propose the "second difference method reinforced by grouping," a computer-aided technique for picking out regular patterns buried in a list of observed line wavenumbers that appear randomly distributed. Test results suggest that this technique may be successfully applied even to cases in which the Loomis-Wood diagram method would fail. The superiority of the present method over the Loomis-Wood method in its immunity to interference from overlapping unrelated spectral bands is also demonstrated.
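The intuition behind second differences is that lines belonging to one spectral series follow a nearly quadratic progression, so their second differences are nearly constant. A minimal illustrative sketch (this is not the paper's grouping procedure; the seed, tolerance, and greedy extension are assumptions) extends a seed series by predicting the next line under that constraint:

```python
def pick_series(pool, seed, tol=0.05):
    """Starting from three seed wavenumbers, repeatedly predict the next
    line assuming a (locally) constant second difference, and pick the
    closest unassigned pool line within tol."""
    series = list(seed)
    while True:
        d2 = (series[-1] - series[-2]) - (series[-2] - series[-3])
        pred = 2 * series[-1] - series[-2] + d2   # constant-d2 extrapolation
        near = [w for w in pool if abs(w - pred) < tol and w not in series]
        if not near:
            return series
        series.append(min(near, key=lambda w: abs(w - pred)))
```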

Takehiko Tanaka
Discovery of Positive and Negative Knowledge in Medical Databases Using Rough Sets

One of the most important problems with rule induction methods is that extracted rules only partially represent information on experts' decision processes, which makes rule interpretation by domain experts difficult. In order to solve this problem, the characteristics of medical reasoning are discussed, and positive and negative rules are introduced that model medical experts' rules. Then, two search algorithms are provided for the induction of positive and negative rules. The proposed rule induction method was evaluated on medical databases; the experimental results show that the induced rules correctly represented experts' knowledge and that several interesting patterns were discovered.
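Two measures commonly used to characterize such rules are accuracy (how often the conclusion holds when the condition matches) and coverage (how much of the target class the rule captures). The sketch below is a minimal illustration of these two measures only; the paper's rough-set search algorithms are not reproduced, and the record format is an assumption.

```python
def rule_stats(records, condition, outcome):
    """Accuracy and coverage of the rule 'if condition then outcome'
    over a list of records (each record a dict of attribute values)."""
    matched = [r for r in records if condition(r)]
    positives = [r for r in records if outcome(r)]
    hit = [r for r in matched if outcome(r)]
    accuracy = len(hit) / len(matched) if matched else 0.0
    coverage = len(hit) / len(positives) if positives else 0.0
    return accuracy, coverage
```

Informally, a candidate positive rule would be required to have high accuracy, while an exclusion-style negative rule would be required to have high coverage.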

Shusaku Tsumoto
Toward the Discovery of First Principle Based Scientific Law Equations

Conventional work on scientific discovery, such as BACON, derives empirical law equations from experimental data. In recent years, SDS, which introduces mathematical admissibility constraints, has been proposed to discover first-principle-based law equations, and it has been further extended to discover law equations from passively observed data. Furthermore, SSF has been proposed to discover, through experiments, the structure of a simultaneous equation model representing an objective process. This report summarizes the progress of these studies on the discovery of first-principle-based scientific law equations, and presents future directions for this research.

Takashi Washio, Hiroshi Motoda
A Machine Learning Algorithm for Analyzing String Patterns Helps to Discover Simple and Interpretable Business Rules from Purchase History

This paper presents a new application for discovering, from purchase history, useful knowledge that can help create effective marketing strategies. We use BONSAI, a machine learning algorithm proposed by Shimozono et al. in 1994, originally developed for knowledge discovery from amino acid sequences by analyzing string patterns. To adapt BONSAI to our purpose, we translate the purchase history of customers into character strings in which each symbol represents a brand purchased by a customer, and we extend BONSAI in the following aspects: 1) while the original BONSAI generates a decision tree over regular patterns limited to substrings, we extend it to subsequences; 2) we generate rules containing not only regular patterns but also numerical attributes such as age, number of visits, and profit; 3) we extend the regular expressions so that we can test whether a certain pattern occurs in some latter part of the whole string; 4) we implement majority voting based on 1-D and 2-D region rules on top of the decision trees. Applying the BONSAI extended in this manner to the real purchase history of customers of a drugstore chain in Japan, we have succeeded in generating interesting business rules that practitioners had not yet recognized.
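Extension 1), moving from substring to subsequence patterns, means a pattern may match with gaps: a customer who bought brand A, then anything, then brand B still matches "AB". A minimal sketch of this matching predicate on brand strings (a standard idiom, not BONSAI's implementation):

```python
def is_subsequence(pattern, history):
    """True iff pattern occurs in history as a (not necessarily
    contiguous) subsequence; each character encodes one purchased brand."""
    it = iter(history)
    # 'symbol in it' advances the iterator past the first match,
    # so later pattern symbols must occur strictly afterwards.
    return all(symbol in it for symbol in pattern)
```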

Yukinobu Hamuro, Hideki Kawata, Naoki Katoh, Katsutoshi Yada
Constructing Inductive Applications by Meta-Learning with Method Repositories

This paper presents CAMLET, a platform for the automatic composition of inductive applications using method repositories that organize many inductive learning methods. CAMLET starts by constructing a basic design specification for an inductive application from the method repositories and a data type hierarchy specific to inductive learning algorithms. After instantiating the basic design with a given data set into a detailed design specification and compiling it into code, CAMLET executes it on computers, revising the constructed specification until it exceeds the goal accuracy given by the user. Having implemented CAMLET on UNIX platforms in Perl and C, we conducted case studies of constructing inductive applications for eight different data sets from the StatLog project, and compared the accuracies of the applications composed by CAMLET with those of popular inductive learning algorithms. The results show that the inductive applications composed by CAMLET achieve the best accuracy on average.

Hidenao Abe, Takahira Yamaguchi
Knowledge Discovery from Semistructured Texts

This paper surveys our recent results on knowledge discovery from semistructured texts, which contain heterogeneous structures represented by labeled trees. The aim of our study is to extract useful information from documents on the Web. First, we present theoretical results on learning rewriting rules between labeled trees. Second, we apply our method to learning HTML trees in the framework of wrapper induction. We also evaluate our algorithms on real-world HTML documents and present the results.

Hiroshi Sakamoto, Hiroki Arimura, Setsuo Arikawa
Packet Analysis in Congested Networks

This paper proposes new methods of measuring Internet traffic. They are useful for analysing the network status, especially when the traffic is heavy, i.e. the network is congested. Our first method realizes a lightweight measurement that counts only the TCP flags, which occupy 6 bits in a TCP packet. Based on these simple flag counts, we can tell whether the network is congested, and can moreover estimate the average throughput of a network connection. Our second method analyses a sequence of TCP packets using an automaton, or protocol machine. The original automaton has been used in the formal specification of the TCP protocol; however, it is not applicable to real Internet traffic. We have improved the automaton in various ways and established a modified machine, with which we can analyse Internet traffic even in the presence of packet losses.
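The flag-counting half of this can be sketched directly from the TCP header layout: the six control bits live in one byte of each packet's header, so a lightweight monitor only needs to tally bits. How the paper maps these counts to a congestion verdict is not specified in the abstract, so the sketch stops at the counting step.

```python
# TCP control bits as laid out in the TCP header (RFC 793).
FLAGS = {"FIN": 0x01, "SYN": 0x02, "RST": 0x04,
         "PSH": 0x08, "ACK": 0x10, "URG": 0x20}

def count_flags(flag_bytes):
    """Tally each TCP flag over a sequence of per-packet flag bytes
    (a packet may set several flags at once, e.g. SYN+ACK = 0x12)."""
    counts = {name: 0 for name in FLAGS}
    for b in flag_bytes:
        for name, bit in FLAGS.items():
            if b & bit:
                counts[name] += 1
    return counts
```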

Masaki Fukushima, Shigeki Goto
Visualization and Analysis of Web Graphs

We review the progress of our research on Web graphs. A Web graph is a directed graph whose nodes are Web pages and whose edges are hyperlinks between pages. Many people use bookmarks and link pages as sources of knowledge on the Internet. We developed a system for the construction, visualization, and analysis of Web graphs. For constructing and analysing large graphs, the SVD (Singular Value Decomposition) of the adjacency matrix of the graph is used. Experimental application of the system yielded discoveries that were unforeseen by other approaches. The scree plot of the singular values of the adjacency matrix is introduced, and we confirm that it can be used as a measure to evaluate a Web space.
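To make the SVD connection concrete: the adjacency matrix A has A[i][j] = 1 when page i links to page j, and the singular values summarize the link structure. A pure-Python power-iteration sketch for the largest singular value is shown below (the actual system would use a full SVD routine to obtain the whole spectrum for the scree plot; this estimates only the top value, for small matrices).

```python
import math

def largest_singular_value(A, iters=200):
    """Estimate the largest singular value of matrix A (list of rows)
    by power iteration on A^T A."""
    m, n = len(A), len(A[0])
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        Av = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
        w = [sum(A[i][k] * Av[i] for i in range(m)) for k in range(n)]
        norm = math.sqrt(sum(x * x for x in w)) or 1.0
        v = [x / norm for x in w]                 # converges to top right
    Av = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
    return math.sqrt(sum(x * x for x in Av))     # sigma_1 = ||A v_1||
```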

Sachio Hirokawa, Daisuke Ikeda
Knowledge Discovery in Auto-tuning Parallel Numerical Library

This paper proposes a parallel numerical library called ILIB that realises auto-tuning facilities with selectable calculation kernels, communication methods between processors, and various depths of loop unrolling. This auto-tuning methodology has advantages not only in the usability of the library but also in its performance. In fact, the results of our performance evaluation show that the auto-tuning, or auto-correction, of these parameters is a crucial technique for attaining high performance. The set of parameters auto-selected by this methodology also gives us several kinds of important knowledge for producing highly efficient programs, which will help us to develop other high-performance programs in general.
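The core auto-tuning idea, timing several semantically equivalent kernel variants (here, different unrolling depths) and keeping the fastest, can be sketched as follows. The kernels and tuner are illustrative stand-ins, not ILIB's implementation, which works at the Fortran/C level with parallel kernels.

```python
import timeit

def dot_unroll1(x, y):
    """Plain dot product: one multiply-add per loop iteration."""
    s = 0.0
    for i in range(len(x)):
        s += x[i] * y[i]
    return s

def dot_unroll4(x, y):
    """Same result with the loop unrolled by 4, plus a cleanup tail."""
    s = 0.0
    n = len(x) - len(x) % 4
    for i in range(0, n, 4):
        s += x[i] * y[i] + x[i + 1] * y[i + 1] \
           + x[i + 2] * y[i + 2] + x[i + 3] * y[i + 3]
    for i in range(n, len(x)):
        s += x[i] * y[i]
    return s

def autotune(kernels, x, y, repeats=5):
    """Measure each candidate kernel and return the fastest one."""
    best, best_t = None, float("inf")
    for k in kernels:
        t = min(timeit.repeat(lambda: k(x, y), number=50, repeat=repeats))
        if t < best_t:
            best, best_t = k, t
    return best
```

The selected kernel is itself a piece of knowledge: which unrolling depth wins reveals something about the machine, which is the point the abstract makes.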

Hisayasu Kuroda, Takahiro Katagiri, Yasumasa Kanada
Extended Association Algorithm Based on ROC Analysis for Visual Information Navigator

It is very important to derive association rules at high speed from huge volumes of data. However, typical fast mining algorithms for text databases tend to derive meaningless rules involving stopwords, so many researchers try to remove these noisy rules using various filters. In our research, we improve the association algorithm and develop information navigation systems for text data with a visual interface, and we also apply a dictionary to remove noisy keywords from the derived association rules. In order to remove noisy keywords automatically, we propose an algorithm based on the true positive rate and the false positive rate of ROC analysis. Moreover, in order to remove stopwords automatically from raw association rules, we introduce several threshold values from the ROC analysis into our proposed mining algorithm. We evaluate the performance of the proposed mining algorithms on a bibliographic database.
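The ROC idea can be sketched as follows: score each candidate keyword by its true positive rate and false positive rate against a relevance labelling, and discard keywords whose ROC point sits near the chance diagonal, where stopwords land because they occur in relevant and irrelevant documents alike. The margin criterion and data layout below are illustrative assumptions, not the paper's exact thresholds.

```python
def roc_filter(keyword_stats, min_margin=0.2):
    """keyword_stats: keyword -> (tp, fn, fp, tn) counts against
    relevant/irrelevant documents.  Keep keywords whose ROC point lies
    at least min_margin above the chance diagonal (TPR - FPR)."""
    kept = {}
    for kw, (tp, fn, fp, tn) in keyword_stats.items():
        tpr = tp / (tp + fn) if tp + fn else 0.0
        fpr = fp / (fp + tn) if fp + tn else 0.0
        if tpr - fpr >= min_margin:
            kept[kw] = (tpr, fpr)
    return kept
```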

Hiroyuki Kawano, Minoru Kawahara
WWW Visualization Tools for Discovering Interesting Web Pages

In this paper, we describe three WWW visualization tools based on the hyperbolic tree: the WWW Information Access System (WebMapper), the Interactive Browsing Support System (HANAVI), and the Web Site Rating System (KAGAMI). WebMapper integrates structure visualization and retrieval of WWW information; it consists of hyperbolic tree visualization and attribute-value pair query manipulation, and provides a filtering function to reduce the size of the WWW information structure. HANAVI is an interactive version of WebMapper implemented as a Java applet so that it runs in a general web browser. KAGAMI is an automatic web site rating system that outputs six evaluation parameters on a radar chart; it can promote the improvement of a web site and provide effective information about it. Our tools support web browsing, enable the discovery of interesting pages, and lead to the creation of more effective web sites.

Hironori Hiraishi, Fumio Mizoguchi
Scalable and Comprehensible Visualization for Discovery of Knowledge from the Internet

We propose visualization techniques for supporting discovery from, and comprehension of, the information space built on the Internet. Although the Internet is certainly a rich source of knowledge, discoveries from the net are often hard. On the one hand, it is difficult to find the places where useful knowledge is buried, since the information space on the Internet is huge and ill-structured. On the other hand, even when they are found, it is still difficult to read off useful knowledge, since it is scattered over a number of fine-grained pages and articles. In order to help human users find and understand concealed knowledge, we propose two levels of visualization techniques. The first level is designed to provide scalable visualizations presenting the skeletal structures of huge hierarchies; it provides sketchy maps of the entire space and helps the user navigate to portions where candidate information is stored. The second level provides more detailed and comprehensible views of a small region of the information space; it can help the user understand knowledge that is scattered over multiple articles and thus inherently hard to follow.

Etsuya Shibayama, Masashi Toyoda, Jun Yabe, Shin Takahashi
Meme Media for Re-editing and Redistributing Intellectual Assets and Their Application to Interactive Virtual Information Materialization

With the growing need for interdisciplinary and international availability, distribution, and exchange of intellectual assets, including information, knowledge, ideas, pieces of work, and tools, in re-editable and redistributable organic forms, we need new media technologies that externalize scientific, technological, and/or cultural knowledge fragments in an organic way and promote their advanced use, international distribution, reuse, and re-editing. These media may be called meme media, since they carry what R. Dawkins called 'memes'. An accumulation of memes in a society forms a meme pool that functions like a gene pool. Meme pools will bring about rapid accumulations of memes, and will require new technologies for the management and retrieval of memes. Our group had been working on the research and development of meme media, including IntelligentPad and IntelligentBox for 2D and 3D representation. In the Discovery Science Project, our group was asked to extend the meme media framework for the distribution and reuse of the project's achievements and, in addition, to apply the 3D meme media architecture to the virtual materialization of database records. The first mission has led us to the proposal of a meme pool architecture, 'Piazza', while the second has produced a framework providing visual interactive components for (1) accessing databases, (2) specifying and modifying database queries, (3) defining an interactive 3D object as a template to materialize each record in a virtual space, and (4) defining a virtual space and its coordinate system for the information materialization.

Yuzuru Tanaka
Backmatter
Metadata
Title
Progress in Discovery Science
edited by
Setsuo Arikawa
Ayumi Shinohara
Copyright year
2002
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-45884-5
Print ISBN
978-3-540-43338-5
DOI
https://doi.org/10.1007/3-540-45884-0