
1996 | Book

Learning from Data

Artificial Intelligence and Statistics V

Editors: Doug Fisher, Hans-J. Lenz

Publisher: Springer New York

Book Series: Lecture Notes in Statistics


About this book

Ten years ago Bill Gale of AT&T Bell Laboratories was the primary organizer of the first Workshop on Artificial Intelligence and Statistics. In the early days of the Workshop series it seemed clear that researchers in AI and statistics had common interests, though with different emphases, goals, and vocabularies. In learning and model selection, for example, a historical goal of AI to build autonomous agents probably contributed to a focus on parameter-free learning systems, which relied little on an external analyst's assumptions about the data. This seemed at odds with statistical strategy, which stemmed from a view that model selection methods were tools to augment, not replace, the abilities of a human analyst. Thus, statisticians have traditionally spent considerably more time exploiting prior information about the environment to model data, and developing exploratory data analysis methods tailored to their assumptions. In statistics, special emphasis is placed on model checking, making extensive use of residual analysis, because all models are 'wrong', but some are better than others. It is increasingly recognized that AI researchers and/or AI programs can exploit the same kinds of statistical strategies to good effect. Often AI researchers and statisticians emphasized different aspects of what in retrospect we might now regard as the same overriding tasks.

Table of Contents

Frontmatter

Causality

Frontmatter
1. Two Algorithms for Inducing Structural Equation Models from Data

We present two algorithms for inducing structural equation models from data. Assuming no latent variables, these models have a causal interpretation and their parameters may be estimated by linear multiple regression. Our algorithms are comparable with PC [Spirtes93] and IC [Pearl91a, Pearl91b], which rely on conditional independence. We present the algorithms and empirical comparisons with PC and IC.

Paul R. Cohen, Dawn E. Gregory, Lisa Ballesteros, Robert St. Amant
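
As a hedged illustration of the abstract's point that, once a structure without latent variables is fixed, the parameters of a structural equation model may be estimated by linear multiple regression: the chain X → Y → Z and its coefficients below are invented for this sketch, not taken from the chapter.

```python
# Minimal sketch: each path coefficient of a known SEM structure is recovered
# by ordinary least squares of a variable on its parents. The chain X -> Y -> Z
# and the coefficients are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(size=n)          # structural equation for Y
z = -0.5 * y + rng.normal(size=n)         # structural equation for Z

def ols(target, parents):
    """Least-squares estimate of the coefficients of `parents` (with intercept)."""
    design = np.column_stack([np.ones_like(target)] + parents)
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef[1:]                        # drop the intercept

print("Y ~ X :", ols(y, [x]))             # close to  0.8
print("Z ~ Y :", ols(z, [y]))             # close to -0.5
```
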
2. Using Causal Knowledge to Learn More Useful Decision Rules From Data

One of the most popular and enduring paradigms in the intersection of machine-learning and computational statistics is the use of recursive-partitioning or “tree-structured” methods to “learn” classification trees from data sets (Buntine, 1993; Quinlan, 1986). This approach applies to independent variables of all scale types (binary, categorical, ordered categorical, and continuous) and to noisy as well as to noiseless training sets. It produces classification trees that can readily be reexpressed as sets of expert systems rules (with each conjunction of literals corresponding to a set of values for variables along one branch through the tree). Each such rule produces a probability vector for the possible classes (or dependent variable values) that the object being classified may have, thus automatically presenting confidence and uncertainty information about its conclusions. Classification trees can be validated by methods such as cross-validation (Breiman et al., 1984), and they can easily be modified to handle missing data by constructing rules that exploit only the information contained in the observed variables.

Louis Anthony Cox Jr.
3. A Causal Calculus for Statistical Research

A calculus is proposed that admits two conditioning operators: ordinary Bayes conditioning, P(y | X = x), and causal conditioning, P(y | set(X = x)), that is, conditioning P(y) on holding X constant (at x) by external intervention. This distinction, which will be supported by three rules of inference, will permit us to derive probability expressions for the combined effect of observations and interventions. The resulting calculus yields simple solutions to a number of interesting problems in causal inference and should allow rank-and-file researchers to tackle practical problems that are generally considered too hard, or impossible. Examples are:
1. Deciding whether the information available in a given observational study is sufficient for obtaining consistent estimates of causal effects.
2. Deriving algebraic expressions for causal effect estimands.
3. Selecting measurements that would render randomized experiments unnecessary.
4. Selecting a set of indirect (randomized) experiments to replace direct experiments that are either infeasible or too expensive.
5. Predicting (or bounding) the efficacy of treatments from randomized trials with imperfect compliance.
Starting with nonparametric specification of structural equations, the paper establishes the semantics necessary for a theory of interventions, presents the three rules of inference, and proposes an operational definition of structural equations.

Judea Pearl
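
To make the distinction between the two conditioning operators concrete, here is a hedged toy simulation (the structural equations and probabilities are invented, not from the chapter): with an unobserved confounder, conditioning on observing X = 1 gives a different answer than setting X = 1 by intervention.

```python
# Toy illustration of P(y | X = x) versus P(y | set(X = x)) in a model with a
# confounder U. All structural equations are invented for this sketch.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
u = rng.binomial(1, 0.5, n)                            # unobserved confounder
x_obs = rng.binomial(1, 0.2 + 0.6 * u)                 # X depends on U
y_obs = rng.binomial(1, 0.1 + 0.2 * x_obs + 0.5 * u)   # Y depends on X and U

# Observational conditioning: average Y among units that happened to have X = 1.
p_y_given_x1 = y_obs[x_obs == 1].mean()

# Intervention: set X = 1 for everyone, leaving U's distribution untouched.
x_do = np.ones(n, dtype=int)
y_do = rng.binomial(1, 0.1 + 0.2 * x_do + 0.5 * u)
p_y_do_x1 = y_do.mean()

print(f"P(Y=1 | X=1)      ~ {p_y_given_x1:.3f}")   # inflated by the confounder
print(f"P(Y=1 | set(X=1)) ~ {p_y_do_x1:.3f}")      # the interventional effect
```

The two estimates differ (roughly 0.70 versus 0.55 under these invented parameters), which is exactly the gap the calculus is designed to reason about.
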
4. Likelihood-based Causal Inference

A method is given which uses subject matter assumptions to discriminate recursive models and thus point toward possible causal explanations. The assumptions alone do not specify any order among the variables — rather just a theoretical absence of direct association. We show how these assumptions, while not specifying any ordering, can when combined with the data through the likelihood function yield information about an underlying recursive order. We derive details of the method for multi-normal random variables.

Qing Yao, David Tritchler

Inference and Decision Making

Frontmatter
5. Ploxoma: Testbed for Uncertain Inference

This paper compares two formalisms for uncertain inference, Kyburg’s Combinatorial Semantics and Dempster-Shafer belief function theory, on the basis of an example from the domain of medical diagnosis. I review Shafer’s example about the imaginary disease ploxoma and show how it would be represented in Combinatorial Semantics. I conclude that belief function theory has a qualitative advantage because it offers greater flexibility of expression, and provides results about more specific classes of patients. Nevertheless, a quantitative comparison reveals that the inferences sanctioned by Combinatorial Semantics are more reliable than those of belief function theory.

H. Blau
6. Solving Influence Diagrams Using Gibbs Sampling

We describe a Monte Carlo method for solving influence diagrams. This method is a combination of stochastic dynamic programming and Gibbs sampling, an iterative Markov chain Monte Carlo method. Our method is especially useful when exact methods for solving influence diagrams fail.

Ali Jenzarli
7. Modeling and Monitoring Dynamic Systems by Chain Graphs

It is widely recognized that probabilistic graphical models provide a good framework for both knowledge representation and probabilistic inference (e.g., see [Cheeseman94], [Whittaker90]). The dynamic behaviour of a system which changes over time requires an implicit or explicit time representation. In this paper, an implicit time representation using dynamic graphical models is proposed. Our goal is to model the state of a system and its evolution over time in a richer and more natural way than other approaches together with a more suitable treatment of the inference on variables of interest.

Alberto Lekuona, Beatriz Lacruz, Pilar Lasala
8. Propagation of Gaussian belief functions

A Gaussian belief function (GBF) can be intuitively described as a Gaussian distribution over a hyperplane, whose parallel sub-hyperplanes are the focal elements. It includes as special cases non-probabilistic linear equations, statistical observations, multivariate Gaussian distributions, and vacuous belief functions. The notion of GBFs was proposed in [Dem 90b], formalized in [Sha 92] and [Liu 95a], and successfully applied in combining independent statistical models in [Liu 95b]. In this paper, we propose a join-tree computation scheme for GBFs. We first represent Dempster’s rule for combining GBFs obtained in [Liu 95a] equivalently in terms of matrix sweepings. We then show that the operations of GBFs follow the axioms of Shenoy and Shafer [ShSh 90] and justify the possibility of a join-tree computation scheme for GBFs. The result enriches the theory of local computation by extending its application to the combination of statistical models and the integration of knowledge bases. An example is carried out to illustrate how combined inference can be made in accordance with multiple statistical models using graphical belief function models.

Liping Liu
9. On Test Selection Strategies for Belief Networks

Decision making under uncertainty typically requires an iterative process of information acquisition. At each stage, the decision maker chooses the next best test (or tests) to perform, and re-evaluates the possible decisions. Value-of-information analyses provide a formal strategy for selecting the next test(s). However, the complete decision-theoretic approach is impractical and researchers have sought approximations. In this paper, we present strategies for both myopic and limited non-myopic (working with known test groups) test selection in the context of belief networks. We focus primarily on utility-free test selection strategies. However, the methods have immediate application to the decision-theoretic framework.

David Madigan, Russell G. Almond
10. Representing and Solving Asymmetric Decision Problems Using Valuation Networks

This paper deals with asymmetric decision problems. We describe a generalization of the valuation network representation and solution technique to enable efficient representation and solution of asymmetric decision problems. The generalization includes the concepts of indicator valuations and effective frames. We illustrate our technique by solving Raiffa’s oil wildcatter’s problem in complete detail.

Prakash P. Shenoy
11. A Hill-Climbing Approach for Optimizing Classification Trees

We consider the problem of minimizing the expected cost of determining the correct value of a binary-valued function when it is costly to inspect the values of its arguments. This type of problem arises in distributed computing, in the design of diagnostic expert systems, in reliability analysis of multi-component systems, and in many other applications. Any feasible solution to the problem can be described by a sequential inspection procedure which is usually represented by a binary classification tree. In this paper, we propose an efficient hill-climbing algorithm to search for the optimal or near-optimal classification trees. Computational results show that the hill-climbing approach was able to find optimal solutions for 95% of the cases tested.

Xiaorong Sun, Steve Y. Chiu, Louis Anthony Cox
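
A hedged toy sketch of the setting described in this abstract, not the chapter's algorithm: for the special case of an AND of independent binary tests, a sequential inspection order is scored by its expected cost, and a simple hill-climber swaps adjacent tests while that lowers the expectation. The costs and success probabilities below are invented.

```python
# Hill-climbing over sequential inspection orders for an AND function.
# Inspection stops at the first 0 (the AND is then known to be 0).
import itertools

costs = [4.0, 1.0, 3.0, 2.0]     # cost of inspecting each argument (invented)
probs = [0.9, 0.5, 0.95, 0.7]    # P(argument = 1) (invented)

def expected_cost(order):
    total, p_reach = 0.0, 1.0
    for i in order:
        total += p_reach * costs[i]   # pay for this inspection if we reach it
        p_reach *= probs[i]           # continue only while every test so far is 1
    return total

order = list(range(len(costs)))
improved = True
while improved:                       # hill climbing over adjacent swaps
    improved = False
    for k in range(len(order) - 1):
        candidate = order[:k] + [order[k + 1], order[k]] + order[k + 2:]
        if expected_cost(candidate) < expected_cost(order):
            order, improved = candidate, True

best = min(itertools.permutations(range(len(costs))), key=expected_cost)
print("hill-climbing order:", order, "cost:", round(expected_cost(order), 3))
print("exhaustive optimum :", list(best), "cost:", round(expected_cost(best), 3))
```
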

Search Control in Model Hunting

Frontmatter
12. Learning Bayesian Networks is NP-Complete

Algorithms for learning Bayesian networks from data have two components: a scoring metric and a search procedure. The scoring metric computes a score reflecting the goodness-of-fit of the structure to the data. The search procedure tries to identify network structures with high scores. Heckerman et al. (1995) introduce a Bayesian metric, called the BDe metric, that computes the relative posterior probability of a network structure given data. In this paper, we show that the search problem of identifying a Bayesian network—among those where each node has at most K parents—that has a relative posterior probability greater than a given constant is NP-complete, when the BDe metric is used.

David Maxwell Chickering
13. Heuristic Search for Model Structure: the Benefits of Restraining Greed

Inductive modeling or “machine learning” algorithms are able to discover structure in high-dimensional data in a nearly automated fashion. These adaptive statistical methods — including decision trees, polynomial networks, projection pursuit models, and additive networks — repeatedly search for, and add on, the model component judged best at that state. Because of the huge model space of possible components, the choice is typically greedy; that is, optimal only in the very short term. In fact, it is usual for the analyst and algorithm to be greedy at three levels: when choosing a 1) term within a model, 2) model within a family, and 3) family within a wide collection of methods. It is better, we argue, to “take a longer view” in each stage. For the first stage (term selection) examples are presented for classification using decision trees and estimation using regression. To improve the third stage (method selection) we propose fusing information from disparate models to make a combined model more robust. (Fused models merge their output estimates but also share information on, for example, variables to employ and cases to ignore.) Benefits of fusing are demonstrated on a challenging classification dataset, where the task is to infer the species of a bat from its chirps.

John F. Elder IV
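
As a minimal, hedged sketch of the "fusing" idea in this abstract (merging output estimates from disparate models), the snippet below simply averages predicted class probabilities from two unrelated learners; the dataset and model families are stand-ins, not the bat-chirp data or the fusion scheme used in the chapter.

```python
# Average the class-probability outputs of two disparate models.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)
logit = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)

fused = (tree.predict_proba(X_te) + logit.predict_proba(X_te)) / 2
print("tree  :", tree.score(X_te, y_te))
print("logit :", logit.score(X_te, y_te))
print("fused :", (fused.argmax(axis=1) == y_te).mean())
```
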
14. Learning Possibilistic Networks from Data

We introduce a method for inducing the structure of (causal) possibilistic networks from databases of sample cases. In comparison to the construction of Bayesian belief networks, the proposed framework has some advantages, namely the explicit consideration of imprecise (set-valued) data, and the realization of a controlled form of information compression in order to increase the efficiency of the learning strategy as well as approximate reasoning using local propagation techniques. Our learning method has been applied to reconstruct a non-singly connected network of 22 nodes and 24 arcs without the need of any a priori supplied node ordering.

Jörg Gebhardt, Rudolf Kruse
15. Detecting Imperfect Patterns in Event Streams Using Local Search

Recurring patterns in event streams may indicate causal influences. Such patterns have been used as the basis of software debugging. One technique, dependency detection, for finding these patterns requires exhaustive search and perfect patterns, two characteristics that are unrealistic for event streams extracted from software executions. This paper presents an enhanced version of dependency detection that uses a more flexible pattern matching scheme to extend the types of patterns detected and local search to reduce the computational demands. The new version was tested on real and artificial data to determine whether local search is effective for detecting strong patterns.

Adele E. Howe
16. Structure Learning of Bayesian Networks by Hybrid Genetic Algorithms

This paper demonstrates how genetic algorithms can be used to discover the structure of a Bayesian network from a given database with cases. The results presented, were obtained by applying four different types of genetic algorithms — SSGA (Steady State Genetic Algorithm), GAeλ (Genetic Algorithm elitist of degree λ), hSSGA (hybrid Steady State Genetic Algorithm) and the hGAeλ (hybrid Genetic Algorithm elitist of degree λ) — to simulations of the ALARM Network. The behaviour of these algorithms is studied as their parameters are varied.

Pedro Larrañaga, Roberto Murga, Mikel Poza, Cindy Kuijpers
17. An Axiomatization of Loglinear Models with an Application to the Model-Search Problem

A good strategy to save computational time in a model-search problem consists in endowing the search procedure with a mechanism of logical inference, which sometimes allows a loglinear model to be accepted or rejected on logical grounds, without resorting to the numeric test. In principle, the best inferential mechanism should be based on a complete axiomatization of loglinear models. We present a (probably incomplete) axiomatization, which can be translated into a graphical inference procedure working with directed acyclic graphs, and show how it can be applied to find an efficient solution to the model-search problem.

Francesco M. Malvestuto
18. Detecting Complex Dependencies in Categorical Data

Locating and evaluating relationships among values in multiple streams of data is a difficult and important task. Consider the data flowing from monitors in an intensive care unit. Readings from various subsets of the monitors are indicative and predictive of certain aspects of the patient’s state. We present an algorithm, Multi-stream Dependency Detection (MSDD), that facilitates discovery and assessment of the strength of such predictive relationships. We use heuristic search to guide our exploration of the space of potentially interesting dependencies to uncover those that are significant. We begin by reviewing the dependency detection technique described in [3], and extend it to the multiple stream case, describing in detail our heuristic search over the space of possible dependencies. Quantitative evidence for the utility of our approach is provided through a series of experiments with artificially-generated data. In addition, we present results from the application of our algorithm to two real problem domains: feature-based classification and prediction of pathologies in a simulated shipping network.

Tim Oates, Matthew D. Schmill, Dawn E. Gregory, Paul R. Cohen

Classification

Frontmatter
19. A Comparative Evaluation of Sequential Feature Selection Algorithms

Several recent machine learning publications demonstrate the utility of using feature selection algorithms in supervised learning tasks. Among these, sequential feature selection algorithms are receiving attention. The most frequently studied variants of these algorithms are forward and backward sequential selection. Many studies on supervised learning with sequential feature selection report applications of these algorithms, but do not consider variants of them that might be more appropriate for some performance tasks. This paper reports positive empirical results on such variants, and argues for their serious consideration in similar learning tasks.

David W. Aha, Richard L. Bankert
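
A hedged sketch of forward sequential selection, one of the variants studied in this chapter: greedily add the feature whose inclusion most improves cross-validated accuracy of a fixed base learner. The dataset and base learner below are arbitrary stand-ins, not those used in the chapter.

```python
# Forward sequential feature selection driven by cross-validated accuracy.
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
base = KNeighborsClassifier(n_neighbors=5)

selected, remaining = [], list(range(X.shape[1]))
best_score = 0.0
while remaining:
    scores = [(cross_val_score(base, X[:, selected + [f]], y, cv=5).mean(), f)
              for f in remaining]
    score, feat = max(scores)
    if score <= best_score:          # stop when no remaining feature helps
        break
    best_score = score
    selected.append(feat)
    remaining.remove(feat)

print("selected features:", selected, " cv accuracy: %.3f" % best_score)
```

Backward sequential elimination is the mirror image: start from all features and repeatedly drop the one whose removal hurts accuracy least.
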
20. Classification Using Bayes Averaging of Multiple, Relational Rule-based Models

We present a way of approximating the posterior probability of a rule-set model that is comprised of a set of class descriptions. Each class description, in turn, consists of a set of relational rules. The ability to compute this posterior and to learn many models from the same training set allows us to approximate the expectation that an example to be classified belongs to some class. The example is assigned to the class maximizing the expectation. By assuming a uniform prior distribution of models, the posterior of the model does not depend on the structure of the model: it only depends on how the training examples are partitioned by the rules of the rule-set model. This uniform distribution assumption allows us to compute the posterior for models containing relational and recursive rules. Our approximation to the posterior probability yields significant improvements in accuracy as measured on four relational data sets and four attribute-value data sets from the UCI repository. We also provide evidence that learning multiple models helps most in data sets in which there are many, apparently equally good rules to learn.

Kamal Ali, Michael Pazzani
21. Picking the Best Expert from a Sequence

We examine the problem of finding a good expert from a sequence of experts. Each expert has an “error rate”; we wish to find an expert with a low error rate. However, each expert’s error rate is unknown and can only be estimated by a sequence of experimental trials. Moreover, the distribution of error rates is also unknown. Given a bound on the total number of trials, there is thus a tradeoff between the number of experts examined and the accuracy of estimating their error rates. We present a new expert-finding algorithm and prove an upper bound on the expected error rate of the expert found. A second approach, based on the sequential ratio test, gives another expert-finding algorithm that is not provably better but which performs better in our empirical studies.

Ruth Bergman, Ronald L. Rivest
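
The tradeoff described in this abstract can be seen in a toy simulation (this is not the chapter's algorithm; the error-rate distribution and trial budget are invented): spending more trials per expert estimates each error rate more accurately but leaves budget for fewer experts.

```python
# Fixed budget of trials split between "how many experts" and "how many trials each".
import numpy as np

rng = np.random.default_rng(2)
true_errors = rng.uniform(0.05, 0.5, size=1000)   # unknown to the learner
budget = 600

for trials_per_expert in (5, 20, 60):
    n_examined = budget // trials_per_expert
    picked = []
    for err in true_errors[:n_examined]:
        empirical = rng.binomial(trials_per_expert, err) / trials_per_expert
        picked.append((empirical, err))
    chosen_true_error = min(picked)[1]             # expert with lowest estimate
    print(f"{trials_per_expert:3d} trials/expert, {n_examined:4d} experts "
          f"examined -> chosen expert's true error {chosen_true_error:.3f}")
```
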
22. Hierarchical Clustering of Composite Objects with a Variable Number of Components

This paper examines the problem of clustering a sequence of objects that cannot be described with a predefined list of attributes (or variables). In many applications, a fixed list of attributes cannot be determined without substantial pre-processing. An extension of the traditional propositional formalism is thus proposed, which allows objects to be represented as a set of components, i.e. there is no mapping between attributes and values. The algorithm used for clustering is briefly illustrated, and mechanisms to handle sets are described. Some empirical evaluations are also provided to assess the validity of the approach.

A. Ketterlin, P. Gançarski, J. J. Korczak
23. Searching for Dependencies in Bayesian Classifiers

Naive Bayesian classifiers which make independence assumptions perform remarkably well on some data sets but poorly on others. We explore ways to improve the Bayesian classifier by searching for dependencies among attributes. We propose and evaluate two algorithms for detecting dependencies among attributes and show that the backward sequential elimination and joining algorithm provides the most improvement over the naive Bayesian classifier. The domains on which the most improvement occurs are those domains on which the naive Bayesian classifier is significantly less accurate than a decision tree learner. This suggests that the attributes used in some common databases are not independent conditioned on the class and that the violations of the independence assumption that affect the accuracy of the classifier can be detected from training data.

Michael J. Pazzani
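
A minimal, hedged illustration of the "joining" idea in this abstract: when two attributes are dependent given the class (an XOR-style interaction constructed for the sketch), replacing them with their Cartesian product restores the accuracy of a naive Bayesian classifier. This shows one join step, not the full backward sequential elimination and joining search.

```python
# Naive Bayes before and after joining two dependent attributes.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import CategoricalNB

rng = np.random.default_rng(3)
n = 4000
a = rng.integers(0, 2, n)
b = rng.integers(0, 2, n)
noise = rng.integers(0, 3, n)
y = a ^ b                                        # class depends on a and b jointly

X_naive = np.column_stack([a, b, noise])
X_joined = np.column_stack([a * 2 + b, noise])   # join a and b into one attribute

nb = CategoricalNB()
print("independent attributes:", cross_val_score(nb, X_naive, y, cv=5).mean())
print("joined attribute      :", cross_val_score(nb, X_joined, y, cv=5).mean())
```
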

General Learning Issues

Frontmatter
24. Statistical Analysis of Complex Systems in Biomedicine

The future explanatory power in biomedicine will be at the molecular-genetic level of analysis (rather than the epidemiologic-demographic or anatomic-cellular levels). This is the level of complex systems. Complex systems are characterized by nonlinearity and complex interactions. It is difficult for traditional statistical methods to capture complex systems because traditional methods attempt to find the model that best fits the statistician’s understanding of the phenomenon; complex systems are difficult to understand and therefore difficult to fit with a simple model. Artificial neural networks are nonparametric regression models. They can capture any phenomena, to any degree of accuracy (depending on the adequacy of the data and the power of the predictors), without prior knowledge of the phenomena. Further, artificial neural networks can be represented, not only as formulae, but also as graphical models. Graphical models can increase analytic power and flexibility. Artificial neural networks are a powerful method for capturing complex phenomena, but their use requires a paradigm shift, from exploratory analysis of the data to exploratory analysis of the model.

Harry B. Burke
25. Learning in Hybrid Noise Environments Using Statistical Queries

We consider formal models of learning from noisy data. Specifically, we focus on learning in the probably approximately correct model as defined by Valiant. Two of the most widely studied models of noise in this setting have been classification noise and malicious errors. However, a more realistic model combining the two types of noise has not been formalized. We define a learning environment based on a natural combination of these two noise models. We first show that hypothesis testing is possible in this model. We next describe a simple technique for learning in this model, and then describe a more powerful technique based on statistical query learning. We show that the noise tolerance of this improved technique is roughly optimal with respect to the desired learning accuracy and that it provides a smooth tradeoff between the tolerable amounts of the two types of noise. Finally, we show that statistical query simulation yields learning algorithms for other combinations of noise models, thus demonstrating that statistical query specification truly captures the generic fault tolerance of a learning algorithm.

Scott E. Decatur
26. On the Statistical Comparison of Inductive Learning Methods

Experimental comparisons between statistical and machine learning methods appear with increasing frequency in the literature. However, there does not seem to be a consensus on how such a comparison is performed in a methodologically sound way. Especially the effect of testing multiple hypotheses on the probability of producing a “false alarm” is often ignored. We transfer multiple comparison procedures from the statistical literature to the type of study discussed in this paper. These testing procedures take the number of tests performed into account, thereby controlling the probability of generating “false alarms”. The multiple comparison procedures selected are illustrated on well-known regression and classification data sets.

A. Feelders, W. Verkooijen
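
A hedged sketch of the general idea: when one learner is compared against several rivals, the per-comparison p-values need a multiplicity adjustment. Below, paired t-tests over cross-validation folds are combined with a Bonferroni-Holm step-down correction; the chapter's own procedures, learners, and data sets may differ in detail.

```python
# Multiple-comparison correction when testing one learner against several others.
from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
reference = LogisticRegression(max_iter=5000)
rivals = {"tree": DecisionTreeClassifier(random_state=0), "nb": GaussianNB()}

ref_scores = cross_val_score(reference, X, y, cv=10)
pvals = {}
for name, model in rivals.items():
    scores = cross_val_score(model, X, y, cv=10)
    pvals[name] = stats.ttest_rel(ref_scores, scores).pvalue

# Holm step-down: compare the k-th smallest p-value against alpha / (m - k);
# once one hypothesis is not rejected, none of the larger p-values are either.
alpha, m = 0.05, len(pvals)
still_rejecting = True
for k, (name, p) in enumerate(sorted(pvals.items(), key=lambda kv: kv[1])):
    still_rejecting = still_rejecting and p < alpha / (m - k)
    print(f"{name}: p = {p:.4f}, rejected after Holm adjustment: {still_rejecting}")
```
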
27. Dynamical Selection of Learning Algorithms

Determining the conditions for which a given learning algorithm is appropriate is an open problem in machine learning. Methods for selecting a learning algorithm for a given domain have met with limited success. This paper proposes a new approach to predicting a given example’s class by locating it in the “example space” and then choosing the best learner(s) in that region of the example space to make predictions. The regions of the example space are defined by the prediction patterns of the learners being used. The learner(s) chosen for prediction are selected according to their past performance in that region. This dynamic approach to learning algorithm selection is compared to other methods for selecting from multiple learning algorithms. The approach is then extended to weight rather than select the algorithms according to their past performance in a given region. Both approaches are further evaluated on a set of ten domains and compared to several other meta-learning strategies.

Christopher J. Merz
28. Learning Bayesian Networks Using Feature Selection

This paper introduces a novel enhancement for learning Bayesian networks with a bias for small, high-predictive-accuracy networks. The new approach selects a subset of features that maximizes predictive accuracy prior to the network learning phase. We examine explicitly the effects of two aspects of the algorithm, feature selection and node ordering. Our approach generates networks that are computationally simpler to evaluate and display predictive accuracy comparable to that of Bayesian networks which model all attributes.

Gregory M. Provan, Moninder Singh
29. Data Representations in Learning

This paper examines the effect of varying the coarseness (or fineness) in a data representation upon the learning or recognition accuracy achievable. This accuracy is quantified by the least probability of error in recognition or the Bayes error rate, for a finite-class pattern recognition problem. We examine variation in recognition accuracy as a function of resolution, by modeling the granularity variation of the representation as a refinement of the underlying probability structure of the data. Specifically, refining the data representation leads to improved bounds on the probability of error. Indeed, this confirms the intuitive notion that more information can lead to improved decision-making. This analysis may be extended to multiresolution methods where coarse-to-fine and fine-to-coarse variations in representations are possible. We also discuss a general method to examine the effects of image resolution on recognizer performance. Empirical results in an 840-class Japanese optical character recognition task are presented. Considerable improvements in performance are observed as resolution increases from 40 to 200 ppi. However, diminished performance improvements are observed at resolutions higher than 200 ppi. These results are useful in the design of optical character recognizers. We suggest that our results may be relevant to human letter recognition studies, where such an objective evaluation of the task is required.

Geetha Srikantan, Sargur N. Srihari

EDA: Tools and Methods

Frontmatter
30. Rule Induction as Exploratory Data Analysis

This paper examines induction of decision rules for purposes of exploratory data analysis, and presents various tools and techniques for this. Decision tables provide a compact, consistent format that allows several fairly large classifiers to be examined on a page. Variation in rulesets arises naturally to a surprisingly large degree from small differences in sampling from a training set, and extreme differences can be provoked by altering parameters related to misclassification costs. We argue that this variation can be beneficial to human understanding by providing multiple views of the data. Unfortunately the goal of simplifying classifiers to improve their clarity conflicts with the requirement in some applications for accurate class probability estimates. These phenomena are explained and exhibited on several datasets, including an easily understood illustration using rulesets for a domain based on the card game Twenty-one. Techniques for displaying rulesets in the form of decision tables are also presented.

Jason Catlett
31. Non-Linear Dimensionality Reduction: A Comparative Performance Analysis

We present an analysis of the comparative performance of non-linear dimensionality reduction methods such as Non-Linear Mapping, Non-Metric Multidimensional Scaling and the Kohonen Self-Organising Feature Map for which data sets of different dimensions are used. To obtain comparative measures of how well the mapping is performed, Procrustes analysis, the Spearman rank correlation coefficient and the scatter-plot diagram are used. Results indicate that, in low dimensions, Non-Linear Mapping has the best performance especially when measured in terms of the Spearman rank correlation coefficient. The output from the Kohonen Self-Organising Feature Map is easier to interpret than the output from the other methods as it often provides a superior qualitative visual output. Also, the Kohonen Self-Organising Feature Map may outperform the other methods in a high-dimensional setting.

Olivier de Vel, Sofianto Li, Danny Coomans
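
A minimal sketch of one of the comparison measures mentioned in this abstract: the Spearman rank correlation between the original inter-point distances and the distances after a low-dimensional mapping. Metric MDS on a small toy dataset stands in here for the non-linear mapping methods compared in the chapter.

```python
# Assess a mapping by rank-correlating original and embedded pairwise distances.
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr
from sklearn.datasets import load_digits
from sklearn.manifold import MDS

X, _ = load_digits(return_X_y=True)
X = X[:300]                                   # keep the example quick

embedding = MDS(n_components=2, random_state=0).fit_transform(X)

rho, _ = spearmanr(pdist(X), pdist(embedding))
print(f"Spearman rank correlation of distances: {rho:.3f}")
```

A value near 1 means the mapping preserves the rank ordering of distances well; the chapter additionally uses Procrustes analysis and scatter-plot diagrams.
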
32. Omega-Stat: An Environment for Implementing Intelligent Modeling Strategies

Omega-Stat, a new data analysis and modeling paradigm, is built on Lisp-Stat—an object-oriented statistical programming environment. It contains extensible, reusable-component libraries for performing data management, multivariate analyses, modeling, and dynamic graphics. A point-and-click user interface allows instant access to all objects, including analysis and graphics objects. Modeling is done by adding new model objects, i.e., extended datasets, to a tree structure originally containing prototypes for linear, generalized linear, and nonlinear models. Knowledge, and methods for accessing this knowledge, are embedded within model objects and edge objects linking these models. This representation allows the modeling process to be studied by following the analysis trails of expert analysts. The objective is to provide an expert consultant that is accessible as part of man/machine interaction. Modeling strategies can then be built into Omega-Stat, by using prior knowledge and data-analytic heuristics, to guide the process of constructing the model tree and the iterative search for an “optimal” model.

E. James Harner, Hanga C. Galfalvy
33. Framework for a Generic Knowledge Discovery Toolkit

Industrial and commercial firms accumulate vast quantities of data in the course of their day-to-day business. The primary use of this data is to monitor business processes: inventory, maintenance actions, and so on. However this data contains much valuable information that, if accessible, would enhance the understanding of, and aid in improving the performance of, the processes being monitored.

Pat Riddle, Roman Fresnedo, David Newman
34. Control Representation in an EDA Assistant

To develop an initial understanding of complex data, one often begins with exploration. Exploratory data analysis (EDA) provides a set of statistical tools through which patterns in data may be extracted and examined in detail. The search space of EDA operations is enormous, too large to be explored directly in a data-driven manner. More abstract EDA procedures can be captured, however, by representations commonly used in AI planning systems. We describe an implemented planning representation for Aide, an automated EDA assistant, with a focus on control issues.

Robert St. Amant, Paul R. Cohen

Decision and Regression Tree Induction

Frontmatter
35. A Further Comparison of Simplification Methods for Decision-Tree Induction

This paper presents an empirical investigation of eight well-known simplification methods for decision trees induced from training data. Twelve data sets are considered to compare both the accuracy and the complexity of simplified trees. The computation of optimally pruned trees is used in order to give a clear definition of bias of the methods towards overpruning and underpruning. The results indicate that the simplification strategies which exploit an independent pruning set do not perform better than the others. Furthermore, some methods show an evident bias towards either underpruning or overpruning.

Donato Malerba, Floriana Esposito, Giovanni Semeraro
36. Robust Linear Discriminant Trees

We present a new method for the induction of classification trees with linear discriminants as the partitioning function at each internal node. This paper presents two main contributions: first, a novel objective function called soft entropy which is used to identify optimal coefficients for the linear discriminants, and second, a novel method for removing outliers called iterative re-filtering which boosts performance on many datasets. These two ideas are presented in the context of a single learning algorithm called DT-SEPIR, which is compared with the CART and OC1 algorithms.

George H. John
37. Tree Structured Interpretable Regression

We describe a new method of tree-based regression. The response is estimated by building an adaptive linear model which varies for different paths through the tree. The method has the following potential advantages over traditional methods: the method can naturally be applied to very large datasets in which only a small proportion of the predictors are useful, the resulting regression rules are more easily interpreted and applied, and may be more accurate in application, since the rules are derived by means of a cross-validation technique which maximizes their predictive accuracy. The system is evaluated in an empirical study and compared to traditional regression and CART systems.

David Lubinsky
38. An Exact Probability Metric for Decision Tree Splitting

ID3’s information gain heuristic is well-known to be biased towards multi-valued attributes. This bias is only partially compensated by the gain ratio used in C4.5. Several alternatives have been proposed, notably orthogonality and Beta. Gain ratio and orthogonality are strongly correlated, and all of the metrics share a common bias towards splits with one or more small expected values, under circumstances where the split likely occurred by chance. Both classical and Bayesian statistics lead to the multiple hypergeometric distribution as the posterior probability of the null hypothesis. Both gain and the chi-squared significance test are shown to arise in asymptotic approximations to the hypergeometric, revealing similar criteria for admissibility and showing the nature of their biases. Previous failures to find admissible stopping rules are traced to coupling these biased approximations with one another or with arbitrary thresholds; problems which are overcome by the hypergeometric. Empirical results show that pre-pruning should be done, as trees pruned in this way are simpler, more efficient, and no less accurate than unpruned trees. Average training time is reduced by up to 30%, and expensive post-pruning avoided.

J. Kent Martin
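
A worked toy example of the contrast drawn in this abstract: for a binary split of a binary-class node (a 2×2 table), the exact null probability of the observed table with its margins fixed is hypergeometric (the setting of Fisher's exact test), while information gain and chi-squared are asymptotic criteria. The counts below are invented; the chapter's metric generalizes this to the multiple hypergeometric.

```python
# Exact hypergeometric probability of a 2x2 split versus gain and chi-squared.
import numpy as np
from scipy.stats import chi2_contingency, hypergeom

table = np.array([[8, 2],      # left branch: 8 of class A, 2 of class B
                  [1, 9]])     # right branch: 1 of class A, 9 of class B
n = table.sum()
row, col = table.sum(axis=1), table.sum(axis=0)

# Exact probability of this table given its margins.
p_exact = hypergeom.pmf(table[0, 0], n, col[0], row[0])

def entropy(counts):
    p = counts / counts.sum()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

gain = entropy(col) - sum(row[i] / n * entropy(table[i]) for i in range(2))
chi2, p_chi2, *_ = chi2_contingency(table)

print(f"exact table probability : {p_exact:.5f}")
print(f"information gain        : {gain:.3f} bits")
print(f"chi-squared p-value     : {p_chi2:.5f}")
```
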

Natural Language Processing

Frontmatter
39. Two Applications of Statistical Modelling to Natural Language Processing

Each week the Columbia-Presbyterian Medical Center collects several megabytes of English text transcribed from radiologists’ dictation and notes of their interpretations of medical diagnostic x-rays. It is desired to automate the extraction of diagnoses from these natural language reports. This paper reports on two aspects of this project requiring advanced statistical methods. First, the identification of pairs of words and phrases that tend to appear together (collocate) uses a hierarchical Bayesian model that adjusts to different word and word pair distributions in different bodies of text. Second, we present an analysis of data from experiments to compare the performance of the computer diagnostic program to that of a panel of physician and lay readers of randomly sampled texts. A measure of inter-subject distance with respect to the diagnoses is defined for which estimated variances and covariances are easily computed. This allows statistical conclusions about the similarities and dissimilarities among diagnoses by the various programs and experts.

William DuMouchel, Carol Friedman, George Hripcsak, Stephen B. Johnson, Paul D. Clayton
40. A Model for Part-of-Speech Prediction

Robust natural language analysis systems must be able to handle words that are not in the lexicon. This paper describes a statistical model that predicts the most likely Parts-of-Speech for previously unseen words. The method uses a loglinear model to combine a number of orthographic and morphological features, and returns a probability distribution over the open word classes. The model is combined with a stochastic Part-of-Speech tagger to provide a model of context. Empirical evaluation shows that this results in significant gains in Part-of-Speech prediction accuracy over simpler methods.

Alexander Franz
41. Viewpoint-Based Measurement of Semantic Similarity between Words

A method of measuring semantic similarity between words using a knowledge-base constructed automatically from machine-readable dictionaries is proposed. The method takes into consideration the fact that similarity changes depending on situation or context, which we call ‘viewpoint’. Evaluation shows the proposed method, although based on a simply structured knowledge-base, is superior to other currently available methods.

Kaname Kasahara, Kazumitsu Matsuzawa, Tsutomu Ishikawa, Tsukasa Kawaoka
42. Part-of-Speech Tagging from “Small” Data Sets

Probabilistic approaches to part-of-speech (POS) tagging compile statistics from massive corpora such as the Lancaster-Oslo-Bergen (LOB) corpus. Training on a 900,000 token training corpus, the hidden Markov model (HMM) method easily achieves a 95 per cent success rate on a 100,000 token test corpus. However, even such large corpora contain relatively few words and new words are subsequently encountered in test corpora. For example, the million-token LOB contains only about 45,000 different words, most of which occur only once or twice. We find that 3–4 per cent of tokens in a disjoint test corpus are unseen, that is, unknown to the tagger after training, and cause a significant proportion of errors. A corpus representative of all possible tag sequences seems implausible enough, let alone a corpus that also represents, even in small numbers, enough of English to make the problem of unseen words insignificant. Experimental results confirm that this extreme course is not necessary. Variations on the HMM approach, including ending-based approaches, incremental learning strategies, and the use of approximate distributions, result in a tagger that tags unseen words nearly as accurately as seen words.

Eric Neufeld, Greg Adams
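
A bare-bones, hedged sketch of the "ending-based" idea mentioned in this abstract: estimate a tag distribution for each word suffix seen in training and back off from the longest matching suffix when an unseen word appears. The tiny training list is invented; a real tagger would train on a corpus such as the LOB and combine this guess with the HMM's context model.

```python
# Suffix-based tag prediction for unseen words.
from collections import Counter, defaultdict

training = [("running", "VBG"), ("walking", "VBG"), ("quickly", "RB"),
            ("happily", "RB"), ("happiness", "NN"), ("darkness", "NN")]

suffix_tags = defaultdict(Counter)
for word, tag in training:
    for k in range(1, min(4, len(word)) + 1):      # suffixes of length 1..4
        suffix_tags[word[-k:]][tag] += 1

def predict_tag(word):
    for k in range(min(4, len(word)), 0, -1):      # longest matching suffix first
        counts = suffix_tags.get(word[-k:])
        if counts:
            return counts.most_common(1)[0][0]
    return "NN"                                    # default when nothing matches

for unseen in ("jogging", "sadness", "slowly"):
    print(unseen, "->", predict_tag(unseen))
```
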
Backmatter
Metadata
Title
Learning from Data
Editors
Doug Fisher
Hans-J. Lenz
Copyright Year
1996
Publisher
Springer New York
Electronic ISBN
978-1-4612-2404-4
Print ISBN
978-0-387-94736-5
DOI
https://doi.org/10.1007/978-1-4612-2404-4