
1994 | Book

Selecting Models from Data

Artificial Intelligence and Statistics IV

Edited by: P. Cheeseman, R. W. Oldford

Publisher: Springer New York

Book series: Lecture Notes in Statistics


About this book

This volume is a selection of papers presented at the Fourth International Workshop on Artificial Intelligence and Statistics held in January 1993. These biennial workshops have succeeded in bringing together researchers from Artificial Intelligence and from Statistics to discuss problems of mutual interest. The exchange has broadened research in both fields and has strongly encouraged interdisciplinary work. The theme of the 1993 AI and Statistics workshop was: "Selecting Models from Data". The papers in this volume attest to the diversity of approaches to model selection and to the ubiquity of the problem. Both statistics and artificial intelligence have independently developed approaches to model selection and the corresponding algorithms to implement them. But as these papers make clear, there is a high degree of overlap between the different approaches. In particular, there is agreement that the fundamental problem is the avoidance of "overfitting", i.e., where a model fits the given data very closely but is a poor predictor for new data; in other words, the model has partly fitted the "noise" in the original data.

Table of contents

Frontmatter

Overviews: Model Selection

Frontmatter
1. Statistical strategy: step 1

Before one can select a model one must formulate the research question that the model is being built to address. This formulation — deciding precisely what it is one wants to know — is the first step in using statistical methods. It is the first step in any statistical strategy. This paper contends that often the research question is poorly formulated, so that inappropriate models are built and inappropriate analyses undertaken. To illustrate this, three examples are given: explanatory versus pragmatic comparisons in clinical trials, confused definitions of interaction, and two ways to measure relative change. A plea is made that statistics teaching should focus on higher-level strategic issues rather than on mathematical manipulation, since the latter is now performed by the computer.

D. J. Hand
2. Rational Learning: Finding a Balance Between Utility and Efficiency

Learning is an important aspect of intelligent behavior. Unfortunately, learning rarely comes for free. Techniques developed by machine learning can improve the abilities of an agent but they often entail considerable computational expense. Furthermore, there is an inherent tradeoff between the power and efficiency of learning techniques. This poses a dilemma for a learning agent that must act in the world under a variety of resource constraints. This article considers the problem of rational learning: algorithms that dynamically adjust their behavior based on the larger context of overall performance goals and resource constraints.

Jonathan Gratch, Gerald DeJong, Yuhong Yang
3. A new criterion for selecting models from partially observed data

A new criterion PDIO (predictive divergence for indirect observation models) is proposed for selecting statistical models from partially observed data. PDIO is devised for "indirect observation models", in which observations are only available indirectly through random variables. That is, some underlying hidden structure is assumed to generate the manifest variables. Examples include unsupervised learning recognition systems, clustering, latent structure analysis, mixture distribution models, missing data, noisy observations, and, more generally, models whose maximum likelihood estimator is computed by the EM (expectation-maximization) algorithm. PDIO is a natural extension of AIC (Akaike's information criterion), and the two criteria are equivalent when direct observations are available. Both criteria are expressed as the sum of two terms: the first term represents the goodness of fit of the model to the observed data, and the second term represents the model complexity. The goodness-of-fit terms are equivalent in both criteria, but the complexity terms are different. The complexity term is a function of model structure and the number of samples and is added in order to take into account the reliability of the observed data. A mean fluctuation of the estimated true distribution is used as the model complexity in PDIO. The relative relation of the "model manifold" and the "observed manifold" is, therefore, reflected in the complexity term of PDIO from the information geometric point of view, whereas it reduces to the number of parameters in AIC. PDIO is unique in dealing with the unobservable underlying structure "positively." In this paper the generalized expression of PDIO is given using two Fisher information matrices. An approximate computation method for PDIO, utilizing EM iterates, is also presented. Computer simulations demonstrate how the criterion works.

Hidetoshi Shimodaira
4. Small-sample and large-sample statistical model selection criteria

Statistical model selection criteria provide answers to the questions, "How much improvement in fit should be achieved to justify the inclusion of an additional parameter in a model, and on what scale should this improvement in fit be measured?" Mathematically, statistical model selection criteria are defined as estimates of suitable functionals of the probability distributions corresponding to alternative models. This paper discusses different approaches to model-selection criteria, with a view toward illuminating their similarities and differences. The approaches discussed range from explicit, small-sample criteria for highly specific problems to general, large-sample criteria such as Akaike's information criterion and variants thereof. Special emphasis is given to criteria derived from a Bayesian approach, as this presents a unified way of viewing a variety of criteria. In particular, the approach to model-selection criteria by asymptotic expansion of the log posterior probabilities of alternative models is reviewed. An information-theoretic approach to model selection, through minimum-bit data representation, is explored. The similarity of the asymptotic form of Rissanen's criterion, obtained from a minimum-bit data representation approach, to criteria derived from a Bayesian approach is discussed.

S. L. Sclove
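As a rough, hedged illustration of the large-sample criteria surveyed above (not code from the chapter): AIC, its small-sample correction AICc, and the Bayesian-motivated BIC computed for nested polynomial regressions. The simulated data, the polynomial model family, and the convention of counting the error variance as an extra parameter are assumptions made here for concreteness.

```python
# Illustrative comparison of AIC, AICc and BIC for nested linear models (assumed example).
import numpy as np

def gaussian_criteria(y, X, n_params_extra=1):
    """Return AIC, AICc and BIC for an OLS fit with Gaussian errors.

    n_params_extra accounts for the estimated error variance."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    k = X.shape[1] + n_params_extra                 # regression coefficients + sigma^2
    loglik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)
    aic = -2 * loglik + 2 * k
    aicc = aic + 2 * k * (k + 1) / (n - k - 1)      # small-sample correction
    bic = -2 * loglik + k * np.log(n)
    return aic, aicc, bic

rng = np.random.default_rng(0)
n = 30
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)   # true relationship is linear
for degree in (1, 2, 5):
    X = np.vander(x, degree + 1)                    # polynomial design matrix
    print(degree, gaussian_criteria(y, X))
```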
5. On the choice of penalty term in generalized FPE criterion

In variable selection, many existing selection criteria are closely related to the generalized final prediction error (FPE) criterion. In the linear regression context, the FPE criterion amounts to minimizing C(k, λ) = RSS(k) + λkσ̂² over k, where k is the number of parameters in the model, RSS(k) is the residual sum of squares and σ̂² is some estimate of the error variance. Different values of λ give different selection criteria. This article presents some useful results on the choice of λ. Some insights are obtained. Application to the study of the multifold cross validation criterion is also discussed.

Ping Zhang
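A minimal sketch of the generalized FPE criterion C(k, λ) = RSS(k) + λkσ̂² quoted above, applied to choosing a polynomial degree. The simulated data, the use of the largest candidate model to estimate σ̂², and the two example values of λ are illustrative assumptions, not taken from the chapter.

```python
# Selecting the number of parameters k by minimizing C(k, lambda) = RSS(k) + lambda*k*sigma2_hat.
import numpy as np

def rss(y, X):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta) ** 2))

rng = np.random.default_rng(1)
n = 50
x = rng.uniform(-1, 1, size=n)
y = np.sin(2 * x) + rng.normal(scale=0.3, size=n)

max_k = 8
designs = {k: np.vander(x, k) for k in range(1, max_k + 1)}   # k = number of parameters
sigma2_hat = rss(y, designs[max_k]) / (n - max_k)             # variance estimate from the largest model

for lam in (2.0, np.log(n)):      # lambda = 2 mimics AIC/Cp, lambda = log n mimics BIC
    scores = {k: rss(y, X) + lam * k * sigma2_hat for k, X in designs.items()}
    best = min(scores, key=scores.get)
    print(f"lambda={lam:.2f}: chose k={best}")
```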
6. Cross-Validation, Stacking and Bi-Level Stacking: Meta-Methods for Classification Learning

So many methods have been invented for the problem of classification learning that practitioners now face a meta-level problem of choosing among alternatives or arbitrating between them. A natural idea is to use cross-validation to select one of several learning algorithms. A more general approach is Wolpert’s stacking, which uses the predictions of several methods together; and this may be further generalized to bi-level stacking, in which ordinary attributes are allowed to play a role in arbitrating between methods. In this paper, we examine cross-validation, stacking and bi-level stacking and present empirical results to illustrate key points.

Cullen Schaffer
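The sketch below illustrates, on assumed data with an assumed trio of base learners, the two meta-methods contrasted above: selecting a single learner by cross-validation versus stacking the learners' out-of-fold predictions with a simple meta-learner. scikit-learn is used purely for convenience and is not what the chapter used.

```python
# (1) Pick one learner by cross-validation; (2) stack out-of-fold predictions (Wolpert-style).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
learners = {
    "tree": DecisionTreeClassifier(random_state=0),
    "knn": KNeighborsClassifier(),
    "logreg": LogisticRegression(max_iter=1000),
}

# Cross-validation as a meta-method: select the single best base learner.
cv_scores = {name: cross_val_score(est, X, y, cv=5).mean() for name, est in learners.items()}
print("selected by CV:", max(cv_scores, key=cv_scores.get))

# Stacking: use out-of-fold predicted probabilities of all learners as meta-features.
meta_features = np.column_stack([
    cross_val_predict(est, X, y, cv=5, method="predict_proba")[:, 1]
    for est in learners.values()
])
# Bi-level stacking would additionally append the original attribute columns of X here.
meta_acc = cross_val_score(LogisticRegression(), meta_features, y, cv=5).mean()
print("stacked meta-learner accuracy (rough estimate; nested CV would be cleaner):", meta_acc)
```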
7. Probabilistic approach to model selection: comparison with unstructured data set

The problem of model selection from data is discussed. This is the problem of finding the internal structure of a given data set, as in signal or image segmentation, cluster analysis, and curve and surface fitting. In many cases there is limited background information, and the given data set is almost the only basis for analysis and decisions about the model. A brief review of different approaches to the problem is presented, and a probabilistic approach, based on comparing the results obtained for a given data set with those obtained for a set of unstructured data, is introduced. This approach is presented in greater detail for two problems of regression analysis: selecting the best subset of regressors and finding the best piecewise regression. Some theoretical results are presented and relations with other approaches to model selection are discussed.

Victor L. Brailovsky
8. Detecting and Explaining Dependencies in Execution Traces

AI systems in complex environments can be hard to understand. We present a simple method for finding dependencies between actions and later failures in execution traces of the Phoenix planner. We also discuss failure recovery analysis, a method for explaining dependencies discovered in the execution traces of Phoenix’s failure recovery behavior.

Adele E. Howe, Paul R. Cohen
9. A method for the dynamic selection of models under time constraints

Finding a model of a complex system that is at the right level of detail for a specific purpose is a difficult task. Under a time constraint for decision-making, we may prefer less complex models that are less accurate over more accurate models that require longer computation times. We can define the optimal model to select under a time constraint, but we cannot compute the optimal model in time to be useful. We present a heuristic method to select a model under a time constraint; our method is based on searching a set of alternative models that are organized as a graph of models (GoM). We define the application-specific level of prediction accuracy that is required for a model to be adequate, then use the probability of model adequacy as a metric during the search for a minimally complex, adequate model. We compute an approximate posterior probability of adequacy by applying a belief network to compute the prior probability of adequacy for models in the GoM, then by fitting the models under consideration to the quantitative observations. We select the first adequate model that we find, then refine the model selection by searching for the minimally complex, adequate model. We describe work in progress to implement this method to solve a model-selection problem in the domain of physiologic models of the heart and lungs.

Geoffrey Rutledge, Ross Shachter

Graphical Methods

Frontmatter
10. Strategies for Graphical Model Selection

We consider the problem of model selection for Bayesian graphical models, and embed it in the larger context of accounting for model uncertainty. Data analysts typically select a single model from some class of models, and then condition all subsequent inference on this model. However, this approach ignores model uncertainty, leading to poorly calibrated predictions: it will often be seen in retrospect that one’s uncertainty bands were not wide enough. The Bayesian analyst solves this problem by averaging over all plausible models when making inferences about quantities of interest. In many applications, however, because of the size of the model space and awkward integrals, this averaging will not be a practical proposition, and approximations are required. Here we examine the predictive performance of two recently proposed model averaging schemes. In the examples considered, both schemes outperform any single model that might reasonably have been selected.

David Madigan, Adrian E. Raftery, Jeremy C. York, Jeffrey M. Bradshaw, Russell G. Almond
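A hedged sketch of the model-averaging idea described above, for linear regression rather than graphical models: each candidate model's prediction is weighted by an approximate posterior model probability, here obtained from BIC. This is a common approximation chosen for brevity, not the chapter's averaging scheme.

```python
# Bayesian model averaging over all subsets of three predictors, with BIC-based weights.
import numpy as np
from itertools import combinations

def fit_and_score(y, X):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    n, k = X.shape
    rss = np.sum((y - X @ beta) ** 2)
    loglik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)
    bic = -2 * loglik + (k + 1) * np.log(n)          # +1 for the error variance
    return beta, bic

rng = np.random.default_rng(2)
n = 80
X_full = rng.normal(size=(n, 3))
y = 1.5 * X_full[:, 0] - 0.5 * X_full[:, 1] + rng.normal(size=n)

x_new = np.array([0.2, -1.0, 0.5])                   # a new case to predict
models, bics, preds = [], [], []
for r in range(1, 4):
    for subset in combinations(range(3), r):
        X = np.column_stack([np.ones(n), X_full[:, subset]])
        beta, bic = fit_and_score(y, X)
        models.append(subset)
        bics.append(bic)
        preds.append(beta @ np.concatenate([[1.0], x_new[list(subset)]]))

weights = np.exp(-0.5 * (np.array(bics) - min(bics)))
weights /= weights.sum()                             # approximate posterior model probabilities
print("BMA prediction:", float(np.dot(weights, preds)))
print("single best model prediction:", float(preds[int(np.argmin(bics))]))
```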
11. Conditional dependence in probabilistic networks

In general, a probabilistic network is considered a representation of a set of conditional independence statements. However, probabilistic networks also represent dependencies. In this paper an axiomatic characterization of conditional dependence is given. Furthermore, a criterion is given for reading conditional dependencies from a probabilistic network.

Remco R. Bouckaert
12. Reuse and sharing of graphical belief network components

A team of experts assemble a graphical belief network from many small pieces. This paper catalogs the types of knowledge that comprise a graphical belief network and proposes a way in which they can be stored in libraries. This promotes reuse of model components both within the team and between projects.

Russell Almond, Jeffrey Bradshaw, David Madigan
13. Bayesian Graphical Models for Predicting Errors in Databases

In recent years, much attention has been directed at various graphical "conditional independence" models and at the application of such graphical models to probabilistic expert systems. However, there exists a broad range of statistical problems to which Bayesian graphical models, in particular, can be applied. Here we demonstrate the simplicity and flexibility of Bayesian graphical models for one important class of statistical problems, namely, predicting the number of errors in a database. We consider three approaches and show how additional approaches can easily be developed using the framework described here.

David Madigan, Jeremy C. York, Jeffrey M. Bradshaw, Russell G. Almond
14. Model Selection for Diagnosis and Treatment Using Temporal Influence Diagrams

This paper describes the model-selection process for a temporal influence diagram formalization of the diagnosis and treatment of acute abdominal pain. The temporal influence diagram explicitly models the temporal aspects of the diagnosis/treatment process as the findings (signs and symptoms) change over time. Given the complexity of this temporal modeling process, there is uncertainty in selecting (1) the model for the initial time interval (from which the process evolves over time); and (2) the stochastic process describing the temporal evolution of the system. Uncertainty in selecting the initial Bayesian network model is addressed by sensitivity analysis. The choice of most appropriate stochastic process to model the temporal evolution of the system is also discussed briefly.

Gregory M. Provan
15. Diagnostic systems by model selection: a case study

Probabilistic systems for diagnosing blue babies are constructed by model selection methods applied to a database of cases. Their performance is compared with that of a system built primarily by use of expert knowledge. Results indicate that purely automatic methods do not quite perform at the level of expert-based systems, but when expert knowledge is incorporated properly, the methods look very promising.

S. L. Lauritzen, B. Thiesson, D. J. Spiegelhalter
16. A Survey of Sampling Methods for Inference on Directed Graphs

We survey methods for performing inferences from measurement data to unknowns within a Bayesian network expressed as a directed graph, in which each node represents a scalar quantity capable of taking on values within a continuous range.

Andrew Runnalls
17. Minimizing decision table sizes in influence diagrams: dimension shrinking

One goal in evaluating an influence diagram is to compute an optimal decision table for each decision node. More often than not, one is able to shrink the sizes of some of the optimal decision tables without any loss of information. This paper investigates when the opportunities for such shrinking arise and how we can detect them as early as possible so as to avoid unnecessary computations. One type of shrinking, namely dimension shrinking, is studied. A relationship between dimension shrinking and what we call lonely arcs is established, which enables us to exploit opportunities for dimension shrinking by pruning lonely arcs at a preprocessing stage.

Nevin Lianwen Zhang, Runping Qi, David Poole
18. Models from Data for Various Types of Reasoning

Often the primary objective of constructing a model from data is to model the phenomenon that produces the data in such a way that the model is useful for the desired type of reasoning. In many graph models of probabilistic knowledge, various aspects of these phenomena are represented by nodes while the edges represent the probabilistic dependencies among them. We demonstrate one type of reasoning objective that is better served by models in which some qualitative relationships of the phenomena are used as the edges and hyperedges of the graph models. We also outline our methods for handling such models for reasoning and for learning the qualitative relationships of a domain from empirical data.

Raj Bhatnagar, Laveen N. Kanal

Causal Models

Frontmatter
19. Causal inference in artificial intelligence

In recent years, artificial intelligence researchers have paid a great deal of attention to the statistical literature on recursive structural equation models and graphical models, especially to the literature on directed acyclic independence graphs. Some researchers have argued that graphical models are useful in studying causal relationships among variables, and that causal relations can be read off the conditional independence graph. Some authors have even attempted to use the data to derive both a causal ordering and causal relationships. A number of others have refrained from imparting a causal interpretation to graphical models. This paper discusses the subject of causation and causal inference in general, and then applies the discussion to the case of graphical models. Specifically, a distinction between causation and causal inference is made; the inferential activity consists of making statements about causation. Using a counterfactual account of the causal relation that derives from work in statistics, I show that the usual types of inferences made in the structural equations literature and the graphical models literature do not square with this account; the results are relevant because a number of authors have claimed otherwise.

Michael E. Sobel
20. Inferring causal structure among unmeasured variables

Linear structural equation models with latent variables are common in psychology, econometrics, sociology, and political science. Such models have two parts, a measurement model that specifies how the latent variables are measured, and a structural model which specifies the causal relations among the latent variables. In this paper I discuss search procedures for finding a ‘pure’ measurement model and for using this pure model to determine features of the structural model. The procedures are implemented as the Purify and MIMbuild modules of the TETRAD II program.

Richard Scheines
21. When can association graphs admit a causal interpretation?

We discuss essentially linear structures which are adequately represented by association graphs called covariance graphs and concentration graphs. These do not explicitly indicate a process by which data could be generated in a stepwise fashion. Therefore, on their own, they do not suggest a causal interpretation. By contrast, each directed acyclic graph describes such a process and may offer a causal interpretation whenever this process is in agreement with substantive knowledge about causation among the variables under study. We derive conditions and procedures to decide for any given covariance graph or concentration graph whether all their pairwise independencies can be implied by some directed acyclic graph.

Judea Pearl, Nanny Wermuth
22. Inference, Intervention, and Prediction

What can be predicted when the causal structure and the joint distribution among a set of random variables are known that cannot be predicted when only the joint distribution over the set of random variables is known? The answer is that with the former we can predict the effects of intervening in a system by manipulating the values of certain variables, while with only the latter we cannot. For example, if we know only the joint distribution of smoking and lung cancer, we cannot determine whether stopping people from smoking will reduce the rates of lung cancer. On the other hand, if we also know that smoking causes lung cancer (and that there is no common cause of smoking and lung cancer), we can predict that stopping smoking will reduce lung cancer, and by how much. As the quotations at the beginning of the article show, there is a debate within the statistics community about whether predicting the effects of interventions from passive observations is possible at all. In this paper we will describe some recent work for predicting the effects of interventions and policies given passive observations and some background knowledge. While addressing some of the claims just considered, the results we describe unify two traditions in statistics: one, beginning at least as early as Sewall Wright ([Wright 34]; [Simon 77]; [Blalock 61]; [Kiiveri 82]; [Wermuth 83]; [Lauritzen 84]; [Kiiveri 84]; [Wright 34]; [Glymour 87]; [Pearl 88]), connects directed acyclic graphs with probability distributions through constraints on conditional independence relations, while the other ([Neyman 35]; [Rubin 77]; [Rubin 78]; [PearVerm 91]) connects causal claims with "counterfactual distributions" and offers rules for predicting the distribution of a variable that will result if other variables are deliberately manipulated ([Rubin 77]; [Pratt 88]). We consider the following two cases.

Peter Spirtes, Clark Glymour
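The toy calculation below, with made-up probabilities, illustrates the abstract's central point: given a known causal structure (here X and Y confounded by an observed Z), the post-intervention distribution P(Y | do(X)) can be computed by back-door adjustment and generally differs from the ordinary conditional P(Y | X). The numbers and the assumption that the confounder is observed are illustrative, not from the chapter.

```python
# Assumed binary structural model: Z -> X, Z -> Y, X -> Y.
p_z = {0: 0.7, 1: 0.3}                               # P(Z=z)
p_x_given_z = {0: 0.2, 1: 0.8}                       # P(X=1 | Z=z)
p_y_given_xz = {(0, 0): 0.05, (0, 1): 0.10,          # P(Y=1 | X=x, Z=z)
                (1, 0): 0.20, (1, 1): 0.30}

def p_y1_do_x(x):
    """Back-door adjustment: average P(Y=1 | X=x, Z=z) over the marginal of Z."""
    return sum(p_y_given_xz[(x, z)] * p_z[z] for z in (0, 1))

def p_y1_given_x(x):
    """Ordinary conditioning: Z is weighted by P(Z=z | X=x), which mixes in confounding."""
    joint, px = {}, {0: 0.0, 1: 0.0}
    for z in (0, 1):
        for xv in (0, 1):
            pzx = p_z[z] * (p_x_given_z[z] if xv == 1 else 1 - p_x_given_z[z])
            joint[(xv, z)] = pzx
            px[xv] += pzx
    return sum(p_y_given_xz[(x, z)] * joint[(x, z)] / px[x] for z in (0, 1))

print("P(Y=1 | X=1)     =", round(p_y1_given_x(1), 3))   # observational prediction
print("P(Y=1 | do(X=1)) =", round(p_y1_do_x(1), 3))      # interventional prediction
```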
23. Attitude Formation Models: Insights from TETRAD

The goal of this paper is to examine the use of TETRAD to generate attitude formation models from data. We review the controversy regarding models of attitude formation in the psychology literature, and we re-analyze the data using different subroutines from TETRAD to shed further light on this controversy.

Sanjay Mishra, Prakash P. Shenoy
24. Discovering Probabilistic Causal Relationships: A Comparison Between Two Methods

This paper presents a comparison between two different approaches to statistical causal inference, namely Glymour et al.’s approach based on constraints on correlations and Pearl and Verma’s approach based on conditional independencies. The methods differ both in the kind of constraints considered while selecting a causal model and in the way they search for the model which better fits the sample data. Some experiments show that they are complementary in several aspects.

Floriana Esposito, Donato Malerba, Giovanni Semeraro
25. Path Analysis Models of an Autonomous Agent in a Complex Environment

We seek explanatory models of how and why AI systems work in particular environments. We are not satisfied to demonstrate performance; we want to understand it. In terms of data and models, this means we are not satisfied with descriptive summaries, nor even with predictive models. We want causal models. In this brief abstract we will present descriptive, predictive and causal models of the behavior of agents that fight simulated forest fires. We will describe the shortcomings of descriptive and predictive models, and summarize path analysis, a common technique for inducing causal models.

Paul R. Cohen, David M. Hart, Robert St. Amant, Lisa A. Ballesteros, Adam Carlson

Particular Models

Frontmatter
26. A Parallel Constructor of Markov Networks

PaCCIN is a program to build Markov networks from data, implemented in C on an nCube/ten MIMD parallel computer with 1024 nodes. We describe the algorithm on which the program is based, the program itself, and some experimental results.

Randy Mechling, Marco Valtorta
27. Capturing observations in a nonstationary hidden Markov model

This paper is concerned with the problem of morphological ambiguities using a Markov process. The problem here is to estimate interferent solutions that might be derived from a morphological analysis. We start by using a Markov chain with one long sequence of transitions. In this model the states are the morphological features and a sequence corresponds to a transition from one feature to another. After observing an inadequacy of this model, we explore a nonstationary hidden Markov process. Among the main advantages of this latter model is the possibility of assigning a type to a text, given some training samples. Therefore, a recognition of "style", or the creation of a new one, might be developed.

Djamel Bouchaffra, Jacques Rouault
28. Extrapolating Definite Integral Information

We propose a probability-logic based representation of temporal knowledge that relates point and interval information through temporal integration. We illustrate the use of deduction and (probabilistic) induction to formally specify rational inference under uncertainty. In particular, we consider a previously unexplored aspect of the frame problem: action and time invariance of statistical knowledge.

Scott D. Goodwin, Eric Neufeld, André Trudel
29. The Software Reliability Consultant

In this paper, we provide a concise survey of software reliability modeling procedures as an indication of the complex variety of such procedures. We then provide a preliminary design for the Software Reliability Consultant (SRC), describing the current state of its development as well as future plans. SRC provides guidance in the choice of an appropriate performance measure and an effective modeling procedure for an arbitrary software reliability data set.

George J. Knafl, Andrej Semrl
30. Statistical Reasoning to Enhance User Modelling in Consulting Systems

This paper describes methods that combine statistical reasoning with qualitative reasoning in the process of creating and updating stereotypes of users. If a consulting system is equipped with stereotype-based user modelling techniques, it is able to predict a tentative user model before a more complete individual user model is available. The hope is that such a system is better able to adjust its actions to the user. Moreover, if information about the users of the system has been accumulated during sessions, this information can be utilised in updating the stereotypes. In this paper these ideas are illustrated in the context of statistical consulting systems.

Paula Hietala
31. Selecting a frailty model for longitudinal breast cancer data

Different units often exhibit different failure rates. Frailty models allow for heterogeneity across units over and above the variation attributable to covariates included in the analysis. This paper reviews frailty model methods, using parametric and semiparametric specifications, for longitudinal breast cancer data.

D. Moreira dos Santos, R. B. Davies
32. Optimal design of reflective sensors using probabilistic analysis

Linear stepper, or Sawyer, motors have become popular in robotic mechanisms because of their high positional accuracy in open-loop control mode [1]. Precision and repeatability are prerequisites in manufacturing and assembly. However, the motor's actual position becomes uncertain when it is subject to external forces. Position sensors mounted on the motors can solve this problem and provide for force control [2]. This paper describes a sensor, a technique for determining the robot's position, and an analysis technique for determining the optimal sensor configuration. Reflective optical sensors are used to generate raw data which is scaled and then processed using Bayesian probability methods. We wanted to analyze different sensor configurations by marginalizing the performance over predicted data. Since marginalizing over the entire state space is infeasible due to its size, Monte Carlo techniques are used to approximate the result of the marginalization. We implemented the positioning technique and measured its performance experimentally; the sensors estimated the robot's position to within 2/1000″, in line with the probabilistic analysis.

Aaron Wallack, Edward Nicolson

Similarity-Based Models

Frontmatter
33. Learning to Catch: Applying Nearest Neighbor Algorithms to Dynamic Control Tasks

This paper examines the hypothesis that locally weighted variants of k-nearest neighbor algorithms can support dynamic control tasks. We evaluated several k-nearest neighbor (k-NN) algorithms on the simulated learning task of catching a flying ball. Previously, local regression algorithms have been advocated for this class of problems. These algorithms, which are variants of k-NN, base their predictions on a (possibly weighted) regression computed from the k nearest neighbors. While they outperform simpler k-NN algorithms on many tasks, they have trouble on this ball-catching task. We hypothesize that the non-linearities in this task are the cause of this behavior, and that local regression algorithms may need to be modified to work well under similar conditions.

David W. Aha, Steven L. Salzberg
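A small sketch, on assumed one-dimensional data rather than the ball-catching simulator, contrasting the two families compared above: a distance-weighted k-NN average and a local linear regression fitted to the k nearest neighbors, queried near a sharp non-linearity where the two can behave differently.

```python
# Weighted k-NN prediction versus local linear regression on the k nearest neighbors.
import numpy as np

def knn_predict(Xtr, ytr, x, k=5, eps=1e-8):
    d = np.abs(Xtr - x)
    idx = np.argsort(d)[:k]
    w = 1.0 / (d[idx] + eps)                      # inverse-distance weights
    return np.sum(w * ytr[idx]) / np.sum(w)

def local_linear_predict(Xtr, ytr, x, k=5):
    idx = np.argsort(np.abs(Xtr - x))[:k]
    A = np.column_stack([np.ones(k), Xtr[idx]])   # local linear fit on the k neighbors
    beta, *_ = np.linalg.lstsq(A, ytr[idx], rcond=None)
    return beta[0] + beta[1] * x

rng = np.random.default_rng(3)
Xtr = rng.uniform(0, 4, size=200)
ytr = np.where(Xtr < 2, 0.0, 1.0) + rng.normal(scale=0.05, size=200)   # sharp non-linearity at x = 2

x_query = 2.01                                    # just past the discontinuity
print("weighted k-NN:", round(knn_predict(Xtr, ytr, x_query), 3))
print("local linear :", round(local_linear_predict(Xtr, ytr, x_query), 3))
```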
34. Dynamic Recursive Model Class Selection for Classifier Construction

Before applying an automated model selection procedure, one must first choose the class (or family) of models from which the model will be selected. If there is no prior knowledge about the data that indicates the best class of models, then the choice is difficult at best. In this chapter, we present an approach to automating this step in classifier construction. In addition to searching for the best model, our approach searches for the best model class using a heuristic search strategy that finds the best model class for each recursive call of a divide-and-conquer tree induction algorithm. The end result is a hybrid tree-structured classifier, which allows different subspaces of a data set to be fit by models from different model classes. During search for the best model, the method considers whether and why a model class is a poor choice, and selects a better class on that basis. We describe an implementation of the approach, the MCS system, and present experimental results illustrating the system’s ability to identify the best model (and model class) efficiently.

Carla E. Brodley, Paul E. Utgoff
35. Minimizing the expected costs of classifying patterns by sequential costly inspections

In many applications, an expert classifier system has access to statistical information about the prior probabilities of the different classes. Such information should in principle reduce the amount of additional information that the system must collect, e.g., from answers to questions, before it can make a correct classification. This paper examines how to make best use of such prior statistical information, sequentially updated by collection of additional costly information, to optimally reduce uncertainty about the class to which a case belongs, thus minimizing the cost or effort required to correctly classify new cases. Two approaches are introduced, one motivated by information theory and the other based on the idea of trying to prove class membership as efficiently as possible. It is shown that, while the general problem of cost-effective classification is NP-hard, both heuristics provide useful approximations on small to moderate sized problems. Moreover, a hybrid heuristic that chooses which approach to apply based on the characteristics of the classification problem (entropy of the class probability distribution and coefficient of variation of information collection costs) appears to give excellent results. The results of initial computational experiments are summarized in support of these conclusions.

Louis Anthony Cox Jr., Yuping Qiu
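The sketch below illustrates the information-theoretic heuristic described above in its simplest greedy form: choose the next costly test by expected entropy reduction per unit cost. The class priors, test-outcome likelihoods, and costs are invented for illustration and are not taken from the chapter.

```python
# Greedy selection of the next test by expected information gain per unit cost.
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def posterior(prior, lik, outcome):
    post = prior * lik[:, outcome]
    return post / post.sum()

prior = np.array([0.5, 0.3, 0.2])                  # P(class) for three classes

# tests[name] = (cost, likelihood matrix P(outcome | class), shape n_classes x n_outcomes)
tests = {
    "cheap_test": (1.0, np.array([[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]])),
    "pricey_test": (5.0, np.array([[0.99, 0.01], [0.05, 0.95], [0.5, 0.5]])),
}

def gain_per_cost(prior, cost, lik):
    p_outcome = prior @ lik                        # predictive distribution of the outcome
    expected_post_entropy = sum(
        p_outcome[o] * entropy(posterior(prior, lik, o)) for o in range(lik.shape[1])
    )
    return (entropy(prior) - expected_post_entropy) / cost

scores = {name: gain_per_cost(prior, c, lik) for name, (c, lik) in tests.items()}
print("next test to perform:", max(scores, key=scores.get), scores)
```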
36. Combining a knowledge-based system and a clustering method for a construction of models in ill-structured domains

Standard statistical methods usually ignore the additional information that an expert has about the domain structure. Direct treatment of symbolic information is not a very common characteristic of statistical systems. KLASS is a statistical clustering system that allows both quantitative and qualitative variables in the domain description. The user may also declare part of his or her knowledge about the domain structure. The system is especially useful when dealing with ill-structured domains (i.e., domains where the consensus among experts is weak, such as mental diseases, sea sponges, books, painters…). That is why it is also useful from the artificial intelligence point of view. The output is a partition of the target domain. Conceptual and extensional descriptions of the classes can also be obtained.

Karina Gibert, Ulises Cortés
37. Clustering of Symbolically Described Events for Prediction of Numeric Attributes

This chapter describes a new conceptual clustering system capable of constructing classes for the prediction of a single numeric attribute. The clustering for single numeric attribute prediction (CSNAP) system clusters data described by a variety of symbolic attributes. The system trades off the accuracy of the predicted values against the clarity of the resulting descriptions, producing classes which are both predictive and meaningful to a human observer. CSNAP has been used to develop classes for time-based events and has demonstrated an ability to learn complex cyclic patterns.

Bradley L. Whitehall, David J. Sirag Jr.
38. Symbolic Classifiers: Conditions to Have Good Accuracy Performance

Symbolic classifiers from Artificial Intelligence compete with those from the established and emerging fields of statistics and neural networks. The traditional view is that symbolic classifiers are attractive because they are easier to use, are faster, and produce human-understandable rules. However, as this paper shows through a comparison of fourteen established state-of-the-art symbolic, statistical, and neural classifiers on eight large real-world problems, symbolic classifiers also have superior, or at least comparable, accuracy performance when the characteristics of the data suit them. These data characteristics are measured using a set of statistical and qualitative descriptors first proposed in the present (and related) work. This has implications for algorithm users and method designers, in that the strengths of various algorithms can be exploited in applications, and in that superior features of other algorithms can be incorporated into existing ones.

C. Feng, R. King, A. Sutherland, S. Muggleton, R. Henery

Regression and Other Statistical Models

Frontmatter
39. Statistical and neural network techniques for nonparametric regression

The problem of estimating an unknown function from a finite number of noisy data points has fundamental importance for many applications. Recently, several computational techniques for non-parametric regression have been proposed by statisticians and by researchers in artificial neural networks, but there is little interaction between the two communities. This paper presents a common taxonomy of statistical and neural network methods for nonparametric regression. Performance of many methods critically depends on the strategy for positioning knots along the regression surface. A novel method for adaptive positioning of knots called Constrained Topological Mapping (CTM) is discussed in some detail. This method achieves adaptive placement of knots of the regression surface by using a neural network model of self-organization.

Vladimir Cherkassky, Filip Mulier
40. Multicollinearity: A tale of two nonparametric regressions

The most popular form of artificial neural network, feedforward networks with sigmoidal activation functions, and a new statistical technique, multivariate adaptive regression splines (MARS), can both be classified as nonlinear, nonparametric function estimation techniques, and both show great promise for fitting general nonlinear multivariate functions. In comparing the two methods on a variety of test problems, we find that MARS is in many cases both more accurate and much faster than neural networks. In addition, MARS is interpretable due to the choice of basis functions which make up the final predictive equation. This suggests that MARS could be used in many of the applications where neural networks are currently being used. However, MARS exhibits problems in choosing among predictor variables when multicollinearity is present. Due to their redundant architecture, neural networks do not share this problem and are better able to predict in this situation. To improve the ability of MARS to deal with multicollinearity, we first use principal components to reduce the dimensionality of the input variables before invoking MARS. Using data from a polymer production run, we find that the resulting model retains the interpretability and improves the accuracy of MARS in the multicollinear setting.

Richard D. De Veaux, Lyle H. Ungar
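A minimal sketch of the preprocessing step mentioned above: project nearly collinear inputs onto their leading principal components before fitting a regression. For brevity an ordinary linear fit stands in for MARS, and the simulated data are an assumption made here for illustration.

```python
# Principal-component preprocessing to handle multicollinearity before regression.
import numpy as np

rng = np.random.default_rng(4)
n = 200
t = rng.normal(size=n)
X = np.column_stack([t, t + 0.01 * rng.normal(size=n), rng.normal(size=n)])  # first two columns nearly collinear
y = 3.0 * t + X[:, 2] + rng.normal(scale=0.1, size=n)

# Principal components of the centered inputs.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)
n_comp = int(np.searchsorted(np.cumsum(explained), 0.99) + 1)   # keep ~99% of the variance
Z = Xc @ Vt[:n_comp].T                                          # scores on the retained components

# Fit the regression on the component scores (MARS would be used here in the chapter's setting).
beta, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), Z]), y, rcond=None)
print("components kept:", n_comp, "coefficients on components:", np.round(beta[1:], 3))
```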
41. Choice of Order in Regression Strategy

Regression analysis is viewed as a search through model space using data-analytic functions. The desired models should satisfy several requirements: unimportant variables should be excluded, outliers identified, and so on. The methods of regression data analysis, such as variable selection, transformation and outlier detection, that address these concerns are characterized as functions acting on regression models and returning regression models. A model that is unchanged by the application of any of these methods is considered acceptable. A method for the generation of all acceptable models supported by all possible orderings of the choice of regression data analysis methods is described, with a view to determining whether two statisticians may reasonably hold differing views on the same data. The consideration of all possible orders of analysis generates a directed graph in which the vertices are regression models and the arcs are data-analytic methods. The structure of the graph is of statistical interest. The ideas are demonstrated using a LISP-based analysis package. The methods described are not intended for the entirely automatic analysis of data, but rather to assist the statistician in examining regression data at a strategic level.

Julian J. Faraway
42. Modelling response models in software

We describe our software design and implementation of a wide variety of response models, which model the values of a response variable as an interpretable function of explanatory variables. A distinguishing characteristic of our approach is the attention given to building software abstractions which closely mimic their statistical counterparts.

D. G. Anglin, R. W. Oldford
43. Principal components and model selection

Let the kp-variate random vector X be partitioned into k subvectors Xᵢ of dimension p each, and let the covariance matrix Ψ of X be partitioned analogously into submatrices Ψᵢⱼ. Based on principal component analysis we suggest a hierarchy of models, where the lowest level assumes independence of all Xᵢ, with identical covariance matrices, and the highest level makes no assumptions about Ψ beyond positive definiteness. The intermediate levels are characterized by a common orthogonal matrix β which diagonalizes Ψᵢⱼ for all pairs (i, j), i.e., Ψᵢⱼ = βΛᵢⱼβ′, where Λᵢⱼ is diagonal. The hierarchy is motivated by both a practical example and theoretical arguments. Model selection based on likelihood ratio tests and information criteria is discussed.

Beat E. Neuenschwander, Bernard D. Flury

Algorithms and Tools

Frontmatter
44. Algorithmic speedups in growing classification trees by using an additive split criterion

We propose a new split criterion to be used in building classification trees. This criterion, called weighted accuracy or wacc, has the advantage that it allows the use of divide-and-conquer algorithms when minimizing the split criterion. This is useful when more complex split families, such as intervals, corners, and rectangles, are considered. The split criterion is derived to imitate the Gini function as closely as possible by comparing preference regions for the two functions. The wacc function is evaluated in a large empirical comparison and is found to be competitive with the traditionally used functions.

David Lubinsky
45. Markov Chain Monte Carlo Methods for Hierarchical Bayesian Expert Systems

In a hierarchical Bayesian expert system, the probabilities relating the variables are not known precisely; rather, imprecise knowledge of these probabilities is described by placing prior distributions on them. After obtaining data, one would like to update those distributions to reflect the new information gained; however, this can prove difficult computationally if the observed data are incomplete. This paper describes a way around these difficulties—use of Markov chain Monte Carlo methods.

Jeremy C. York, David Madigan
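A minimal data-augmentation Gibbs sampler, shown as a stand-in for the chapter's setting: a Beta prior on an unknown probability, Bernoulli observations with some values missing, and alternation between imputing the missing values and updating the parameter. The model is deliberately far simpler than a hierarchical expert system; it only illustrates the "Markov chain Monte Carlo with incomplete data" idea.

```python
# Gibbs sampling with data augmentation for a Beta-Bernoulli model with missing observations.
import numpy as np

rng = np.random.default_rng(5)
theta_true = 0.7
x = rng.binomial(1, theta_true, size=50).astype(float)
x[rng.random(50) < 0.3] = np.nan                 # roughly 30% of the data unobserved

a, b = 1.0, 1.0                                  # Beta(1, 1) prior on theta
missing = np.isnan(x)
theta = 0.5
draws = []
for it in range(5000):
    # 1. Impute the missing observations given the current parameter value.
    x[missing] = rng.binomial(1, theta, size=missing.sum())
    # 2. Update the parameter given the completed data (Beta-Bernoulli conjugacy).
    theta = rng.beta(a + x.sum(), b + len(x) - x.sum())
    if it >= 1000:                               # discard burn-in
        draws.append(theta)

print("posterior mean of theta:", round(float(np.mean(draws)), 3))
```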
46. Simulated annealing in the construction of near-optimal decision trees

The application of simulated annealing to the optimization of decision trees is investigated. An efficient perturbation procedure is described and used as the basis of the Simulated Annealing Classifier System or SACS algorithm. We show that the algorithm is asymptotically convergent for any choice of global cost function. The algorithm is then illustrated, using the Minimum Description Length Principle as cost function, by applying it to several problems involving both noisy and noise-free data.

James F. Lutsko, Bart Kuijpers
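A generic simulated-annealing skeleton with Metropolis acceptance and geometric cooling, shown on a toy subset-selection cost. The perturbation operator and cost function are placeholders standing in for the chapter's tree perturbations and Minimum Description Length cost.

```python
# Simulated annealing with the Metropolis acceptance criterion on a toy combinatorial problem.
import math
import random

random.seed(0)
weights = [random.uniform(-1, 1) for _ in range(40)]

def cost(state):
    # Toy cost: negative sum of selected weights, so good states select the positive weights.
    return -sum(w for w, used in zip(weights, state) if used)

def perturb(state):
    s = list(state)
    i = random.randrange(len(s))
    s[i] = 1 - s[i]                               # flip one inclusion bit
    return s

state = [0] * len(weights)
current = cost(state)
T = 1.0
while T > 1e-3:
    for _ in range(100):
        candidate = perturb(state)
        delta = cost(candidate) - current
        # Metropolis criterion: always accept improvements, sometimes accept worse moves.
        if delta <= 0 or random.random() < math.exp(-delta / T):
            state, current = candidate, current + delta
    T *= 0.95                                     # geometric cooling schedule

print("final cost:", round(current, 3), "items selected:", sum(state))
```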
47. SA/GA: Survival of the Fittest in Alaska

SA/GA is a genetic-algorithm-based approach to combinatorial optimization. It differs from ordinary genetic algorithms in that the acceptance of offspring is subject to the Metropolis sampling criterion [Metropolis 53], which is the basis of simulated annealing. The idea is that of a population subject to a climate that is getting colder. On one hand SA/GA can be viewed as a genetic algorithm with a crossover probability depending on the quality of the children relative to a decreasing temperature. On the other hand SA/GA can be modeled as a Markov chain with temperature-dependent transition probabilities [van Laarhoven 87]. When the population size is set to one, the algorithm degenerates to a normal simulated annealing algorithm. We have tested the algorithm by applying it to a set of so-called genetic-algorithm-deceptive problems which are specially constructed to cause genetic algorithms to converge to sub-optimal results. SA/GA is not fooled by the deception and consistently finds the optimal solution. SA/GA also outperforms pure simulated annealing while being less sensitive to parameter tuning.

Kris Dockx, James F. Lutsko
48. A Tool for Model Generation and Knowledge Acquisition

Tools to automatically induce domain descriptions from examples are valuable aids for the knowledge acquisition stage of expert system construction. This paper presents a description of an algorithm that induces two domain descriptions: a conceptual model, which gives a broad understanding of variable interactions and their effect on the system, and a predictive model, which determines the system value associated with variable values input to the model. This induction algorithm is based on Entropy Data Analysis (EDA), which builds linear equations to approximate the system described by the training data.

Sally Jo Cunningham, Paul Denize
49. Using knowledge-assisted discriminant analysis to generate new comparative terms

In this paper we present a method — knowledge-assisted discriminant analysis — to generate new comparative terms on symbolic and numeric attributes. The method has been implemented and tested on three real-world databases: mushroom classification, letter recognition, and liver disorder diagnosis. The experimental results show that combining AI and statistical techniques is an effective and efficient way to enhance a machine learning system's concept description language in order to learn simple and comprehensible rules.

Bing Leng, Bruce G. Buchanan
Metadata
Title
Selecting Models from Data
Edited by
P. Cheeseman
R. W. Oldford
Copyright year
1994
Publisher
Springer New York
Electronic ISBN
978-1-4612-2660-4
Print ISBN
978-0-387-94281-0
DOI
https://doi.org/10.1007/978-1-4612-2660-4