
About this Book

COMPSTAT symposia have been held regularly since 1974, when they started in Vienna. This tradition has made COMPSTAT a major forum for the interplay of statistics and computer science, with contributions from many well-known scientists from all over the world. The scientific programme of COMPSTAT '96 covers all aspects of this interplay, from user experiences and the evaluation of software to the development and implementation of new statistical ideas. All papers presented belong to one of the following three categories:

- Statistical methods (preferably new ones) that require a substantial use of computing;
- Computer environments, tools and software useful in statistics;
- Applications of computational statistics in areas of substantial interest (environment, health, industry, biometrics, etc.).



Keynote Papers


Scientific Statistics, Teaching, Learning and the Computer

An important issue in the 1930’s was whether statistics was to be treated as a branch of Science or of Mathematics. To my mind unfortunately, the latter view has been adopted in the United States and in many other countries. Statistics has for some time been categorized as one of the Mathematical Sciences and this view has dominated university teaching, research, the awarding of advanced degrees, promotion, tenure of faculty and the distribution of grants by funding agencies. All this has, I believe, greatly limited the value and distorted the development of our subject. A “worst case” scenario of some of its consequences is illustrated in the flow diagram in Figure 1.

George Box

Trends in the Information Technologies Markets - The Future

This presentation deals with Information Technologies and their Markets. More specifically, the Computer Industry is described in global terms and its evolution over the years is briefly presented. Recent technological developments in hardware and software, as well as discernible trends, are discussed. Attention is paid to technological developments and trends in Semiconductors, Large Computers, Workstations, Small Computers, as well as in Software. It is emphasized that successful companies in these industries are those that develop and exploit technology for products and services, but also have clear and visionary strategies in Marketing. The presentation anticipates the future by extrapolating lasting trends and expected technological developments.

Angel G. Jordan

Invited Papers


Robust Procedures for Regression Models with ARIMA Errors

A robust method for estimating the parameters of a regression model with ARIMA errors is presented. The estimates are defined by the minimization of a conveniently robustified likelihood function. This robustification is achieved by replacing in the reduced form of the Gaussian likelihood function the mean square error of the standardized residuals by the square of a robust scale estimate of standardized filtered residuals. The robust filtering procedure avoids the propagation of the effect of one outlier on several subsequent residuals. Computational aspects of these estimates are discussed and the results of a Monte Carlo study are presented.

A. M. Bianco, E. J. Martinez, M. Garcia Ben, V. J. Yohai

Functional Imaging Analysis Software — Computational Olio

Magnetic resonance imaging (MRI) is a modern technique for producing pictures of the internals of the human body. An MR scanner subjects its contents to carefully modulated electro-magnetic fields and records the resulting radio signal. The radio signal is the Fourier transform of the density of (for example) hydrogen atoms. Computing the inverse Fourier transform of the digitized signal reveals an image of the (hydrogen density of the) contents of the scanner. Functional MRI (fMRI) is a very recent development in which MRI is used to produce images of the human brain which show regions of activation reflecting the functioning of the brain.

William F. Eddy, Mark Fitzgerald, Christopher Genovese, Audris Mockus, Douglas C. Noll
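The reconstruction principle described above can be sketched in a few lines of NumPy. The phantom image and its size are invented for illustration, and the noise-free k-space signal is an idealization; this is not the authors' software:

```python
import numpy as np

# Hypothetical proton-density "phantom" image (illustrative, not real data).
truth = np.zeros((64, 64))
truth[24:40, 24:40] = 1.0          # a bright square

# The scanner records (ideally) the 2-D Fourier transform of the image...
kspace = np.fft.fft2(truth)

# ...so the inverse FFT of the digitized signal reveals the image.
recon = np.fft.ifft2(kspace).real

print(np.allclose(recon, truth))   # → True
```

In practice the recorded signal is noisy and incompletely sampled, which is where the statistical analysis begins.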

Automatic Modelling of Daily Series of Economic Activity

Daily series of economic activity have not been the object of as rigorous a study as financial series. Nevertheless, the possibility of having adequate models available at a reasonable cost would give companies and institutions powerful management tools. On the other hand, the peculiarities that these series show advise a specific treatment, differentiated from that of series with a higher level of time aggregation. In this article this problem is illustrated and an automatic methodology for the analysis of such series is proposed.

Antoni Espasa, J. Manuel Revuelta, J. Ramón Cancelo

New Methods for Quantitative Analysis of Short-Term Economic Activity

We concern ourselves with statistical treatment of economic time-series data used in short-term economic policy, control and monitoring. Although other frequencies are possible, our attention centers on monthly (also quarterly) series. The statistical treatment we have in mind includes short-term forecasting, seasonal adjustment, estimation of the trend, estimation of the business cycle, estimation of special effects and removal of outliers, perhaps for a large number of series.

Víctor Gómez, Agustín Maravall

Classification and Computers: Shifting the Focus

The aim of this paper is to examine recent progress in supervised classification, sometimes called supervised pattern recognition, to look at changes in emphasis which are occurring, and to make recommendations for the focus of future research effort. In particular, I suggest that effort should now be shifted away from the minutiae of improving the performance of classification rules, as measured by, for example, error rate, and should, instead be focused on a deeper understanding of the problem domains and a better matching of the methods to the problems. I illustrate with some examples to support this suggestion.

David J. Hand

Image Processing, Markov Chain Approach

A survey of methods in probabilistic image processing based on Markov Chain Monte Carlo is presented. An example concerning the problem of texture segmentation is included.

Martin Janžura

A Study of E-optimal Designs for Polynomial Regression

The present paper is devoted to studying E-optimal experimental designs for polynomial regression models on arbitrary or symmetrical segments. A number of papers (Kovrigin, 1979; Heiligers, 1991; Pukelsheim, Studden, 1993) were devoted to particular cases in which the minimal eigenvalue of the E-optimal design information matrix has multiplicity one. In these cases the points of the E-optimal design can be calculated directly from the extremal points of the Tchebysheff polynomial. Here a review of results from (Melas, 1995a, b, 1996; Melas, Krylova, 1996) is given. These results relate mainly to the study of the dependence of E-optimal design points and weights on the length of the segment, which is assumed to be symmetrical. In addition, a number of results for the case of arbitrary segments are given.

V. B. Melas

From Fourier to Wavelet Analysis of Time Series

It is well known that Fourier analysis is suited to the analysis of stationary series. If {X_t, t = 0, ±1, …} is a weakly stationary process, it can be decomposed into a linear combination of sines and cosines. Formally,
$$X_t = \int_{-\pi}^{\pi} e^{i\lambda t}\, dZ(\lambda), \tag{1.1}$$
where Z(λ), −π ≤ λ ≤ π, is an orthogonal process. Moreover,
$$\mathrm{Var}\{X_t\} = \int_{-\pi}^{\pi} dF(\lambda), \tag{1.2}$$
with E|dZ(λ)|² = dF(λ). F(λ) is the spectral distribution function of the process. In the case that dF(λ) = f(λ)dλ, f(λ) is the spectral density function, or simply the (second-order) spectrum, of X_t. Relation (1.2) tells us that the variance of a time series is decomposed into a number of components, each one associated with a particular frequency. This is the basic idea in the Fourier analysis of stationary time series. Some references are Brillinger (1975) and Brockwell and Davis (1991).

Pedro A. Morettin
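Relation (1.2) has a direct sample analogue: the periodogram allocates the sample variance of a series across the Fourier frequencies. A minimal NumPy sketch, with an invented toy series (a sinusoid plus white noise):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 512
t = np.arange(n)
# Toy stationary series: sinusoid at 0.1 cycles per sample, plus noise.
x = np.sin(2 * np.pi * 0.1 * t) + rng.normal(scale=0.5, size=n)
x = x - x.mean()

# Periodogram ordinates: the sample counterpart of dF(lambda) in (1.2).
I = np.abs(np.fft.fft(x)) ** 2 / n

# Parseval's relation: the ordinates sum back to the total variance,
# i.e. the variance decomposes frequency by frequency.
print(np.isclose(I.sum() / n, x.var()))   # → True
```

The largest ordinate sits at the Fourier frequency nearest 0.1 cycles per sample, which is how the decomposition localizes the variance of the sinusoidal component.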

Profile Methods

In this paper, we describe uses of profile methods in statistics. Our goal is to identify profiling as a useful task common to many statistical analyses going beyond simple normal approximations and to encourage its inclusion in standard statistical software. Therefore, our approach is broader than deep and, although we touch on a wide variety of areas of interest, we do not present fundamental research in any of them and we do not claim our use of profiles is optimal. Our contribution lies in the realization that profiling is a general task and that it can be automated to a large extent.

C. Ritter, D. M. Bates

A New Generation of a Statistical Computing Environment on the Net

With the availability of the net, a new generation of computing environments has to be designed for a wide range of statistical tasks, from data analysis to highly interactive operations. It must combine the flexibility of multi-window desktops with standard operations and interactive user-driven actions. It must be equally well suited for first-year students and for highly demanding researchers. Its design must have various degrees of flexibility that allow it to address different levels of user groups. We present here some ideas on how a new generation of computing environment can be used as a student front-end tool for teaching elementary statistics as well as a research device for highly computer-intensive tasks, e.g. for semiparametric analysis and bootstrapping.

Swetlana Schmelzer, Thomas Kötter, Sigbert Klinke, Wolfgang Härdle

On Multidimensional Nonparametric Regression

In this paper we concentrate on models based on dimensionality reduction principles, with special attention on some additive decompositions. A selective survey of theoretical results that are available for these models is presented with emphasis on the principle estimation techniques.

Philippe Vieu, Laurent Pelegrina, Pascal Sarda

Contributed Papers


Parallel Model Selection in Logistic Regression Analysis

In [Adèr 1994], a parallel implementation of the model search method of [Edwards, Havranek 1987] was given for the case of linear regression modeling. The results were promising.

H. J. Adèr, Joop Kuik, H. A. van Rossum

On a Weighted Principal Component Model to Forecast a Continuous Time Series

In many real-life situations, information about a continuous time series is given by discrete-time observations that are not always evenly spaced. Our purpose is to develop a forecasting model for such a time series avoiding some of the restrictive hypotheses imposed by classical approaches. If the original series x(t) is cut into periods of amplitude h (h > 0), then the following process is obtained by rescaling:
$$\left\{ X_w(t) = x\left( (w-1)h + t \right) : t \in [T,\, T+h];\; w = 1, 2, \ldots \right\}. \tag{1}$$
The forecasting model proposed in this paper is based on linear regression of the principal components (p.c.'s) associated with the process X(t) in the future against its p.c.'s in the past. This research was supported in part by Project PS94-0136 of DGICYT, Ministerio de Educación y Ciencia, Spain.

A. M. Aguilera, F. A. Ocaña, M. J. Valderrama
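With equally spaced data, the rescaling (1) simply cuts the observed series into consecutive blocks of length h, one "curve" per period, which amounts to a reshape. A tiny sketch with invented numbers (h = 7, four complete periods):

```python
import numpy as np

x = np.arange(28.0)      # hypothetical series: 4 complete periods of length 7
h = 7

# Row w-1 holds the w-th rescaled period X_w(t) = x((w-1)h + t).
X = x.reshape(-1, h)

print(X.shape)           # → (4, 7)
print(X[1, 0])           # X_2(0) = x(7) → 7.0
```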

Exact Iterative Computation of the Multivariate Minimum Volume Ellipsoid Estimator with a Branch and Bound Algorithm

In this paper we develop an exact iterative algorithm for the computation of the minimum volume ellipsoid (MVE) estimator that is more efficient than the algorithm of Cook, Hawkins and Weisberg (1993). Our algorithm is based on a branch and bound (BAB) technique and it is computationally feasible for small and moderate-sized samples.

José Agulló Candela

Automatic Segmentation by Decision Trees

We present a system for automatic segmentation by decision trees, able to cope with large data sets, with special attention to stability problems. Tree-based methods are statistical procedures for automatic learning from data; their main characteristic is the simplicity of the results obtained. They use a recursive algorithm which can be very costly for large data sets and which is very dependent on the data, since small fluctuations in the data may cause a big change in the tree-growing process. Our first purpose has been to define data diagnostics to prevent internal instability in the tree-growing process before a particular split is made. We then study the complexity of the algorithm and its applicability to large data sets.

Tomàs Aluja-Banet, Eduard Nafria

Karhunen-Loève and Wavelet Approximations to the Inverse Problem

Investigation into the behaviour of certain physical phenomena frequently leads to the study of integral equations relating two random fields. The inverse problem of estimating the input random field from the output random field data may then be considered. In Hydrology, for example, the log-transmissivity and piezometric head random fields are related by a stochastic integral equation derived as an approximation of the non-linear aquifer flow equation modelling the relationship between these two random fields. In this context, several authors have recently studied different approaches to the inverse problem of transmissivity estimation from piezometric data (Kitanidis & Vomvoris, 1983; Dagan, 1985; Kuiper, 1986; Rubin & Dagan, 1988; Dietrich & Newsam, 1989).

J. M. Angulo, M. D. Ruiz-Medina

Bootstrapping Uncertainty in Image Analysis

This paper applies the bootstrap (Efron & Tibshirani, 1993) to problems in blind image restoration; i.e., an estimate has to be made of an image, using noisy, blurred data and a priori assumptions about the truth. These assumptions are made in the form of stochastic models, which themselves contain parameters that have to be estimated before an image restoration is performed.

Graeme Archer, Karen Chan

BASS: Bayesian Analyzer of Event Sequences

We describe the BASS system, a Bayesian analyzer of event sequences. BASS uses Markov chain Monte Carlo methods, especially the Metropolis-Hastings algorithm, for exploring posterior distributions. The system allows the user to specify an intensity model in a high-level definition language, and then runs the Metropolis-Hastings algorithm on it.

E. Arjas, H. Mannila, M. Salmenkivi, R. Suramo, H. Toivonen
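As a flavour of the engine such a system runs on, here is a minimal random-walk Metropolis-Hastings sampler in Python. The one-dimensional target and all tuning constants are invented for illustration and have nothing to do with the BASS code itself:

```python
import numpy as np

def metropolis(logpost, x0, n, step, seed=0):
    """Random-walk Metropolis-Hastings: propose x + step*N(0,1) and
    accept with probability min(1, posterior ratio)."""
    rng = np.random.default_rng(seed)
    x, lp = x0, logpost(x0)
    out = np.empty(n)
    for i in range(n):
        y = x + step * rng.normal()
        lp_y = logpost(y)
        if np.log(rng.uniform()) < lp_y - lp:   # accept/reject
            x, lp = y, lp_y
        out[i] = x                              # repeat x on rejection
    return out

# Toy target: standard normal log-density (up to an additive constant).
draws = metropolis(lambda z: -0.5 * z * z, 0.0, 50_000, step=2.4)
print(round(draws.mean(), 1), round(draws.std(), 1))
```

The sample mean and standard deviation of the draws settle near 0 and 1, the moments of the target.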

Assessing Sample Variability in the Visualization Techniques Related to Principal Component Analysis: Bootstrap and Alternative Simulation Methods

The bootstrap, a distribution-free resampling technique (Efron, 1979), is frequently used to assess the variance of estimators or to produce tolerance areas on visualization diagrams derived from principal axes techniques (correspondence analysis (CA), principal component analysis (PCA)). Gifi (1981), Meulman (1982) and Greenacre (1984) have done pioneering work in the context of two-way or multiple correspondence analysis. In the case of principal component analysis, Diaconis and Efron (1983), Holmes (1985, 1989), Stauffer et al. (1985) and Daudin et al. (1988) have addressed the problem of the choice of the relevant number of axes, and have proposed confidence intervals for points in the subspace spanned by the principal axes. These parameters are computed after the realization of each replicated sample, and involve constraints that depend on these samples. Several procedures have been proposed to overcome these difficulties: partial replications using supplementary elements (Greenacre), use of a three-way analysis to process simultaneously the whole set of replications (Holmes), and filtering techniques involving reordering of axes and procrustean rotations (Milan and Whittaker, 1995).

Frederic Chateau, Ludovic Lebart
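A minimal version of the basic resampling scheme, bootstrapping the eigenvalues of a PCA; the data and all constants are invented, and the sketch deliberately ignores the axis-reordering and rotation difficulties that the procedures above are designed to handle:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 200 observations on 4 correlated variables.
n = 200
A = np.array([[2.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.0], [0.0, 0.0, 0.0, 0.2]])
X = rng.normal(size=(n, 4)) @ A

def eigvals(data):
    # Eigenvalues of the sample covariance matrix, largest first.
    return np.sort(np.linalg.eigvalsh(np.cov(data, rowvar=False)))[::-1]

# Naive bootstrap: resample rows with replacement, redo the PCA each time.
B = 500
boot = np.array([eigvals(X[rng.integers(0, n, n)]) for _ in range(B)])

# Percentile interval for the leading eigenvalue.
lo, hi = np.percentile(boot[:, 0], [2.5, 97.5])
print(lo < hi)
```

Each row of `boot` is one replication; the percentile interval summarizes the sampling variability of the leading principal axis's variance.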

A Fast Algorithm for Robust Principal Components Based on Projection Pursuit

One of the aims of a principal component analysis (PCA) is to reduce the dimensionality of a collection of observations. If we plot the first two principal components of the observations, it is often the case that one can already detect the main structure of the data. Another aim is to detect atypical observations in a graphical way, by looking at outlying observations on the principal axes.

C. Croux, A. Ruiz-Gazen

Hybrid System: Neural Networks and Genetic Algorithms Applied in Nonlinear Regression and Time Series Forecasting

Many authors try to combine the statistical techniques of linear and nonlinear regression with the connectionist approach. This is a way to incorporate neural network theory in order to build an automatic modeling tool. We introduce a method to test the results and a heuristic to stop the learning process when the best model has been found. To find the best structure for the neural network, a genetic algorithm is used. This algorithm determines the activation functions and the number of hidden units needed in the model. Some of the results obtained can be applied in univariate time series analysis. The genetic algorithm provides the required inputs to the neural network, corresponding to the observations that need to be forecast, that is, the dimension of the time-delay space. In nonlinear series, where traditional linear modelling fails, this method can be useful.

A. Delgado, K. Sanjeevan, I. Sole, L. Puigjaner

Do Parametric Yield Estimates Beat Monte Carlo?

Simulation models are playing an increasing role in industry and often form the basis for product design and optimization. They are used, among other things, for yield computation: the computation of the proportion of products that satisfy imposed quality requirements, given the natural variations inherent in the manufacturing process. In mathematical terms: a function q(x_1, …, x_p) is given which determines product quality q as a function of process parameters x_1, …, x_p. Products with quality l < q < u are acceptable; others are scrapped. Furthermore, the random variation in the process parameters (x_1, …, x_p) is described by a continuous distribution F. Then the yield y is given by
$$y = \int_{l < q(x_1,\ldots,x_p) < u} dF(x_1,\ldots,x_p) = \int_{[-\infty,\infty]^p} \mathbf{1}\left\{ l < q(x_1,\ldots,x_p) < u \right\}\, dF(x_1,\ldots,x_p).$$
So we are concerned with numerical integration of an indicator function multiplied by a density in p-dimensional space, say p = 5 to 100. This is in contrast with current numerical integration work aiming at the integration of a smooth integrand to compute the posterior distribution in a Bayesian setting; see, for example, Flournoy and Tsutakawa, 1991. Moreover, the integration of a discontinuous indicator function invalidates error bounds in quasi-Monte Carlo integration (see, for example, Niederreiter, 1992), as the indicator is generally of infinite variation.

Dee Denteneer, Ludolf Meester
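For comparison, the plain Monte Carlo estimator the title alludes to is just the sample mean of the indicator of acceptance. A sketch with an invented quality function q and standard normal process variation F:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented example: q is the squared radius in p = 5 process parameters,
# with acceptance limits l < q < u; F is standard normal variation.
def q(x):
    return np.sum(x ** 2, axis=1)

p, l, u = 5, 0.0, 5.0
n = 100_000
x = rng.normal(size=(n, p))

# Yield = E[1{l < q < u}], estimated by the mean of the indicator.
accept = (l < q(x)) & (q(x) < u)
yield_hat = accept.mean()
se = accept.std(ddof=1) / np.sqrt(n)    # root-n Monte Carlo error

print(round(yield_hat, 2), se < 0.01)
```

The root-n error rate is dimension-free, which is why plain Monte Carlo remains a serious competitor when p runs from 5 to 100.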

Testing Convexity

In a nonparametric regression framework, we present a procedure to test the null hypothesis that the regression function is not strictly convex. The empirical power of the test is evaluated by simulation.

Cheikh A. T. Diack

Zonoid Data Depth: Theory and Computation

A new notion of data depth in d-space is presented, called the zonoid data depth. It is affine equivariant and has useful continuity and monotonicity properties. An efficient algorithm is developed that calculates the depth of a given point with respect to a d-variate empirical distribution.

Rainer Dyckerhoff, Karl Mosler, Gleb Koshevoy

PADOX, A Personal Assistant for Experimental Design

This paper focuses on incorporating recent trends in human-computer interaction into statistical applications. In particular, the paper describes the design and development of a prototype called PADOX that has its roots in DOX, presented at COMPSTAT '92.

Ernest Edmonds, Jesús Lorés, Josep Maria Catot, Georgios Illiadis, Assumpció Folguera

Computing M-estimates

We consider a linear regression model $$y = X\beta + \varepsilon $$ where y is a response vector, X is an n×p design matrix of rank p, and ε is a vector of i.i.d. random errors.

Håkan Ekblom, Hans Bruun Nielsen
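One standard way to compute such an M-estimate is iteratively reweighted least squares with Huber weights. This is a generic sketch (the tuning constant, toy data and outliers are all invented), not necessarily one of the algorithms studied in the paper:

```python
import numpy as np

def huber_irls(X, y, c=1.345, tol=1e-8, max_iter=200):
    # Huber M-estimate of beta via iteratively reweighted least squares.
    beta = np.linalg.lstsq(X, y, rcond=None)[0]                   # LS start
    for _ in range(max_iter):
        r = y - X @ beta
        s = np.median(np.abs(r - np.median(r))) / 0.6745 + 1e-12  # MAD scale
        w = np.minimum(1.0, c * s / np.maximum(np.abs(r), 1e-12)) # Huber weights
        sw = np.sqrt(w)
        beta_new = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.1, size=n)
y[:10] += 20.0                       # ten gross outliers
beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
beta_m = huber_irls(X, y)
print(np.round(beta_m, 2))
```

With 5% gross outliers the ordinary LS intercept is pulled far off target, while the M-estimate stays close to the true coefficients (1, 2).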

Survival Analysis with Measurement Error on Covariates

In survival analysis it is typical to assess the effect of covariates on a duration variable T. Even though the standard methodology assumes that covariates are free from measurement error, this assumption is often violated in practice. The presence of measurement error may alter the usual properties of the standard estimators of regression coefficients. In the present paper we first show, using Monte Carlo methods, that measurement error in covariates induces bias in the usual regression estimators. We then outline an estimation procedure that corrects for the presence of measurement error. Monte Carlo data are then used to assess the performance of the proposed alternative estimators.

Anna Espinal-Berenguer, Albert Satorra

Partial Imputation Method in the EM Algorithm

The expectation-maximization (EM) algorithm is a general iterative algorithm for maximum-likelihood estimation (MLE) in incomplete-data problems. Dempster, Laird and Rubin (1977, henceforth DLR) showed that convergence is linear, with rate proportional to the ratio of the missing information to the complete information. When a large proportion of the data are missing, the speed of convergence can be very slow.

Z. Geng, F. Tao, K. Wan, Ch. Asano, M. Ichimura, M. Kuroda
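DLR's classic multinomial linkage example makes the EM iteration and its linear convergence concrete; the counts below are the standard textbook data from that paper, and the MLE of θ is about 0.6268:

```python
# DLR's multinomial example: counts y from cells with probabilities
# (1/2 + t/4, (1-t)/4, (1-t)/4, t/4). The split of the first cell into
# its 1/2 and t/4 parts is the "missing" data.
y = (125, 18, 20, 34)

t = 0.5                                  # starting value
for _ in range(50):
    e1 = y[0] * t / (2 + t)              # E-step: expected t-part of cell 1
    t = (e1 + y[3]) / (e1 + y[1] + y[2] + y[3])   # M-step: complete-data MLE
print(round(t, 4))                       # → 0.6268
```

Each iteration shrinks the error by a constant factor (the missing-information ratio), which is exactly the linear convergence DLR describe.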

On the Uses and Costs of Rule-Based Classification

Classification in ill-structured domains is well known to be a hard problem for current statistical and artificial intelligence techniques. Rule-based clustering is a new approach that combines statistical algorithms with some inductive learning elements in order to overcome the limitations of both Statistics and Artificial Intelligence in managing ill-structured domains. In this paper, a discussion of the cost of this new methodology is also presented.

Karina Gibert

Small Sequential Designs that Stay Close to a Target

Assume that we can control an input x_n ∈ C ⊂ R and observe an output y_n such that y_n(x_n) = f(x_n) + ε_n, where the ε_n are independent with E(ε_n) = 0 and Var(ε_n) = σ². The function E(y|x) = f(x) is unknown and continuous, and in our examples it will belong to a parametric family f(x; β), known up to a vector β of unknown parameters with a prior distribution G₀(β). We assume that there is a unique θ such that f(θ) = T, where T is a known target and f′(θ) > 0. By subtracting T from each y_n(x_n) we can take T to be 0. There is a rich literature on sequential designs for estimating the root θ of an unknown equation evaluated with noise; good references are Wu (1986) and Frees and Ruppert (1990). This paper adapts some of these designs to problems that involve a small number of observations N, and covers the whole range from the purely root-estimation problem to stochastic control problems where the goal is to keep the N responses as close to T as possible.

Josep Ginebra

Statistical Classification Methods for Protein Fold Class Prediction

Prediction of the structure of proteins from the amino acid sequence alone has been a challenge to biochemistry and biophysics since the early 1970s. Despite years of research and the enormous development of experimental techniques like X-ray crystallography and NMR, there remains a growing gap between the exploding number of known protein sequences and the slowly increasing number of corresponding known three-dimensional structures.

Janet Grassmann, Lutz Edler

Restoration of Blurred Images when Blur is Incompletely Specified

Work in restoration of blurred images often assumes the form and extent of blurring to be known. In practice it may not be known or may not be easily quantified. Chan and Gray (1996) and Gray and Chan (1995) studied the effects of misspecifying the degree and/or form of blur in image regularization. This paper will consider the situation where these are not assumed known but are estimated as part of a restoration procedure. We describe several different simultaneous estimation-restoration algorithms, namely an extension of Green’s application of his One Step Late (OSL) approximation, for penalized maximum likelihood estimation, to the EM algorithm (Green 1990, 1993), an extension of quadratic image regularization, and an extension of a Bayesian method of Archer and Titterington (1995) which can be optimized either directly or by simulated annealing using the Gibbs sampler (Geman and Geman, 1984). Performance will be compared empirically by means of a simulation study.

Alison J. Gray, Karen P.-S. Chan

Loglinear Random Effect Models for Capture-Recapture Assessment of Completeness of Registration

The usefulness of a population-based cancer registry depends to a large extent on the completeness of registration, i.e. the degree to which reportable cases of cancer in the population of interest are actually detected and recorded in the registry (Wittes, 1974). Since most cancer registries use multiple data sources in their input process, and since there are no standard procedures for assessing completeness, capture-recapture methods represent a valid alternative to other methods for estimating the quality of registration. The main idea is to mimic what happens in the estimation of animal abundance, in which animals are caught several times and classified according to their presence on each occasion, with data sources standing for catches.

D. Gregori, L. Di Consiglio, P. Peruzzo

Estimation of First Contact Distribution Functions for Spatial Patterns in S-PLUS

Important tools in the exploratory analysis of random patterns are the so-called first contact distribution functions. These functions give the distribution of first contact for increasing test sets contained in the void, and thereby provide important information on the "pore" space between particles. An introduction to the use of first contact statistics for exploratory analysis and statistical inference is given, e.g., in Stoyan et al. (1987). The statistical aspects of the edge correction techniques presented here are mainly due to Baddeley & Gill (1993), Hansen et al. (1995, 1996) and Chiu & Stoyan (1994).

Martin B. Hansen

Barcharts and Class Characterization with Taxonomic Qualitative Variables

In many real applications, values of some variables are organized in taxonomies, i.e. there exists a hierarchy among the different values which can be taken by the variable. In this paper, we investigate the use of these taxonomies in two basic processes in statistical data analysis: construction of barcharts of qualitative variables, and characterization of classes of individuals by qualitative variables. Different problems appear regarding this approach and are described in this paper: storage management of arrays of data with taxonomic qualitative variables, extension of the standard concepts of barcharts and class characterization, and graphical representation of results. To demonstrate the effectiveness of the approach, a first prototype has been developed, combining two tools: the SPLUS language for data management and statistical computations, and the VCG tool for easy visualization of results.

Georges Hebrail, Jane-Elise Tanzy

Prediction of Failure Events when No Failures have Occurred

Failure of some components is an extremely rare event. Even if such components have been in service for some time, it is possible that no failures have occurred. This paper will describe methods that have been developed to analyze data of this nature for an aircraft component. These methods involve modeling the failure distribution and aircraft fleet, and application of bootstrap methodology.

Stephen P. Jones
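A common starting point for such zero-failure data (distinct from the fleet modelling and bootstrap methodology described above) is the exact binomial bound: if n independent demands produce no failures, a one-sided 100(1−α)% upper confidence bound on the per-demand failure probability is 1 − α^(1/n), roughly 3/n for α = 0.05 (the "rule of three"). The function name and the choice n = 1000 below are illustrative:

```python
# Exact one-sided upper confidence bound on a failure probability after
# observing 0 failures in n trials: solve (1 - p)^n = alpha for p.
def zero_failure_upper_bound(n, alpha=0.05):
    return 1.0 - alpha ** (1.0 / n)

n = 1000
ub = zero_failure_upper_bound(n)
print(round(ub, 5))   # → 0.00299, close to the rule-of-three value 3/n = 0.003
```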

Generalising Regression and Discriminant Analysis: Catastrophe Models for Plasma Confinement and Threshold Data

It has long been known that, for linear models, there exist strong formal interrelationships between regression analysis and discriminant analysis with equal covariance matrices (Flury and Riedwyl, 1988). These are related to invariance of estimating formulas and of the null-distribution of some statistics (such as the empirical correlation coefficient) under the duality transformation of interchanging the random aspect of the variables in a regression problem (Kshirsagar, 1972). In fusion-oriented plasma physics, both types of analyses have been used in the context of confinement time analysis and the determination of existence regions for particular types of confinement discharges (L-mode, H-mode, etc.), respectively (Yushmanov et al., 1990, Kardaun et al., 1992, Christiansen et al., 1992, H-mode Database Working Group, presented by O. Kardaun, 1992, H-mode DBWG, presented by D. Schissel, 1993, H-mode DBWG, presented by F. Ryter, 1996). The scientific interest is in providing a communicative summary between a wealth of experimental results and concepts from plasma physical theory, as well as in making predictions for long-term international future devices such as ITER (Tomabeschi et al. 1991). Due to various complexities, both the physics and the empirical scaling behaviour of plasma confinement turn out to be elusive matters, difficult to nail down accurately. This leads to considerable prediction margins for future machines, with consequently possibly increased construction costs, and may prove a serious obstacle for down-sizing the successors of ITER to commercially and environmentally viable reactors. To improve the situation, a concentrated long-term effort of experimental and theoretical plasma physics, in combination with applied modelling and statistical data analysis, is needed. In this paper, we describe aspects of a unifying statistical procedure that might be helpful for fitting problems in this context.

O. J. W. F. Kardaun, A. Kus

Parallel Strategies for Estimating the Parameters of a Modified Regression Model on a SIMD Array Processor

The various problems associated with block modifying the standard regression model are described. The performance, on a SIMD computer, of a new bitonic algorithm for solving the updating problem is considered and its adaptation for solving the downdating problem is discussed.

Erricos J. Kontoghiorghes, Elias Dinenis, Maurice Clint

Stochastic Algorithms in Estimating Regression Models

The optimization problem may be formulated as follows: for a given objective function f: Ω → R, Ω ⊂ R^d, the point x* is to be found such that $$f\left( {{\rm{x*}}} \right) = \mathop {\min }\limits_{{\rm{x}} \in \Omega } f\left( {\rm{x}} \right).$$ It is evident that the point x* represents the global minimum of the real-valued function f (of d variables) in Ω.

Ivan Krivý, Josef Tvrdík
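The simplest member of this family of methods is pure random search: sample Ω uniformly and keep the best point. A sketch with an invented objective whose global minimum is 0 at (1, 1); the algorithms the paper studies refine this basic idea considerably:

```python
import numpy as np

def random_search(f, low, high, n=20_000, seed=0):
    # Pure random search: n uniform samples in the box [low, high],
    # return the best point found and its objective value.
    rng = np.random.default_rng(seed)
    X = rng.uniform(low, high, size=(n, len(low)))
    vals = np.apply_along_axis(f, 1, X)
    i = np.argmin(vals)
    return X[i], vals[i]

# Toy objective: global minimum 0 at (1, 1) on Omega = [-2, 2]^2.
f = lambda x: (x[0] - 1) ** 2 + (x[1] - 1) ** 2
x_star, f_star = random_search(f, np.array([-2.0, -2.0]), np.array([2.0, 2.0]))
print(f_star < 0.01)
```

Convergence is slow but requires nothing of f beyond the ability to evaluate it, which is the appeal of stochastic algorithms for awkward regression criteria.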

Generalized Nonlinear Models

Use of the generalized linear model framework makes it possible to fit a wide range of nonlinear models by a relatively fast and robust method. This involves fitting generalized linear models at each stage of a nonlinear search for a few of the parameters in the model. Applications include probit and logit analysis with control mortality, estimation of transformations for explanatory variables, and additive models requiring adjustment of a smoothing variable.

Peter W. Lane

The Use of Statistical Methods for Operational and Strategic Forecasting in European Industry

Forecasting is essential for business managers, economists, scientists and engineers. Forecasting techniques range from purely subjective guesses to complex quantitative techniques. As a recent investigation shows (Lewandowski (1996)), 90% of European companies with an annual turnover of more than 200 million ECU will have to use efficient sales planning and forecasting systems in marketing, logistics, production and the financial sector in the next ten years for their operational and strategic planning, in order to guarantee the necessary productivity improvement. Great advances in the techniques used in forecasting have been made over the last few decades, partly due to the greater use of computer methods and systems for the processing of these statistics. However, it is still largely the case that these, often basic, methods are not employed in organisations because of the lack of skilled statisticians and experts to carry out the analysis, and the difficulties of integrating forecasting systems into management systems. In this paper we present FORCE4, a system for the use of advanced and powerful statistical techniques for forecasting, including seasonal analysis. The system is aimed at a wide spectrum of final users in industry, government and service organisations. The system is adapted to the role that forecasting plays in any organisation's strategy.

R. Lewandowski, I. Solé, J. M. Catot, J. Lorés

Bayesian Analysis for Likelihood-Based Nonparametric Regression

In the framework of a likelihood regression model, the estimator of the response function is constructed from a set of functional units. The parameters defining these functional units are estimated with the help of a Bayesian approach. The sample from the Bayes posterior distribution is obtained by an MCMC procedure based on a combination of the Gibbs and Metropolis-Hastings algorithms. The method is described for the case of the logistic regression model and for histogram and radial basis function estimators of the response function.
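A minimal illustration of this kind of sampler — assuming, for the sketch, a one-parameter logistic model with a flat prior and plain random-walk Metropolis in place of the Gibbs/Metropolis-Hastings combination used in the paper — is:

```python
import numpy as np

def metropolis_logistic(x, y, n_iter=5000, step=0.5, seed=0):
    """Random-walk Metropolis sampling from the posterior of the slope
    beta in a logistic model P(y=1|x) = 1/(1 + exp(-beta*x)), flat prior."""
    rng = np.random.default_rng(seed)

    def log_lik(beta):
        eta = beta * x
        return np.sum(y * eta - np.log1p(np.exp(eta)))

    beta, ll = 0.0, log_lik(0.0)
    draws = []
    for _ in range(n_iter):
        cand = beta + step * rng.normal()      # random-walk proposal
        ll_cand = log_lik(cand)
        if np.log(rng.uniform()) < ll_cand - ll:  # Metropolis accept/reject
            beta, ll = cand, ll_cand
        draws.append(beta)
    return np.array(draws)

# Data generated with a positive slope; the posterior should favour beta > 0.
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = (rng.uniform(size=100) < 1.0 / (1.0 + np.exp(-2.0 * x))).astype(float)
draws = metropolis_logistic(x, y)
print(draws[1000:].mean())  # posterior mean after burn-in, roughly near 2
```

The first 1000 draws are discarded as burn-in; in practice one would also monitor acceptance rates and mixing.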

A. Linka, J. Picek, P. Volf

Calculating the Exact Characteristics of Truncated Sequential Probability Ratio Tests Using Mathematica

As a result of pressures to cut pesticide use, schemes are in use or are being developed which involve sampling a crop for pests and only treating the crop if the pest is found in sufficient quantities to justify the use of the pesticide. Such schemes are known as supervised pest control. Although they produce a saving of pesticide, the cost of sampling is high and it is important that the sampling should be done as efficiently as possible.

James Lynn

How to Find Suitable Parametric Models using Genetic Algorithms. Application to Feedforward Neural Networks

Most nonlinear models based on polynomials, wavelets or neural networks have the universal approximation ability [Barron, 1993]. This ability allows nonlinear models to outperform linear models as soon as the problem includes nonlinear relations between variables. This can be a strong advantage, but this feature, together with the infinite variety of model structures, entails a danger named overfitting. Whatever the problem you attempt to solve using a nonlinear parametric model (classification, regression, control, …), in general you have a certain amount of data (which we call the learning base) that you use for parameter estimation. What you want is a model which gives good performance on a set of novel data (the test set). If you observe significantly worse results on the test set, the generalization ability of this model for this specific problem is poor and the model overfits the learning set. Estimating the parameters of a model with a lot of degrees of freedom (in general too many free parameters) on an insufficient amount of noisy data can yield an underestimation of the noise variance and overfitting of the data.
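The learning-base/test-set diagnosis described above can be sketched as follows; the data, the polynomial family and the degrees are illustrative choices, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy data from a smooth target; half for the learning base, half for the test set.
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)
x_learn, y_learn = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]

def errors(degree):
    """Fit a polynomial of the given degree on the learning base and
    return (learning error, test error) as mean squared residuals."""
    coef = np.polyfit(x_learn, y_learn, degree)
    mse = lambda xs, ys: np.mean((np.polyval(coef, xs) - ys) ** 2)
    return mse(x_learn, y_learn), mse(x_test, y_test)

for degree in (1, 3, 15):
    e_learn, e_test = errors(degree)
    print(f"degree {degree:2d}: learn {e_learn:.3f}  test {e_test:.3f}")
```

The high-degree fit drives the learning error down while the test error grows — the gap between the two is the overfitting signal the abstract refers to.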

M. Mangeas, C. Muller

Some Computational Aspects of Exact Maximum Likelihood Estimation of Time Series Models

It is well known (Ansley and Newbold 1980; Hillmer and Tiao 1979) that exact maximum likelihood estimation (EMLE) of time series models is usually preferable to other approximate estimation criteria. This is especially true in the case of small- to moderate-sized samples and/or parameters close to the boundaries of the admissible regions. Instead of pursuing this issue further, this paper focuses on some relevant computational aspects concerning the numerical maximization of the exact likelihood function of several time series models. The range of models considered here covers, among other more usual specifications, a new kind of seasonal univariate autoregressive-moving average (ARMA) model, single- and multiple-output transfer-function-noise models, vector ARMA models and, in general, time series models with parameters subject to certain constraints.

José Alberto Mauricio

Estimation After Model Building: A First Step

Suppose that a set of data is used to build a model and the same data are then used to estimate parameters in the model. If classical methods such as least squares or maximum likelihood are used to estimate the parameters as if the model had been decided a priori, then the parameter estimates will be biased. Only regression subset selection procedures are considered here, but the same problems of over-fitting exist with most model building procedures. Chatfield (1995) has reviewed some of the problems of inference after model building, while Bancroft & Han (1977) give an extensive bibliography.

Alan J. Miller

Logistic Classification Trees

This paper provides a methodology for growing exploratory trees that make it possible to understand, through statistical modeling, which variables are most significant in determining why an object is in one class rather than another. Logistic regression is used to model the dependence of a dichotomous response variable on a set of given predictors. An application to real data allows us to discuss the main advantages of the proposed procedure, especially for the analysis of real data sets whose dimensionality requires some form of variable selection.

Francesco Mola, Jan Klaschka, Roberta Siciliano

Computing High Breakdown Point Estimators for Planned Experiments and for Models with Qualitative Factors

We consider a linear model y = Xβ + z, where y = (y_1, …, y_N)^T ∈ IR^N is the observation vector, z = (z_1, …, z_N)^T ∈ IR^N the error vector, β ∈ IR^r the unknown parameter vector, and X = (x_1, …, x_N)^T ∈ IR^{N×r} the known design matrix.
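One standard route to high-breakdown estimation in such a model is the least median of squares criterion computed over random elemental subsets; the following is a hedged sketch of that generic approach, not the algorithm developed in the paper:

```python
import numpy as np

def lms_random_search(X, y, n_trials=2000, seed=0):
    """Least median of squares fit by random elemental subsets: repeatedly
    fit beta exactly to r randomly chosen observations and keep the fit
    minimising the median squared residual over all observations."""
    rng = np.random.default_rng(seed)
    n, r = X.shape
    best_beta, best_crit = None, np.inf
    for _ in range(n_trials):
        idx = rng.choice(n, size=r, replace=False)
        try:
            beta = np.linalg.solve(X[idx], y[idx])  # exact fit to the subset
        except np.linalg.LinAlgError:
            continue                                # singular subset: skip
        crit = np.median((y - X @ beta) ** 2)
        if crit < best_crit:
            best_beta, best_crit = beta, crit
    return best_beta

# Line y = 1 + 2x with 30% gross outliers: the LMS fit resists them.
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 50)
X = np.column_stack([np.ones(50), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=50)
y[:15] += 50.0                      # contaminate 30% of the observations
beta = lms_random_search(X, y)
print(beta)  # close to [1, 2] despite the contamination
```

Least squares would be pulled far off by the contaminated observations, whereas the median-based criterion ignores them as long as fewer than half the data are corrupted.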

Christine H. Müller

Posterior Simulation for Feed Forward Neural Network Models

We are interested in Bayesian inference and prediction with feed-forward neural network models (FFNNs), specifically those with one hidden layer with M hidden nodes, p input nodes, one output node and logistic activation functions: we try to predict a variable y in terms of p variables x = (x_1, …, x_p), with regression function $$y\left( x \right) = \sum\nolimits_{j = 1}^M {{\beta _j}\psi \left( {{\gamma _j}x} \right)} $$ where $$\psi \left( z \right) = {{\exp \left( z \right)} \over {1 + \exp \left( z \right)}}.$$ These and other neural network models are the central theme of recent research. Statistical introductions may be seen in Cheng and Titterington (1994) and Ripley (1993).
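The regression function above is straightforward to evaluate; the following sketch (with arbitrary illustrative weights, not taken from the paper) computes y(x) for one hidden layer:

```python
import numpy as np

def psi(z):
    """Logistic activation psi(z) = exp(z) / (1 + exp(z))."""
    return 1.0 / (1.0 + np.exp(-z))

def ffnn(x, beta, gamma):
    """One-hidden-layer FFNN: y(x) = sum_j beta_j * psi(gamma_j . x).

    x     : input vector of length p
    beta  : output weights, length M
    gamma : hidden-layer weights, shape (M, p); row j is gamma_j
    """
    return float(beta @ psi(gamma @ x))

# Illustrative weights for M = 2 hidden nodes and p = 3 inputs.
beta = np.array([1.0, -0.5])
gamma = np.array([[0.2, 0.1, -0.3],
                  [0.5, 0.0, 0.4]])
print(ffnn(np.array([1.0, 2.0, 3.0]), beta, gamma))
```

Bayesian inference then places a prior on (β, γ) and explores the posterior, typically by simulation, as the paper's title indicates.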

Peter Müller, David Rios Insua

Bivariate Survival Data Under Censoring: Simulation Procedure for Group Sequential Boundaries

In clinical trials, ethical considerations dictate that the accumulating data be analyzed for potential early termination due to treatment differences or adverse effects. Group sequential procedures take into account the effect of such interim analyses in univariate cases. When the outcome is bivariate and correlated, the problem is often simplified to a univariate situation, with a corresponding loss of information. We consider the bivariate exponential distribution of Sarkar to develop a parametric methodology for interim analysis of clinical trials. We first present the procedure for testing the hypothesis of no treatment difference assuming complete uncensored data. Secondly, we incorporate three types of censoring schemes into the procedure. Finally, we show how group sequential methods apply to the bivariate censored case. The method is illustrated by simulating two equal samples of size 500 from the bivariate exponential distribution of Sarkar. The samples for the experimental and the control groups were generated having mean failure times for each of the organs of 20 months and 16 months, respectively. Different correlations between the failure times of the organs were also considered. A program in C++ was written to obtain the estimators and standard errors using the Newton-Raphson procedure, into which we then incorporated the group sequential procedures. Numerical results are presented.

Sergio R. Muñoz, Shrikant I. Bangdiwala, Pranab K. Sen

The Wavelet Transform in Multivariate Data Analysis

Data analysis, for exploratory purposes, or prediction, is usually preceded by various data transformations and recoding. The wavelet transform offers a particularly appealing data transformation, as a preliminary to data analysis, for de-noising, smoothing, etc., in a natural and integrated way. For an introduction to the orthogonal wavelet transform, see e.g. Strang (1989), Daubechies (1992). We consider the signal's detail signal, ξ_m, at resolution levels m. With the residual, smoothed image, x_0, we have the wavelet transform of an input signal x as follows. Define ξ as the row-wise juxtaposition of all {ξ_m} and x_0, and consider W given by $$Wx = \xi = {\left\{ {{\xi _{N - 1}},...,{\xi _0},{x_0}} \right\}^T}$$ with W^T W = I (the identity matrix). Examples of these orthogonal wavelets are the Daubechies family, and the Haar wavelet transform (see Press et al., 1992; Daubechies, 1992). Computational time is O(n) for an n-length input data set.
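As an illustration, the Haar case of the orthogonal transform can be computed with the O(n) pyramid scheme; the coefficient ordering below is one possible convention for the juxtaposition of the {ξ_m} and x_0:

```python
import numpy as np

def haar_transform(x):
    """Orthogonal Haar wavelet transform of a length-2^J signal.

    Returns the detail coefficients, coarsest level first, followed by
    the final smooth x0. Each pass halves the smooth part, so the
    total work is O(n)."""
    s = np.asarray(x, dtype=float)
    details = []
    while s.size > 1:
        a, b = s[0::2], s[1::2]
        details.append((a - b) / np.sqrt(2.0))  # detail at this level
        s = (a + b) / np.sqrt(2.0)              # smoothed signal
    return np.concatenate(details[::-1] + [s])

x = np.array([4.0, 6.0, 10.0, 12.0])
w = haar_transform(x)
# Orthogonality (W^T W = I) implies the transform preserves energy.
print(np.allclose(np.sum(w**2), np.sum(x**2)))  # True
```

The energy check is exactly the W^T W = I property quoted in the abstract, specialised to the Haar basis.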

F. Murtagh, A. Aussem, O. J. W. F. Kardaun

“Replication-free” Optimal Designs in Regression Analysis

Let $$\matrix{ {{y_i} = f\left( {{x_i},\theta } \right) + {e_i},} & {i = 1,...,n,} & {{x_i} \in B \subset {{\rm{R}}^1}} \cr } $$ be a regression model with a regression function f and i.i.d. error terms e_i. The unknown parameter θ may possess p ≤ n components, i.e. θ^T = (θ_1, …, θ_p) ∈ Ω ⊂ R^p. We assume that the usual conditions for the asymptotic least squares theory are fulfilled (see Rasch, 1995, chapter 16). By ^θ we denote the least squares estimator of θ, and by V the (asymptotic) covariance matrix of ^θ, which may or may not depend on θ. Let Φ be any functional of V, monotonically decreasing in n, which is used as an optimality criterion for an optimal choice of the x_i ∈ B (i = 1, …, n); the x_i chosen are called an exact design. We call a design (locally or globally) Φ-optimal in B of size n if the set of the x_i defining the design minimizes the functional Φ amongst all possible designs in B of size n. The design is locally optimal if it depends on θ; otherwise it is globally optimal. A design of size n with r support points is called an exact r-point design of size n and can be written as $$\left( {\matrix{ {{x_1}} & {{x_2}} & \ldots & {{x_r}} \cr {{n_1}} & {{n_2}} & \ldots & {{n_r}} \cr } } \right),\,\sum\limits_{i = 1}^r {{n_i} = n,\,{n_i}\,{\rm{integer}}{\rm{.}}} $$ See Pukelsheim (1994) for more information.

Dieter A. M. K. Rasch

STEPS Towards Statistics

The STEPS (Statistical Education through Problem Solving) software will consist of about fifty modules designed to introduce students in Biology, Business Studies, Geography and Psychology to statistical ideas and concepts. They have been developed under the United Kingdom's Teaching and Learning Technology Programme (TLTP) by a consortium of Statistics departments from the Universities of Glasgow, Lancaster, Leeds, Nottingham Trent, UMIST, Sheffield and Reading. The first eight modules were released in September 1995 and the remainder will be completed and released by April 1996.

Edwin J. Redgern

File Grafting: a Data Sets Communication Tool

We present a statistical methodology, which we call file grafting, to visualise information coming from two different data sets. For this purpose it is necessary that the two data sets share a common space, defined by certain variables which act as a bridge between them. Moreover, certain conditions should be fulfilled to control the whole process and interpret the given results.

Roser Rius, Ramon Nonell, Tomàs Aluja-Banet

Projections on Convex Cones with Applications in Statistics

We investigate a random search algorithm for finding the projection onto a closed convex cone in R^p with respect to a norm defined by an arbitrary positive definite matrix. It will be shown that this algorithm converges almost surely. The power of the algorithm is demonstrated by examples for polyhedral cones.
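A naive random search of this flavour — restricted here, for the sketch, to the Euclidean norm and the nonnegative orthant, a polyhedral cone whose projection is known in closed form — can be written as:

```python
import numpy as np

def random_search_projection(y, in_cone, n_iter=20000, seed=0):
    """Naive random search for the Euclidean projection of y onto a
    closed convex cone given by the membership test `in_cone`.

    Starts from the apex 0 (which belongs to every cone) and accepts a
    random perturbation whenever it stays in the cone and reduces the
    distance ||y - x||; the step scale is shrunk geometrically."""
    rng = np.random.default_rng(seed)
    x = np.zeros_like(y)
    best = np.linalg.norm(y - x)
    scale = best
    for _ in range(n_iter):
        cand = x + rng.normal(size=y.size) * scale
        if in_cone(cand) and np.linalg.norm(y - cand) < best:
            x, best = cand, np.linalg.norm(y - cand)
        scale *= 0.9995
    return x

# Nonnegative orthant: the true projection is componentwise clipping,
# so the result should approach [1.5, 0.0, 0.7].
y = np.array([1.5, -2.0, 0.7])
x = random_search_projection(y, lambda v: np.all(v >= 0.0))
print(x)
```

For a general norm defined by a positive definite matrix A, the distance ||y − x|| would simply be replaced by the A-norm √((y − x)ᵀA(y − x)); the paper's contribution is the almost-sure convergence analysis, which this toy version does not attempt.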

Egmar Rödel

Partial Correlation Coefficient Comparison in Graphical Gaussian Models

In graphical Gaussian models the partial correlation coefficient is the natural measure of the interaction represented by an edge of the independence graph. In this paper we discuss the comparison of partial correlation coefficients in a graphical Gaussian model. Three tests of the null hypothesis H0: ρ12.3 = ρ13.2 in a trivariate Normal distribution with ρ23.1 = 0 are worked out. The methods include the likelihood ratio test and the restricted maximum likelihood estimates are provided in closed form. A sampling simulation study for comparing the three test statistics is carried out.

A. Roverato

The Robustness of Cross-over Designs to Error Mis-specification

Illustrations are given of the use of computer software developed by the authors to investigate the robustness of cross-over designs to a range of within-subject correlation structures using two performance criteria. Since the form of the correlation structure is generally unknown in practice, such an investigation can be used to guide the choice of design or to provide reassurance on a design's robustness to a range of plausible correlation structures.

K. G. Russell, J. E. Bost, S. M. Lewis, A. M. Dean

ISODEPTH: A Program for Depth Contours

Depth is a multivariate generalization of the concept of rank. The depth of a point relative to a data cloud gives an indication of how deep the point lies inside the cloud. The (average of the) point(s) with maximal depth can be thought of as a multivariate median.
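For illustration, the depth of a single point can be approximated by brute force in two dimensions, scanning directions for the closed halfplane through the point containing the fewest observations (ISODEPTH itself computes whole depth contours far more efficiently):

```python
import numpy as np

def halfspace_depth(theta, X, n_dir=360):
    """Approximate halfspace (Tukey) depth of the point theta in the
    cloud X: the smallest number of points in any closed halfplane whose
    boundary passes through theta, scanned over n_dir directions."""
    angles = np.linspace(0.0, np.pi, n_dir, endpoint=False)
    dirs = np.column_stack([np.cos(angles), np.sin(angles)])
    proj = (X - theta) @ dirs.T          # signed projections, shape (n, n_dir)
    counts = np.minimum((proj >= 0).sum(axis=0), (proj <= 0).sum(axis=0))
    return int(counts.min())

# The centre of a symmetric cloud is deep; an outlying point has depth 0.
X = np.array([[0, 1], [0, -1], [1, 0], [-1, 0], [0, 0]], dtype=float)
print(halfspace_depth(np.array([0.0, 0.0]), X))  # prints 3
print(halfspace_depth(np.array([5.0, 5.0]), X))  # lies outside: prints 0
```

The point of maximal depth over a grid of candidate locations then plays the role of the multivariate median mentioned above.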

I. Ruts, P. J. Rousseeuw

Non Parametric Control Charts for Sequential Process

In a recent paper, Scepi, Lauro and Balbi (1993) proposed an original approach to building multivariate control charts without imposing distributional assumptions on the variables of interest. The present paper extends the previous one in two directions: (i) dealing with time-dependent data, not necessarily based on the hypothesis of i.i.d. variables; (ii) controlling single observations of multivariate time series by means of a suitable control chart. In the following, it will be shown that by adopting a resampling algorithm called the "stationary bootstrap" (Politis, Romano, 1994), together with a three-way method of data analysis (STATIS; Escoufier, 1987), it is possible to derive non parametric control charts both for controlling a process observed in successive periods of time and for detecting units responsible for out-of-control situations (section 2). The ARL function for non parametric control charts is supplied in section 3. Finally, an application to industrial data drawn from the literature (Box, Jenkins, Reinsel, 1994) illustrates the sensitivity of our control charts in signalling out-of-control situations.

Germana Scepi, Antonio Acconcia

An Iterative Projection Algorithm and Some Simulation Results

An iterative projection method for large linear equation systems is described. It has favourable properties with respect to many statistical applications. A major advantage is that convergence can be established without restrictions on the system matrix. Hence diagonal dominance or regularity are not required. The reason why this numerical method has not been much used in computational statistics is its slow convergence behaviour. In this paper we introduce a relaxation concept and the optimal choice of the relaxation parameter, even for nearly singular systems, is studied in a simulation experiment.
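A classical iterative projection scheme of this type is Kaczmarz's method, in which each step projects the iterate onto the hyperplane defined by one equation; the following is a generic sketch with a relaxation parameter ω, not the paper's implementation:

```python
import numpy as np

def kaczmarz(A, b, omega=1.0, n_sweeps=200):
    """Kaczmarz-type iterative projection for the system A x = b.

    Each step projects the iterate onto the hyperplane of one equation;
    omega = 1 gives the pure projection, 0 < omega < 2 relaxes the step.
    No diagonal dominance or regularity of A is required."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    x = np.zeros(A.shape[1])
    row_norms = (A * A).sum(axis=1)
    for _ in range(n_sweeps):
        for i in range(A.shape[0]):
            resid = b[i] - A[i] @ x
            x = x + omega * (resid / row_norms[i]) * A[i]
    return x

A = [[3.0, 1.0], [1.0, 2.0]]
b = [9.0, 8.0]
print(kaczmarz(A, b))  # converges to the solution [2, 3]
```

The slow convergence the abstract mentions is visible when the rows of A are nearly parallel; tuning ω is then exactly the relaxation question the paper studies by simulation.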

Michael G. Schimek

Computational Asymptotics

This paper focuses on both numerical and symbolic methods that may prove useful for the purposes of data analysis but have, so far, not been implemented for routine use in most of the popular statistical packages. Emphasizing likelihood methods, I hope to demonstrate that (i) there are situations where standard numerical algorithms may easily be adapted to yield results more accurately related to the respective likelihood quantities than those obtained by quadratic, 'Wald-type' approximations; (ii) there are instances where relying on numerical algorithms may yield results highly sensitive to quantities that may not be computed to adequate precision; and (iii) symbolic computations may be successfully employed to obtain numerically accurate and mathematically correct results, even if the derivations involved are tedious and too messy to be done by hand.

G. U. H. Seeber

An Algorithm for Detecting the Number of Knots in Non Linear Principal Component Analysis

Principal Component Analysis (PCA) aims at finding “few” linear combinations of the original variables which have “maximal” variance, losing in that summarizing process as little information as possible. The usual computational tool in PCA consists in the singular value decomposition of the observed individuals-variables matrix, centred with respect to the mean vector, and in its lower-rank approximation (in the least-squares sense). Determining this lower rank is a critical point for the method.

Gerarda Tessitore, Simona Balbi

Generation and Investigation of Multivariate Distributions having Fixed Discrete Marginals

Let n be the number of observations. We say that a distribution P is empirical (corresponding to the sample size n) if it is discrete with probability function P(a_i) = n_i/n, ∑ n_i = n. Throughout the paper we regard n as fixed, and all distributions will be empirical corresponding to the value n. Let P_1, P_2, …, P_k be univariate empirical distributions. The aim of the paper is to investigate the set ∏(P_1, …, P_k; n) = ∏_n of all k-variate empirical distributions having the P_i as marginals. It is evident that in the case of empirical distributions the set ∏_n is finite. We will discuss the following problems: 1. Find an algorithm for generating all distributions belonging to ∏_n and calculate/estimate the power (cardinality) of the set ∏_n. 2. Describe the set ∏_n using the concepts of multivariate extremal distributions and partial ordering in ∏_n. 3. Find the connection between the ordering and some coefficients of dependence. 4. Define a probability measure on the set ∏_n.
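Problem 1 can be illustrated in the smallest case, k = 2 with binary marginals, where the members of ∏_n are 2×2 contingency tables of counts with fixed margins; the margins below are illustrative, not from the paper:

```python
def tables_with_margins(row, col):
    """Enumerate all 2x2 tables of counts n_ij with the given row and
    column margins -- the set Pi(P1, P2; n) for two binary marginals."""
    assert sum(row) == sum(col)
    tables = []
    for n00 in range(min(row[0], col[0]) + 1):
        n01 = row[0] - n00          # remaining count in row 0
        n10 = col[0] - n00          # remaining count in column 0
        n11 = row[1] - n10          # forced by the second row margin
        if n01 >= 0 and n10 >= 0 and n11 >= 0 and n01 <= col[1]:
            tables.append([[n00, n01], [n10, n11]])
    return tables

# Margins (3, 2) and (2, 3) with n = 5 observations: three tables exist.
for t in tables_with_margins((3, 2), (2, 3)):
    print(t)
```

The length of the enumeration is exactly the "power" of ∏_n in this toy case; for larger k and richer marginals the paper's algorithmic treatment is needed.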

E.-M. Tiit, E. Käärik

A Simulation Framework for Re-estimation of Parameters in a Population Model for Application to a Particular Locality

A model describing the population dynamics of a given animal population can help wildlife managers to explore the consequences of management strategies. A useful model will give predictions of the future size and structure of a population. The parameters of a population dynamics model are survival and reproduction rates, which can themselves be functions of other parameters. Often the data are too sparse to obtain reliable estimates for these parameters for a population of interest, so that data from a different but hopefully similar population of the same species or, failing that, knowledge about related species must be used to provide estimates. Demographic parameters may be a function of environmental factors, so that we need to tailor these estimates to the population the model is to be applied to.

Verena M. Trenkel, David A. Elston, Stephen T. Buckland

A Semi-Fuzzy Partition Algorithm

The aim of the present paper is to introduce a semi-fuzzy partition algorithm in order to take into account the advantages of both fuzzy and hard classification methods. It keeps the information on mixed objects without losing the sharpness of the pure objects. The rule assigning the objects to the classes, in a fuzzy or in a hard way, is based on the empirical distributions of the squared Mahalanobis distances of the objects from the barycenters (or prototypes) of each fuzzy class. The proposed algorithm, initialized by the classical fuzzy k-means algorithm (Bezdek, 1974; Dunn, 1974), computes iteratively the optimal number k of fuzzy clusters and the optimal fuzziness degree of the memberships of the objects to the clusters.

Rosanna Verde, Domenica Matranga

Estimation in Two-Sample Nonproportional Hazards Models in Clinical Trials by an Algorithmic Method

A regression nonproportional hazards model in which the structural parameter is the vector of regression coefficients is considered. Jointly (implicitly) defined estimators of the structural and nuisance parameters are proposed and, for the special case of the two-sample problem, an algorithmic procedure that provides these estimators is designed. The behavior of the algorithm is illustrated through extensive simulation of survival data.

Filia Vonta

How to Obtain Efficient Exact Designs from Optimal Approximate Designs

During the last decade most statistical general-purpose packages have included a module for experimental design. These modules offer a wide variety of designs with built-in structure, such as Graeco-Latin squares, factorial designs, Plackett-Burman designs, orthogonal arrays, block designs, balanced incomplete block designs, and rotatable designs, but only a few routines are found which investigate the optimality aspect of designs.

Adalbert Wilhelm
