
## About this book

This volume contains the Keynote, Invited and Full Contributed papers presented at COMPSTAT 2000. A companion volume (Jansen & Bethlehem, 2000) contains papers describing the Short Communications and Posters. COMPSTAT is a one-week conference held every two years under the auspices of the International Association of Statistical Computing, a section of the International Statistical Institute. COMPSTAT 2000 is jointly organised by the Department of Methodology and Statistics of the Faculty of Social Sciences of Utrecht University, and Statistics Netherlands. It is taking place from 21 to 25 August 2000 at Utrecht University. Previous COMPSTATs (from 1974 to 1998) were in Vienna, Berlin, Leiden, Edinburgh, Toulouse, Prague, Rome, Copenhagen, Dubrovnik, Neuchatel, Vienna, Barcelona and Bristol. The conference is the main European forum for developments at the interface between statistics and computing. This was encapsulated as follows on the COMPSTAT 2000 homepage, http://neon.vb.cbs.nl/rsm/compstat: Statistical computing provides the link between statistical theory and applied statistics. As at previous COMPSTATs, the scientific programme will range over all aspects of this link, from the development and implementation of new statistical ideas through to user experiences and software evaluation. The programme should appeal to anyone working in statistics and using computers, whether in universities, industrial companies, research institutes or as software developers. At COMPSTAT 2000 there is a special interest in the interplay with official statistics. This is evident from papers in the area of computerised data collection, survey methodology, treatment of missing data, and the like.

## Table of contents

### The broad role of multiple imputation in statistical science

Nearly a quarter century ago, the basic idea of multiple imputation was proposed as a way to deal with missing values due to nonresponse in sample surveys. Since that time, the essential formulation has expanded to be proposed for use in a remarkably broad range of empirical problems, from many standard social science and biomedical applications involving missing data in surveys and experiments, to nonstandard survey and experimental applications, such as preserving confidentiality in public-use surveys and dealing with noncompliance and “censoring due to death” in clinical trials, to common “hard science” applications such as dealing with below-threshold chemometric measurements, to other scientific or medical applications such as imaging brains for tumors, and exploring the genetics of schizophrenia. The purpose of this presentation is to provide some links to this broad range of applications and to indicate the associated computing requirements, primarily using examples in which I am currently involved.

Donald B. Rubin

### Official Statistics: an estimation strategy for the IT-era

This paper gives an overview of the efforts at the Department of Statistical Methods of Statistics Netherlands to develop a design based strategy aimed towards full numerical consistency of its statistical estimates, given a matched data file containing all available register and survey data about the target population. The key element of the estimation procedure is repeated re-calibration of survey data sets, i.e. a new set of raising weights is derived each time additional estimates are being produced. This allows us to take account of any related statistics obtained in earlier rounds of estimation.

P. Kooiman, A. H. Kroese, R. H. Renssen

### Bayesian model selection methods for nonnested models

In the Bayesian approach to model selection and prediction, the posterior probability of each model under consideration must be computed. In the presence of weak prior information we need to use default or automatic priors, which are typically improper, for the parameters of the models. However, this leads to ill-defined posterior probabilities. Several methods have recently been proposed to overcome this difficulty. Of particular interest are the intrinsic and fractional methodologies introduced by Berger and Pericchi (1996) and O’Hagan (1995), respectively. In the specific case of comparing nested models, a significant feature of these methods is that they allow proper priors to be derived for use in the analysis. Unfortunately, the above methods do not apply to nonnested models: either the priors derived depend on the label we assign to the models under comparison, which may give two possible values for the posterior probability of each model, or they depend on the form of “encompassing”. We first consider a particular nonnested model selection problem: the one-sided testing problem. The encompassing approach for converting the nonnested problem into a nested one is discussed, and an alternative solution is proposed. Its behavior is illustrated on exponential distributions. For more general nonnested model selection problems we argue that an accessible piece of prior information on the observable random variable helps to find the posterior probability of the models. The way to deal with such prior information is developed and illustrated on the comparison of separate normal and double exponential families of distributions.

Francesco Bertolino, Elias Moreno, Walter Racugno

### Spatio-temporal hierarchical modeling of an infectious disease from (simulated) count data

An infectious disease spreads through “contact” between an individual who has the disease and one who does not. However, modeling the individual-level mechanism directly requires data that would amount to observing (imperfectly) all individuals’ disease statuses along their space-time lines in the region and time period of interest. More likely, data consist of spatio-temporal aggregations that give small-area counts of the number infected during successive, regular time intervals. In this paper, we give a spatially descriptive, temporally dynamic hierarchical model to be fitted to such data. The dynamics of infection are described by just a few parameters, which can be interpreted. We take a Bayesian approach to the analysis of these space-time count data, using Markov chain Monte Carlo to compute Bayes estimates of all parameters of interest. As a “proof of concept,” we simulate data from the model and investigate how well our approach recovers important hidden features.

Noel Cressie, Andrew S. Mugglin

### GBMs: GLMs with bilinear terms

Generalized bilinear models can hide under a variety of denominations. These correspond broadly to two types of statistical activities which are often combined: exploratory analysis and explicit modeling. It turns out that the wide diversity of correlative methods can be considered as minor variations of a single model which is introduced in the standard framework set for generalized linear models. The paper presents this unifying approach and illustrates some of its strengths.

Antoine de Falguerolles

### Generalized calibration and application to weighting for non-response

A generalised theory of calibration is developed, distinguishing two sets of variables and leading to instrumental regression estimation in the linear case. The asymmetry between the two sets of variables has a very interesting application when we apply generalised calibration to the problem of weighting for non-response: one set of variables is connected to factors inducing non-response, the second to variables correlated with the variable of interest. A calibration principle is proposed as an estimation method for the parameters of the response model. Its advantage is that it produces the reduction of variance associated with calibration. A complete treatment is given in the case of an exhaustive survey, and some indications are given for the general case. We also show that “weighting-like” imputation can be performed by using balanced sampling techniques.

Jean-Claude Deville

### Methodological issues in data mining

The nature of the new science of data mining is examined, drawing attention to the concepts and ideas it has inherited from areas such as statistics, machine learning, and database technology. Particular issues looked at include the emphasis on algorithms as well as models, and the fundamental importance of data quality to data mining exercises.

David J. Hand

### Practical data mining in a large utility company

In this paper we present the main applications of data mining techniques at Electricité de France, the French national electric power company. These include electric load curve analysis and prediction of customer characteristics. Closely related to data mining techniques are data warehouse management problems: we show that statistical methods can be used to help manage data consistency and to provide accurate reports even when missing data are present.

Georges Hébrail

### HGLMs for analysis of correlated non-normal data

Hierarchical generalized linear models (HGLMs) are developed as a synthesis of (i) generalized linear models (GLMs), (ii) mixed linear models, (iii) joint modelling of mean and dispersion and (iv) modelling of spatial and temporal correlations. Statistical inferences for complicated phenomena can be made from such an HGLM, which is capable of being decomposed into diverse component GLMs, allowing the application of standard GLM procedures to those components, in particular those for model checking.

Youngjo Lee, John A. Nelder

### Bootstrapping impulse responses in VAR analyses

Because the parameters of vector autoregressive processes are often difficult to interpret directly, econometricians use quantities derived from the parameters to disentangle the relationships between the variables. Bootstrap methods are often used for inference on the derived quantities. Alternative bootstrap methods for this purpose are discussed, some related problems are pointed out and proposals are presented to overcome the difficulties at least partly. Some remaining problems are presented.

Helmut Lütkepohl

### An application of TRAMO-SEATS; model selection and out-of-sample performance. The Swiss CPI series

The programs TRAMO, “Time Series Regression with ARIMA Noise, Missing Observations and Outliers”, and SEATS, “Signal Extraction in ARIMA Time Series” (Gomez and Maravall, 1996) have experienced an explosion in their use by data producing agencies and short-term economic analysts. TRAMO is a program for estimation and forecasting of regression models with possibly nonstationary ARIMA errors and any sequence of missing values. The program interpolates these values, identifies and corrects for several types of outliers, and estimates special effects such as Trading Day and Easter and, in general, intervention-variable type effects. Fully automatic procedures are available. SEATS is a program for estimation of unobserved components in time series following the so-called ARIMA-model-based (AMB) method. The trend-cycle, seasonal, irregular and perhaps transitory components are estimated and forecasted with signal extraction techniques applied to ARIMA models. The two programs are structured so as to be used together, both for in-depth analysis of a few series and for automatic routine applications to a large number of series. When used for seasonal adjustment, TRAMO preadjusts the series to be adjusted by SEATS. The two programs are officially used (and recommended) by Eurostat and, together with X12 ARIMA, by the European Central Bank.

Agustín Maravall, Fernando J. Sánchez

### Spreadsheets as tools for statistical computing and statistics education

Spreadsheets are a ubiquitous program category, and we will discuss their use in statistics and statistics education on various levels, ranging from very basic examples to extremely powerful methods. Since the spreadsheet paradigm is very familiar to many potential users, using it as the interface to statistical methods can make statistics more easily accessible.

Erich Neuwirth

### An algorithm for deepest multiple regression

Deepest regression (DR) is a method for linear regression introduced by Rousseeuw and Hubert (1999). The DR is defined as the fit with largest regression depth relative to the data. DR is a robust regression method. We construct an approximate algorithm for fast computation of DR in more than two dimensions. We also construct simultaneous confidence regions for the true unknown parameters, based on bootstrapped estimates.

Peter J. Rousseeuw, Stefan Van Aelst

### Non-proportional hazards models in survival analysis

Cox’s proportional hazards model is usually the model of choice in survival analysis. It is shown that this model can be embedded in a GLM by proper discretization of the time axis. That approach easily allows non-proportional hazards models, which are special cases of time-varying coefficient models. It is shown how the effective dimension of the general non-proportional hazards model can be controlled by either reduced rank regression methods or P-splines methodology.

Hans C. van Houwelingen, Paul H. C. Eilers

### A spatio-temporal analysis of a field trial

A study involving the growth of Australian eucalypts under irrigation regimes with varying levels of salinity and nutrients was conducted in Loxton, South Australia. The field experiment was conducted over a period of six years. The aim of the study was to determine the impact of salinity on the growth of eucalypts and to provide recommendations on the commercial suitability of growing eucalypts using saline drainage water. The data have both spatial and temporal aspects, which are examined in this paper. Spatial modelling follows current methods for field trials while the temporal modelling involves smoothing splines. A joint mixed model is developed which uses a mixed model representation of the smoothing spline.

Arūnas Verbyla, Michelle Lorimer, Robert Stevens

### Principal component logistic regression

The objective of this paper is to develop an extension of principal component regression for multiple logistic regression with continuous covariates. A practical application with simulated data will be included where the accuracy of the proposed principal component logistic regression model will be evaluated starting from the estimated parameters and probabilities.

A. M. Aguilera, M. Escabias

### Sieve bootstrap prediction intervals

When studying a time series, one of the main goals is the estimation of forecast confidence intervals based on an observed trajectory of the process. The traditional approach of finding prediction intervals for a linear time series assumes that the distribution of the error process is known. Thus, these prediction intervals could be adversely affected by departures from the true underlying distribution.

Andrés M. Alonso, Daniel Peña, Juan Romo

### Clustering by maximizing a fuzzy classification maximum likelihood criterion

Basing cluster analysis on mixture models has become a classical and powerful approach. In this paper we propose an extension of this approach to fuzzy clustering. We define a fuzzy clustering criterion which generalizes both the maximum likelihood and the classification maximum likelihood criteria. Finally, using a generalization of the well-known EM and CEM algorithms, we design an algorithm to optimize this criterion.

Christophe Ambroise, Gérard Govaert

### Tree-based algorithms for missing data imputation

Let X be an N × (p+q) data matrix, with entries partly missing in the last q columns. A problem of practical relevance is that of drawing inferences from such an incomplete data set. We propose to use a sequence of trees to impute missing values. Essentially, the two algorithms we introduce can be viewed as predictive matching methods. Among their advantages is their flexibility: they make no assumptions about the type or distribution of the variables.

M. J. Bárcena, F. Tusell

### MiPy: a system for generating multiple imputations

Multiple imputation has proven to be a useful mode of inference in the presence of missing data. It is a Monte-Carlo based methodology in which missing values are imputed multiple times by draws from a (typically explicit) imputation model. Generating “proper” imputations under reasonable models for the missing-data and complete-data mechanisms is often a computationally challenging task. The lack of software for generating multiple imputations is a serious impediment to the routine adoption of multiple imputation for handling missing data. Several groups have developed software for generating imputations, most of which is model specific, but none of this software is open, flexible, or extensible. In this paper I will introduce a computer software system, called MiPy, for generating multiple imputations under a wide variety of models using several computational approaches. The system is constructed from a combination of Python, an object-oriented and portable high-level interpreted language, and compiled modules in C, C++, and Fortran. MiPy features a clean syntax, simple GUI, open source, and portability to all of the major operating system platforms. In MiPy, Python can be viewed as the glue language that ties together computationally intensive modules written in lower-level languages.

John Barnard

### A linear approximation to the wild bootstrap in specification testing

The specification of a nonlinear regression model $$E[Y \mid X = x] = f(x, \theta)$$ for $$(Y, X) \sim D$$, with a known function $$f: \mathbb{R}^d \times \Theta \to \mathbb{R}$$, is to be tested. A possible test statistic is $${\hat T_n} = \tfrac{1}{n}\sum _{1 \leqslant i < j \leqslant n}{\hat U_i}{K_{ij}}{\hat U_j}$$, where the $${\hat U_i}$$ denote parametrically estimated residuals and the $$K_{ij}$$ are kernel weights. Usually the wild bootstrap algorithm (Wu, 1986) is used for deriving the critical values. Using the structure of a U-statistic inherent in $${\hat T_n}$$, it is possible to approach its limiting distribution directly. The resulting Monte Carlo approximation can be viewed as a linear approximation to the wild bootstrap that consumes substantially less computer time for nonlinear models. In simulations this Monte Carlo approximation was demonstrated to be applicable. The theoretical foundation lies in an asymptotic consideration that differs from the usual assumptions: the kernel weights $$K_{ij}$$ depend on a bandwidth $$h$$ that is held fixed here, contrary to the usual setting $$h = h_n \to 0$$. Thus the effects of $$n \to \infty$$ and $$h_n \to 0$$ are separated.

Knut Bartels
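To make the statistic in the abstract above concrete, here is a minimal sketch of how $${\hat T_n}$$ could be computed for scalar regressors, together with a simple sign-flip (Rademacher multiplier) approximation of a critical value. The Gaussian kernel, the function names, and the choice to flip the signs of the fitted residuals without refitting the model are all assumptions made for illustration; this is neither the full wild bootstrap nor the paper's own approximation.

```python
import math
import random


def kernel_weights(x, h):
    """Gaussian kernel weights K_ij for scalar regressors x at a fixed bandwidth h (assumed kernel)."""
    n = len(x)
    return [[math.exp(-0.5 * ((x[i] - x[j]) / h) ** 2) for j in range(n)]
            for i in range(n)]


def t_statistic(u, K):
    """T_n = (1/n) * sum over i < j of u_i * K_ij * u_j."""
    n = len(u)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += u[i] * K[i][j] * u[j]
    return total / n


def multiplier_critical_value(u, K, level=0.05, draws=500, seed=0):
    """Upper critical value from random sign flips of the residuals.

    Simplification (assumption): the model is not refitted in each draw.
    """
    rng = random.Random(seed)
    stats = []
    for _ in range(draws):
        flipped = [rng.choice((-1.0, 1.0)) * ui for ui in u]
        stats.append(t_statistic(flipped, K))
    stats.sort()
    return stats[min(draws - 1, int((1 - level) * draws))]
```

A test would reject the specification when `t_statistic(residuals, K)` exceeds the value returned by `multiplier_critical_value(residuals, K)`.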

### The size of the largest nonidentifiable outlier as a performance criterion for multivariate outlier identification: the case of high-dimensional data

Various procedures exist for identifying outliers in multivariate data. To decide which identification rule should be chosen, several performance criteria can be used. We investigate here the problem of multivariate simultaneous outlier identification and concentrate on the criterion of the size of the largest nonidentifiable outlier. Four outlier identification rules are compared with respect to this criterion. Our main focus is on a comparison of the rules in high-dimensional data situations, and we present the results of a simulation study in an accordingly chosen 10-dimensional setting.

Claudia Becker

### The influence of data generation and imputation methods on the bias of factor analysis of rating scale data

This paper focuses on the bias as a result of imputation methods applied to psychological questionnaire data. Multidimensional rating scale data were generated using three different models. A simulation was carried out with, among others, factors Method of Data Generation, and Imputation Method. It was found that imputation of the mean for each person separately had little bias whereas item mean imputation could result in severe underestimation of factor loadings.

Coen A. Bernaards
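The two imputation schemes contrasted in the abstract above can be sketched as follows. This is only an illustration under assumed conventions: the data are a list of rows (persons) by columns (items), missing entries are marked with `None`, and the function names are hypothetical.

```python
def _mean(values):
    """Mean of the observed (non-None) values, or None if nothing is observed."""
    observed = [v for v in values if v is not None]
    return sum(observed) / len(observed) if observed else None


def impute_person_mean(data):
    """Replace each missing entry by the mean of the same person's observed items."""
    result = []
    for row in data:
        m = _mean(row)
        result.append([m if v is None else v for v in row])
    return result


def impute_item_mean(data):
    """Replace each missing entry by the mean of the same item over all persons."""
    item_means = [_mean(col) for col in zip(*data)]
    return [[item_means[j] if v is None else v for j, v in enumerate(row)]
            for row in data]
```

For example, with `data = [[1, None, 3], [4, 5, None]]`, person mean imputation fills the gaps with 2 and 4.5, whereas item mean imputation fills them with 5 and 3.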

### Model on a population and prediction on another one: a generalized discriminant rule

Traditionally, discriminant analysis for decision purposes proceeds in the following manner (McLachlan 1992): a sample is drawn from a population, and a partition of this sample into two classes, for example males and females, is known. Using some variables, an allocation rule is established in order to classify other elements of the same population. An underlying assumption of this procedure is that the learning sample is representative of the population, i.e. its parameters for the predictive features are not statistically different from those of the population.

Christophe Biernacki, Farid Beninel, Vincent Bretagnolle

### Disclosure control on multi-way tables by means of the shuttle algorithm: extensions and experiments

In this paper we re-examine our algorithm for calculating the lower and upper bounds of an array given the complete set of its marginals, proposing some extensions. The algorithm has some interesting properties that can be useful in various fields of application, such as statistical disclosure control of count tables. These properties involve both theoretical and computational issues: in particular, the algorithm has relevant links with probabilistic and statistical aspects (e.g. Fréchet and Bonferroni bounds) and is particularly easy to implement, has a low storage requirement and is very fast.

Lucia Buzzigoli, Antonio Giusti
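As a small illustration of the kind of bounds mentioned in the abstract above, the classical Fréchet bounds for a two-way table of counts follow directly from its row and column totals: each cell lies between max(0, r_i + c_j - N) and min(r_i, c_j). The sketch below covers only this two-way special case and is not the shuttle algorithm itself; the function name is hypothetical.

```python
def frechet_bounds(row_totals, col_totals):
    """Cell-wise Fréchet lower and upper bounds for a two-way table of counts."""
    n = sum(row_totals)
    if n != sum(col_totals):
        raise ValueError("row and column totals must have the same grand total")
    lower = [[max(0, r + c - n) for c in col_totals] for r in row_totals]
    upper = [[min(r, c) for c in col_totals] for r in row_totals]
    return lower, upper
```

With row totals `[3, 7]` and column totals `[4, 6]`, the cell in the second row and second column is bounded between 3 and 6. In a disclosure control setting, a narrow interval of this kind signals that the published marginals reveal the cell almost exactly.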

### An MLE strategy for combining optimally pruned decision trees

This paper provides a maximum likelihood estimation strategy to identify a tree-based model which, being a function of a set of observed optimally pruned trees, represents the final classification model. The strategy is based on a probability distribution and it uses a metric based on structural differences among trees. An example on a real dataset is also presented to show how the procedure works.

Carmela Cappelli, William D. Shannon

### Semi-parametric models for data mining

In order to combine the exactness of a very large data set with the major predictability of statistical modeling, we introduce a two-step methodology that makes use of partitioning algorithms and semi-parametric models. The result is an alternative strategy for supervised classification and prediction problems when dealing with huge and complex data sets, in order to improve the predictability of the dependent variable on the basis of the previous detection of homogeneous sub-populations.

Claudio Conversano, Francesco Mola

### Preliminary estimation of ARFIMA models

In this article we propose a preliminary estimator for the parameters of an ARFIMA(p,d,q) model. The estimation procedure is based on the search of the element in the class of ARFIMA models closest to the estimated ARMA model which best fits the observed time series.

Marcella Corduas

### A collection of applets for visualizing statistical concepts

This paper describes a set of didactic tools for teaching statistics, implemented as Java applets. The tools make it possible to visualize a number of statistical concepts and to experiment with them interactively.

P. Darius, J-P. Ottoy, A. Solomin, O. Thas, B. Raeymaekers, S. Michiels

### Non-parametric regression and density estimation under control of modality

Most available methods in non-parametric regression and density estimation are not directly concerned with modality. New methods are presented that avoid artifacts and yield estimates that have asymptotically the correct modality.

P. L. Davies, A. Kovac

### Multivariate approaches for aggregate time series

The problem of choosing between the direct and the indirect method for the seasonal adjustment of aggregate time series is addressed. Different multivariate approaches are proposed in order to define properly optimised aggregation weights for the component time series. The new weighting systems help in discriminating between the classical methods and may perform better than weights assigned a priori. Applications to real economic series are shown.

Cristina Davino, Vincenzo Esposito

### Some space-time models: an application to NO₂ pollution in an urban area

In this paper a class of product-sum covariance models is introduced, in order to estimate and model realizations of space-time random fields, which are very common in environmental applications. Some constraints on the coefficients of this class of models are given in order to guarantee the positive definiteness condition. An overview of some classes of space-time covariance models and a short comparative study is presented; moreover, an application is considered.

S. De Iaco, D. Posa

### SLICE: generalised software for statistical data editing and imputation

Statistical offices have to face the problem that data collected by surveys or obtained from administrative registers generally contain errors. Another problem they have to face is that values in data sets obtained from these sources may be missing. To handle such errors and missing data efficiently, Statistics Netherlands is currently developing a software package called SLICE (Statistical Localisation, Imputation and Correction of Errors). SLICE will contain several edit and imputation modules. Examples are a module for automatic editing and a module for imputation based on tree models. In this paper I describe SLICE, focusing on the above-mentioned modules.

Ton de Waal

### Improved PCB inspection: computational issues

Process control on Printed Circuit Boards (PCBs) involves measurement of the volumes of a large number of small heaps of solder paste deposit that are used to fix components, such as ICs, on the board. Surface estimation is crucial to their measurement, but a traditional approach based on local polynomials was a major contributor to measurement variation. In this paper we show that the principles pioneered in Cleveland and Grosse can be successfully applied to reduce measurement variation. In addition, we discuss some new computational issues that arise in such an industrial application of local fitting: speeding up computations by reordering the data, fast adaptation, and some spatial design issues.

Dee Denteneer

### Optimization of the antithetic Gibbs sampler for Gaussian Markov random fields

The efficiency of Markov chain Monte Carlo estimation suffers from the autocorrelation of successive iterations, which is typical for this sampling method. In order to improve the efficiency, antithetic methods attempt to reduce this autocorrelation or even introduce negative autocorrelation. In this paper the antithetic method is adapted to Gibbs sampling of the spatial correlation structure of Gaussian Markov random fields, and a rule for the optimal choice of the antithetic parameter is developed. The antithetic Gibbs sampler turns out to perform much better than the classical Gibbs sampler and could compete with i.i.d. sampling, which indeed is usually intractable for this kind of application.

Johannes M. Dreesman

### Computing zonoid trimmed regions of bivariate data sets

In data analysis an important task is to identify sets of points that are central in a data set $$\{x_1, \ldots, x_n\} \subset \mathbb{R}^d$$. A set of points that is central in some sense is called a trimmed region. In the univariate case a trimmed region is, e.g., the closed interval between two proper quantiles. Nolan (1992) and Massé and Theodorescu (1994) introduced concepts of trimmed regions, based on a multivariate analogue of the quantile function, suggested by Tukey (1975) and Eddy (1983). These concepts can be seen as generalizations of the univariate interquantile intervals.

Rainer Dyckerhoff

### Outlier resistant estimators for canonical correlation analysis

Canonical correlation analysis studies associations between two sets of random variables. Its standard computation is based on sample covariance matrices, which are however very sensitive to outlying observations. In this note we introduce, discuss and compare different ways for performing a robust canonical correlation analysis. Two methods are based on robust estimators of covariance matrices, the others on projection-pursuit techniques.

P. Filzmoser, C. Dehon, C. Croux

### Graphical and phase space models for univariate time series

There are various approaches to modelling time series data. In the time domain, ARMA models and state space models are frequently used, while phase space models have also been applied recently. Each approach has its own strengths and weaknesses with respect to parameter estimation, prediction and coping with missing data. We use graphical models to explore and compare the structure of time series models, and focus on interpolation in, e.g., seasonal models.

Roland Fried

### The use of the Tweedie distribution in statistical modelling

This paper discusses the estimation of the parameters of the so-called Tweedie distribution, $$T_p(\mu, \sigma^2)$$. Two special cases are considered, namely the Compound Poisson (1 < p < 2) and the Stable form (p > 2). The former is appropriate for data with a non-zero probability of zero observations and the latter is appropriate for data with a large dispersion. Our models will assume that we have data $$Y_i$$, i = 1,…, N, with differing means $$\mu_i$$ and with common p and $$\sigma^2$$. The $$T_p(\mu_i, \sigma^2)$$ distribution can be characterised by $$\mathrm{Var}(Y_i) = \sigma^2 \mu_i^p$$, i = 1,…, N. In general, we shall model the $$\mu_i$$ in terms of explanatory variates $$x_{ij}$$, i = 1,…, N, j = 1,…, m. We discuss how it is straightforward to construct the maximum likelihood estimates of p, $$\mu_i$$, and $$\sigma^2$$ in a GLM oriented computer package. The Tweedie distribution is used to model the alcohol consumption of British 16 and 17 year olds and randomised quantile residuals are used to validate the modelling.

Robert Gilchrist, Denise Drinkwater

### Predictive dimension: an alternative definition to embedding dimension

In this paper we propose an alternative definition to the embedding dimension that we call predictive dimension. This dimension does not refer to the number of delayed variables needed to characterize the system but to the best predictions that can be obtained for the system. This kind of definition is particularly useful in a forecasting context because it leads to the same value as the traditional embedding dimension for chaotic time series and it is always finite for stochastic ones.

Dominique Guègan, Francesco Lisi

### Post-stratification to correct for nonresponse: classification of ZIP code areas

The weighting procedures presently used for the Dutch Labor Force Survey are suspected of not correcting sufficiently for bias due to nonresponse. In this paper, a new correction procedure is proposed that is based on post-stratification. Starting from small geographical units, i.e., ZIP code areas, homogeneous clusters of individuals are sought. For this purpose a two-stage strategy for cluster-analyzing the large number of ZIP code areas is presented. Because hierarchical clustering procedures cannot be applied to large numbers of objects, initial clusters of objects are sought by categorizing the cluster variables in the first stage of the procedure. In the second stage, these initial clusters are further combined with a hierarchical procedure, resulting in a final classification. The two-stage procedure is applied to classify the ZIP code areas in the Netherlands with respect to socio-economic variables.

Mark Huisman
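The post-stratification idea referred to in the abstract above can be sketched in a few lines: respondents in stratum h are weighted by N_h / n_h, so that the estimated population mean is the sum of N_h times the stratum sample mean, divided by N. The code below is a generic textbook illustration under the assumption that the population counts N_h are known; the function name and the data layout are hypothetical and do not reproduce the clustering procedure of the paper.

```python
from collections import defaultdict


def poststratified_mean(responses, population_counts):
    """Post-stratified estimate of a population mean.

    responses: list of (stratum, value) pairs for the respondents.
    population_counts: dict mapping each stratum to its known population size N_h.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for stratum, value in responses:
        sums[stratum] += value
        counts[stratum] += 1

    total = 0.0
    for stratum, n_pop in population_counts.items():
        if counts[stratum] == 0:
            raise ValueError(f"no respondents in stratum {stratum!r}")
        # N_h times the respondent mean in stratum h
        total += n_pop * sums[stratum] / counts[stratum]
    return total / sum(population_counts.values())
```

If three of four respondents come from stratum A although both strata are equally large in the population, the unweighted mean over-represents A; the post-stratified estimate restores the known population shares.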

### Approximate Bayesian inference for simple mixtures

Exact likelihoods and posterior densities associated with mixture data are computationally complex because of the large number of terms involved, corresponding to the large number of possible ways in which the observations might have evolved from the different components of the mixture. This feature is partially responsible for the need to use an algorithm such as the EM algorithm for calculating maximum likelihood estimates and, in Bayesian analysis, to represent posterior densities by a set of simulated samples generated by Markov chain Monte Carlo; see for instance Diebolt and Robert (1994).

K. Humphreys, D. M. Titterington

### Correlated INAR(1) process

We introduce the concept of a correlated integer-valued autoregressive process of order 1. It is based on equi-correlated binary responses, instead of the independent responses involved in the usual INAR(1) process. Results related to the extended Steutel and van Harn operator are presented, and the correlation structure and additional properties of the resulting processes are shown. A procedure for conditional likelihood estimation of the parameters of the model is proposed. The case of Poisson innovations illustrates the processes considered.

Nikolai Kolev, Delhi Paiva
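
As background to the abstract above, the following is a minimal sketch of the usual INAR(1) process with independent binary responses and Poisson innovations, i.e. the baseline that the paper extends. It does not implement the correlated version. The function names, parameter values and the use of Knuth's method for Poisson draws are illustrative assumptions, not taken from the paper.

```python
import math
import random


def binomial_thinning(alpha, x, rng):
    """Steutel and van Harn thinning: sum of x independent Bernoulli(alpha) draws."""
    return sum(1 for _ in range(x) if rng.random() < alpha)


def poisson(lam, rng):
    """Draw a Poisson(lam) variate with Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def simulate_inar1(alpha, lam, n, seed=0):
    """Simulate X_t = thin(alpha, X_{t-1}) + e_t with e_t ~ Poisson(lam), 0 <= alpha < 1."""
    rng = random.Random(seed)
    # Start from the stationary marginal, which is Poisson(lam / (1 - alpha)).
    x = poisson(lam / (1.0 - alpha), rng)
    path = []
    for _ in range(n):
        x = binomial_thinning(alpha, x, rng) + poisson(lam, rng)
        path.append(x)
    return path


if __name__ == "__main__":
    print(simulate_inar1(alpha=0.5, lam=2.0, n=20))
```

Replacing the independent Bernoulli draws inside `binomial_thinning` with equi-correlated ones is where the correlated process of the paper would depart from this sketch.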

### Confidence regions for stabilized multivariate tests

Stabilized multivariate tests as proposed by Läuter (1996) and Läuter, Glimm and Kropf (1996, 1998) are becoming more and more interesting for clinical research. In this paper we consider tests for the comparison of two independent groups and investigate the corresponding confidence regions by numerical methods. The confidence regions reflect the properties of the tests from another perspective. The construction method is, however, applicable only for small dimensions. It turns out that — due to their involved shape — these confidence regions are difficult to handle.

Siegfried Kropf

### Comparison of stationary time series using distribution-free methods

In this paper we propose distribution-free procedures based on the moving blocks bootstrap for differentiating between two stationary time series that are not necessarily independent. A chi-square type statistic and a Kolmogorov-Smirnov type statistic, each of which is based on the differences between the autocorrelations and the differences between the partial autocorrelations of the two series, are constructed. Monte Carlo studies carried out to assess the tests show that they perform reasonably well. The tests are applied to real financial time series.

Elizabeth Ann Maharaj
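
The resampling step behind the moving blocks bootstrap mentioned in this abstract can be sketched as follows. This shows only the generic block resampling of a single series. The test statistics of the paper are not reproduced, and the function name and block length are illustrative assumptions.

```python
import random


def moving_blocks_resample(series, block_len, rng):
    """Return one bootstrap series built from overlapping blocks of `series`.

    Blocks of `block_len` consecutive observations are drawn with replacement
    and concatenated until the original length is reached, so short-range
    dependence inside each block is preserved.
    """
    n = len(series)
    if not 1 <= block_len <= n:
        raise ValueError("block_len must be between 1 and len(series)")
    n_starts = n - block_len + 1  # number of admissible block start positions
    out = []
    while len(out) < n:
        start = rng.randrange(n_starts)
        out.extend(series[start:start + block_len])
    return out[:n]  # trim the last block if it overshoots


if __name__ == "__main__":
    rng = random.Random(42)
    print(moving_blocks_resample(list(range(12)), block_len=4, rng=rng))
```

A statistic of interest, such as a difference in autocorrelations, would then be recomputed on many such resamples to approximate its sampling distribution.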

### Generation of Boolean classification rules

An algorithm to generate a class of Boolean classification rules is described. The algorithm is implemented in search partition analysis software (SPAN), a program designed to find an optimal binary data partition. The relationship of the procedure to tree-based search procedures is discussed.

Roger J. Marshall

### A statistical package based on Pnuts

We are developing a statistical system named Jasp in order to make use of recent advanced computational environments. The Jasp language is based on Pnuts, a scripting language written in and for Java. Pnuts is a functional language without type declarations and is easy to use for tentative and experimental work. We add tools for statistical analysis and an object-oriented syntax, mainly to bundle related functions. Besides a Jasp language window, the Jasp user interface also has a graphical user interface window that shows the history of the analysis and lets the user operate the system through pop-up menus. These two windows are tightly connected and can be used interchangeably. Jasp is realized with a client/server approach: one client can execute calculations on more than one server and can thus perform distributed computing. Jasp is also able to use programs written in other languages such as C, C++ and Fortran.

Junji Nakano, Takeshi Fujiwara, Yoshikazu Yamamoto, Ikunori Kobayashi

### Generalized regression trees

At present regression trees tend to be accurate; however, they can be incomprehensible to experts. The proposed algorithm, Economic Generalized Regression (EGR), induces regression trees that are more logical and convenient. EGR uses domain knowledge, which contains “is-a” hierarchies and the cost associated with each variable. After generating several subtrees from training examples, EGR selects the best one according to a user-defined balance between accuracy and average classification cost. The user can define the degree of economy and generalization. This information directly influences the quality of the search that the algorithm must undertake.

Marlon Núñez

### Generalized linear mixed models: An improved estimating procedure

The analytically intractable integrated-likelihood in the generalized linear mixed models (GLMM) is approximated in terms of Gauss-Hermite (GH) quadrature. Maximizing the approximated likelihood leads to an improvement over the existing penalized quasi-likelihood (PQL) estimation. The bias caused by the PQL estimation can be eliminated by adding GH quadrature nodes.

Jian-Xin Pan, Robin Thompson
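
To illustrate the kind of approximation this abstract refers to, here is a minimal sketch of Gauss-Hermite quadrature applied to the integrated likelihood of one cluster in a random-intercept logistic model. The model, the function names and the fixed three-node rule are assumptions chosen for brevity. They are not the authors' procedure, which would use more nodes where needed.

```python
import math

SQRT_PI = math.sqrt(math.pi)
# Three-point Gauss-Hermite rule for integrals of f(z) * exp(-z^2).
GH_NODES = (-math.sqrt(1.5), 0.0, math.sqrt(1.5))
GH_WEIGHTS = (SQRT_PI / 6.0, 2.0 * SQRT_PI / 3.0, SQRT_PI / 6.0)


def gh_expectation(g, sigma):
    """Approximate E[g(b)] for b ~ N(0, sigma^2) via the substitution b = sqrt(2)*sigma*z."""
    total = sum(w * g(math.sqrt(2.0) * sigma * z)
                for z, w in zip(GH_NODES, GH_WEIGHTS))
    return total / SQRT_PI


def cluster_likelihood(ys, beta, sigma):
    """Integrated likelihood of binary responses `ys` sharing one random intercept.

    Assumed model: logit P(y = 1 | b) = beta + b, with b ~ N(0, sigma^2).
    """
    def conditional(b):
        p = 1.0 / (1.0 + math.exp(-(beta + b)))
        lik = 1.0
        for y in ys:
            lik *= p if y == 1 else 1.0 - p
        return lik

    return gh_expectation(conditional, sigma)


if __name__ == "__main__":
    print(cluster_likelihood([1, 1, 0, 1], beta=0.3, sigma=1.0))
```

Summing the logarithm of `cluster_likelihood` over clusters gives an approximate log-likelihood that a numerical optimizer could maximize over `beta` and `sigma`.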

### The stochastic dimension in a dynamic GIS

Coping with random fields in a time-dynamic geographic information system (GIS) increases the computational burden and storage requirements by a large amount, and calls for a number of custom functions to enable easy analysis of the resulting random components, as well as specialised output reporting functions. This paper addresses the computational and implementation issues that arise when a Monte Carlo approach is taken, and shows some results from a rainfall-runoff model running within a GIS.

Edzer J. Pebesma, Derek Karssenberg, Kor de Jong

### A robust version of principal factor analysis

Our aim is to construct a factor analysis method that can resist the effect of outliers. We start with a highly robust initial covariance estimator, after which the factors can be obtained from maximum likelihood or from principal factor analysis (PFA). We find that PFA based on the minimum covariance determinant scatter matrix works well. We also derive the influence function of the PFA method. A new type of empirical influence function (EIF) which is very effective for detecting influential data is constructed. If the data set contains fewer cases than variables, we estimate the factor loadings and scores by a robust interlocking regression algorithm.

G. Pison, P. J. Rousseeuw, P. Filzmoser, C. Croux

### TESS: system for automatic seasonal adjustment and forecasting of time series

TESS, SysTEm for Automatic Seasonal Adjustment and Forecasting of Time Series, is an ESPRIT IV project (Number 29.741) that started in January 1999 and finished in June 2000. More information can be found at http://www.esl.jrc.it/tess. The principal objective of the project was the production of a system for automatic seasonal adjustment and forecasting of time series, with special attention to the problem of aggregation. In order to achieve this goal, we enhanced the preexisting software FORCE4/R (1997), based on SEATS and TRAMO (Maravall and Gomez (1992), Gomez and Maravall (1992)), with the methods required to perform the above-stated functions in an optimal way. Furthermore, the system offers a homogeneous interactive environment employing visualisation techniques.

Albert Prat, Victor Gomez, Ignasi Solé, Josep M. Catot

This paper describes a project, which aims at the creation of a database of indicators and models of the Czech economy. Datasets with time series in a form convenient for analysis are prepared and analysis results are published on a website. Information about the statistical computational environment is also provided.

Hana Rezanková, Luboš Marek

### Improving Statistics Canada’s cell suppression software (CONFID)

The need for cell suppression software is discussed. A brief review of the theoretical framework is given. A recent modification to improve the treatment of the common respondent problem in CONFID (software in use at Statistics Canada) is described.

Dale A. Robertson

### The multimedia project MM*STAT for teaching statistics

The multimedia project MM*STAT was developed to have an additional tool for teaching statistics. There are some important facts which influenced this development. First, teaching statistics for students in socio-economic sciences must include a broad spectrum of applications of statistical methods in these fields. A pure theoretical presentation is generally considered by the students to be tedious. Second, in practice no statistical analysis is carried out without a computer. Thus, teaching statistics must include the acquisition of computational capabilities. Third, statistics has become more and more complicated over time, because of increasingly complex data structures, statistical methods and models. Thus, an ever-increasing special knowledge of statistics is required and has to be taught. Fourth, notwithstanding these high demands on teaching statistics, the available lecture time, especially for the introductory courses, has remained constant over the years or has even been cut down.

Bernd Rönz, Marlene Müller, Uwe Ziegenhagen

### Projection pursuit approach to robust canonical correlation analysis

Projection pursuit techniques are used to build new robust estimators for the parameters of the canonical correlation model. A simulation study shows that for non-ideal data these estimators can perform as well as other robust estimators. However, they can have much higher breakdown points. This advantage makes these estimators the right choice for use with real data, where potential outlying observations are very frequent.

M. Rosário de Oliveira, João A. Branco

### A fast algorithm for highly robust regression in data mining

Data mining aims to extract previously unknown patterns or substructures from large databases. In statistics, this is what robust estimation and outlier detection were constructed for, see e.g. Rousseeuw and Leroy (1987). Our goal is to construct algorithms which allow us to compute robust results in a data mining context. Such algorithms thus need to be fast, and able to deal with large data sets.

Peter J. Rousseeuw, Katrien Van Driessen

### Optimal classification trees

Classification and regression trees have traditionally been grown by recursive partitioning, i.e. by a top-down search for “locally optimal” splits. Given the present power of computer hardware, the “local”, or “one-step”, optimization of splits can to some extent be replaced by the full optimization of whole trees. In this paper, two bottom-up optimization algorithms are outlined and initial experimental results are presented.

Petr Savický, Jan Klaschka, Jaromír Antoch

### GAM spline algorithms: a direct comparison

For some time backfitting has been the numerical technique associated with generalized additive models. More recently, alternatives such as relaxed iterative projection or penalized likelihood fitting have emerged. What they have in common is the use of spline methodology and S code to execute them. Here we check these algorithms for performance differences in the S-Plus environment. The main result is that they do not differ much in standard situations but vary considerably under concurvity (near-singularity).

Michael G. Schimek

### Markov Chain Monte Carlo methods for handling missing covariates in longitudinal mixed models

Handling missing covariates in longitudinal mixed effect models is demonstrated on a medical example.

Ernst Schuster

### Robust Bayesian classification

This paper describes a method for learning robust Bayesian classification rules from incomplete data by producing interval-based classification. We provide two scoring methods to perform interval-based classification and a decision theoretic approach to choose the scoring method best suiting the problem at hand.

Paola Sebastiani, Marco Ramoni

### Social science measurement by means of item response models

The basic ideas of measurement in the social and behavioral sciences are explained, followed by a discussion of item response theory, which supplies the family of modern statistical measurement methods. Four specialized topics in item response modeling that are at the core of present-day research in item response theory are discussed.

Klaas Sijtsma

### Multivariate DLMs for forecasting financial time series, with application to the management of portfolios

This paper considers a Bayesian approach to the multivariate forecasting of financial time series based on dynamic linear models (DLMs). It is shown how a marginal posterior forecast distribution may be simulated, and how this may be used directly in order to implement a fully Bayesian decision-theoretic approach to the selection of optimal stock portfolios. This is briefly compared with more traditional approaches to portfolio selection.

Andrew Simpson, Darren J. Wilkinson

### An algorithm for the multivariate Tukey median

The halfspace location depth of a point θ relative to a data set $$X_n$$ is defined as the smallest number of observations in any closed halfspace with boundary through θ. As such, halfspace depth can be seen as a kind of multivariate ranking. The deepest location, i.e. the θ with maximal halfspace depth, is a multivariate generalization of the median. Until now the deepest location could only be computed for bivariate data. In this paper, we construct an algorithm (called DEEPLOC) to approximate the deepest location in higher dimensions.

Anja Struyf, Peter J. Rousseeuw
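
The definition of halfspace depth given in this abstract can be made concrete for bivariate data with a brute-force sketch. This is not DEEPLOC and not an efficient algorithm. It simply evaluates one direction inside every arc between critical angles, and the function name and tolerance are illustrative assumptions.

```python
import math


def halfspace_depth_2d(theta, points, tol=1e-9):
    """Smallest number of points in any closed halfplane whose boundary passes through theta.

    The count of points on one side of a line through theta only changes when
    the line passes through a data point, so it suffices to test one normal
    direction in each arc between those critical angles. Runs in O(n^2) time.
    """
    tx, ty = theta
    diffs = [(x - tx, y - ty) for x, y in points]
    at_theta = sum(1 for d in diffs if d == (0.0, 0.0))  # always on the boundary
    others = [d for d in diffs if d != (0.0, 0.0)]
    if not others:
        return at_theta
    two_pi = 2.0 * math.pi
    # Normal directions for which some data point lies exactly on the boundary.
    crit = sorted({(math.atan2(dy, dx) + s * math.pi / 2.0) % two_pi
                   for dx, dy in others for s in (1, -1)})
    best = len(others)
    for i, a in enumerate(crit):
        b = crit[(i + 1) % len(crit)]
        if b <= a:
            b += two_pi  # wrap around the circle
        if b - a < tol:
            continue  # numerically duplicate critical angles, no arc in between
        mid = (a + b) / 2.0
        ux, uy = math.cos(mid), math.sin(mid)
        count = sum(1 for dx, dy in others if ux * dx + uy * dy >= 0.0)
        best = min(best, count)
    return best + at_theta


if __name__ == "__main__":
    square = [(1.0, 1.0), (1.0, -1.0), (-1.0, 1.0), (-1.0, -1.0)]
    print(halfspace_depth_2d((0.0, 0.0), square))
```

Maximizing this depth over θ would give the deepest location. Doing so efficiently, and in more than two dimensions, is the problem the paper addresses.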

### Analyzing and synthesizing information from a multiple-study database

An important public-health issue receiving much discussion in the literature today concerns whether there is an ideal relative weight associated with minimum mortality for all populations. Standards developed by WHO and U.S. organizations propose that ideal weights be determined by Body Mass Index, a single measure incorporating weight and height ($$\mathrm{BMI} = \frac{\mathrm{wt}\,(\mathrm{kg})}{\mathrm{ht}\,(\mathrm{m})^{2}}$$). They recommend BMIs of ≤ 25 for all. To better understand the relationship between BMI and mortality in diverse populations and to discern whether a single BMI of minimum mortality is appropriate, we examine person-level data from 18 studies.
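
As a small worked example of the BMI formula quoted in this abstract, the sketch below computes BMI and checks it against the recommended threshold of 25. The function names and the sample weight and height are hypothetical.

```python
def bmi(weight_kg, height_m):
    """Body Mass Index: weight in kilograms divided by squared height in metres."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


def within_recommendation(weight_kg, height_m, threshold=25.0):
    """True if the BMI does not exceed the threshold quoted in the abstract."""
    return bmi(weight_kg, height_m) <= threshold


if __name__ == "__main__":
    # Hypothetical person: 70 kg and 1.75 m tall.
    print(round(bmi(70.0, 1.75), 2), within_recommendation(70.0, 1.75))
```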

### Bootstrapping neural discriminant model

In recent years, neural computing techniques have received increasing attention from statisticians. Statistical techniques are formulated in terms of the principle of the likelihood of neural network models, where the connection weights of the network are treated as unknown parameters in classification problems. Some comparisons with standard statistical techniques are included.

Masaaki Tsujitani, Takashi Koshimizu

### An improved algorithm for robust PCA

In Croux and Ruiz (1996) a robust principal component algorithm is presented. It is based on projection pursuit to ensure that it can be applied to high-dimensional data. We note that this algorithm has a problem of numerical stability and we develop an improved version. To reduce the computation time we then propose a two-step algorithm. The new algorithm is illustrated on a real data set from chemometrics.

Sabine Verboven, Peter J. Rousseeuw, Mia Hubert

### Applying techniques of dynamic programming to sequential mastery testing

The purpose of this paper is to derive optimal rules for sequential mastery testing, that is, deciding on mastery, nonmastery, or to continue testing and administering another random item. The framework of minimax sequential decision theory is used; that is, optimal rules are obtained by minimizing the maximum expected losses associated with all possible decision rules at each stage of testing.

Hans J. Vos

### The introduction of formal structure into the processing of statistical summary data

This paper reviews the technologies required for the production and publication of large volumes of statistical information, such as produced by National Statistical Offices (NSOs). We review both academic and commercial developments, and present these in the context of an analysis of the tools and structures needed.

Andrew Westlake

### Validation of association rules by interactive mosaic plots

Association rules were proposed by Agrawal et al. (1993) in the context of market basket analysis. They were invented to provide an automated process that could find previously unknown connections among items, especially to answer questions like: “which items are likely to be bought together?”. Typically, the data to be examined consist of customer purchases, i.e. a set of items bought by a customer over a period of time. The standard way of storing such data is the following: to be able to identify each customer, the transactions are stored with unique numbers, the transaction identification (TID). Besides that, we have a set of different items, the so-called itemset $$\mathcal{I} = \{ {i_1},{i_2}, \ldots,{i_m}\}$$. The data or database D is a set of purchases (transactions), where each transaction T includes a set of items, such that $$T \subset \mathcal{I}$$. A transaction T is said to contain a set of items X if X is a subset of T.
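
The data layout described in this abstract, transactions keyed by TID and the notion of a transaction containing an itemset, can be sketched in a few lines. The support and confidence measures below follow their common definitions for association rules and are not spelled out in the abstract. The item names and the small database are hypothetical.

```python
def contains(transaction, itemset):
    """A transaction T contains X if X is a subset of T."""
    return set(itemset) <= set(transaction)


def support(database, itemset):
    """Fraction of transactions in the database that contain the itemset."""
    hits = sum(1 for items in database.values() if contains(items, itemset))
    return hits / len(database)


def confidence(database, lhs, rhs):
    """Support of (lhs and rhs together) divided by the support of lhs."""
    base = support(database, lhs)
    if base == 0:
        raise ValueError("left-hand side never occurs in the database")
    return support(database, set(lhs) | set(rhs)) / base


if __name__ == "__main__":
    # Hypothetical database D: TID -> set of purchased items.
    db = {
        1: {"bread", "butter"},
        2: {"bread"},
        3: {"bread", "butter", "milk"},
        4: {"milk"},
    }
    print(support(db, {"bread", "butter"}), confidence(db, {"bread"}, {"butter"}))
```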

### Optimality models for PRAM

The paper deals with the application of a technique to protect microdata against disclosure. This technique is called the Post Randomization Method (PRAM), and it works through a random perturbation of categorical variables, identifiers in fact, in a microdata set. In this paper two optimization models are formulated to arrive at nonsingular PRAM matrices, which are stochastic (Markov) matrices. The models presented are Nonlinear Programming (NP) models. The first model turns out to be equivalent to a Linear Programming (LP) problem. The second model, which is close to a geometric programming model, can be reformulated as a nonlinear optimization problem under linear constraints.

Leon Willenborg
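
The perturbation step of PRAM described in this abstract can be sketched as follows: each recorded category is replaced by a random draw from the corresponding row of a Markov matrix. The optimization models of the paper, which choose that matrix, are not reproduced. The function name, categories and matrix entries are hypothetical.

```python
import random


def apply_pram(values, categories, matrix, rng):
    """Perturb categorical `values` using a row-stochastic transition matrix.

    matrix[i][j] is the probability that category i is published as category j.
    """
    for row in matrix:
        if len(row) != len(categories) or abs(sum(row) - 1.0) > 1e-9:
            raise ValueError("each row must be a probability distribution over the categories")
    index = {c: i for i, c in enumerate(categories)}
    return [rng.choices(categories, weights=matrix[index[v]])[0] for v in values]


if __name__ == "__main__":
    cats = ["urban", "suburban", "rural"]
    # Hypothetical PRAM matrix: mostly keep the true category.
    pram = [
        [0.90, 0.05, 0.05],
        [0.05, 0.90, 0.05],
        [0.05, 0.05, 0.90],
    ]
    rng = random.Random(7)
    print(apply_pram(["urban", "rural", "rural", "suburban"], cats, pram, rng))
```

With the identity matrix the data are left unchanged, while matrices closer to uniform give more protection at the cost of more information loss. This is the trade-off an optimality model has to balance.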

### Dealing with real ordinal data: recent advances in analyzing tied, censored, and multivariate observations

When ordinal data are inexact (tied, censored, multivariate), it is demonstrated that both asymptotic and “exact” rank tests (including the sign test) may yield liberal results. It is demonstrated that using the unconditional variance (without “correction for ties”) for inexact data ensures that the probability of a “significant” result does not exceed the level. The proposed approach is based on the marginal likelihood principle. It is easily extended to multivariate ordinal data, allows for adjustment for confounding, is computationally feasible, and suggests criteria for the design of user interfaces.

Knut M. Wittkowski

### Testing for differences in location: a comparison of bootstrap methods in the small sample case

We compare several bootstrap methods for testing location differences in the two-sample case for small and moderate sample sizes. We concentrate especially on the empirical shape of the underlying samples in order to allow a simple and obvious empirical identification of the properties (and limitations) of the method used. Our results show that there is an urgent need for more detailed empirical investigations of bootstrap properties in the finite sample case.

Karin Wolf-Ostermann

### Two principal points for location mixtures

Two principal points for location mixtures of symmetric or spherically symmetric distributions with equal proportions are investigated in this paper. In the univariate case, a sufficient condition on density functions is given for uniqueness. In the multivariate case, we give a lemma which enables us to compare candidates for two principal points geometrically and to restrict the region in which to search for principal points. With this lemma, a subspace theorem is proved, which states that there exist two principal points in the linear subspace spanned by the component means. Further, a sufficient condition for uniqueness of two principal points is given for two-component cases.

Wataru Yamamoto, Nobuo Shinozaki

### Vector splines and other vector smoothers

Vector smoothers are nonparametric regression methods for smoothing a vector response y against a scalar x. Some theory and software details for two popular classes of vector smoothers are presented—one is based on splines and the other on local regression.

Thomas W. Yee

### Backmatter
