
## About this book

When dealing with the design or the application of any technical system that is not simple and trivial, one faces the problem of determining the allowable deviations of the system functions and the optimal vector of system parameter tolerances. The need to solve this problem is driven by various serious economic and maintenance considerations, among which the aims of minimal production cost and maximal operational reliability are the most frequent. Suppose that we are dealing with a system S consisting of N components represented by the system parameters $x_i$, $i = 1, 2, \ldots, N$, which are arranged in a certain structure so that the K system functions $F_k$, $k = 1, 2, \ldots, K$, expressing the considered system properties, fulfil the condition $$F - F_0 \le \Delta F, \quad (1)$$ where $F = \{F_k\}_K$ is the set of the actual system functions, $F_0 = \{F_{0k}\}_K$ is the set of the nominal system functions and $\Delta F = \{\Delta F_k\}_K$ is the set of the allowable system function deviations. Besides the system structure, the set F also depends on the vector $X = \{x_i\}_N$ of the system parameters. Suppose that the system structure is invariant.

## Table of contents

### Information Science and Statistics

The interface between information science and statistics is more often implicit than explicit. It is true that the rapid growth of computer technology and a better understanding of how to handle large amounts of data have shaped the development of statistics. It is equally true, however, that the way computer technology has influenced the development of statistics has seldom been the subject of scientific study by statisticians. It is of interest, therefore, to discuss the extent to which the development of computer technology has influenced the de facto use of statistical methods.

E. B. Andersen

### Catastrophe Theory as a Tool for Statistical Analysis of Systems

In contemporary technology, science and industry, ever more complicated and expensive systems are used. Their increasing complexity naturally places ever harder requirements on system analysis and optimized synthesis. This leads to the need to improve the conventional procedure of system synthesis by determining the allowable deviations and tolerances of the system parameters and system functions. Because the deterministic approach to this problem is not suitable for many systems, methods of statistical analysis of system tolerances are used, often with very good results. However, there exists an interesting group of systems, especially systems with multivaluedness and hysteresis, where the conventional statistical methods based on the Monte Carlo approach are not advantageous. This contribution shows how, in such cases, the problem of tolerance analysis can be solved by modelling based on some results from catastrophe theory.

M. Novàk

### New Procedures for Generating Optimal Experimental Designs

The procedure CDOP for generating D-optimal experimental designs is presented. An empirical comparison of CDOP with some existing algorithms for the computer generation of exact D-optimal designs is carried out. The algorithms considered include those due to Fedorov, Wynn, Mitchell, and Johnson and Nachtsheim. It is shown that efficient designs can be obtained with CDOP on convex design spaces.

H. A. Yonchev

### Some Guidelines for Principal Component Analysis

Principal Component Analysis is most often used as a tool of Exploratory Data Analysis to generate graphical displays. The fixed effect model is then a convenient framework in which to set up a checklist of questions to be addressed to the user, in order to help him perform the analysis as suitably as possible with respect to his goals and his data. The paper attempts to develop this point of view by integrating several previous discussions into this framework and giving some new developments.

P. Besse, H. Caussinus, L. Ferre, J. Fine

### Comparison of Least Squares with Least Absolute Deviation Forecasting Using Simulation Techniques

Linear least squares forecasting is used on two data sets and these resulting equations are compared, along with their larger total absolute deviations, to regular least absolute deviation (LAD) equation fits done with multi stage Monte Carlo optimization (MSMCO). Then two nonlinear models are fitted using LAD with MSMCO. One is a regular LAD fit and the other uses mini max LAD curve fitting.
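To make the comparison concrete, here is a minimal sketch of fitting a straight line by least squares and then by least absolute deviation using a staged random search. The staged search (sample around the current best, then shrink the sampling box) is only an assumption about the general flavour of multi stage Monte Carlo optimization; the function names, the data and all tuning parameters are hypothetical and are not taken from the paper.

```python
import random


def least_squares_line(xs, ys):
    """Closed-form least squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def total_abs_dev(params, xs, ys):
    """Total absolute deviation of the line (a, b) from the data."""
    a, b = params
    return sum(abs(y - (a + b * x)) for x, y in zip(xs, ys))


def staged_random_search_lad(xs, ys, stages=8, draws=2000, width=5.0, seed=0):
    """Hypothetical staged random search for a LAD line.

    Starts from the least squares fit, samples candidates uniformly in a
    box around the current best, and halves the box after every stage.
    """
    rng = random.Random(seed)
    best = least_squares_line(xs, ys)
    best_cost = total_abs_dev(best, xs, ys)
    for _stage in range(stages):
        centre = best  # the box is recentred once per stage
        for _draw in range(draws):
            cand = (centre[0] + rng.uniform(-width, width),
                    centre[1] + rng.uniform(-width, width))
            cost = total_abs_dev(cand, xs, ys)
            if cost < best_cost:
                best, best_cost = cand, cost
        width /= 2.0
    return best, best_cost


# Illustrative data (made up): the last observation is an outlier.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 30.0]

ls_fit = least_squares_line(xs, ys)
lad_fit, lad_cost = staged_random_search_lad(xs, ys)
ls_cost = total_abs_dev(ls_fit, xs, ys)
```

Because the search starts from the least squares line and only accepts improvements, `lad_cost` can never exceed `ls_cost`, which mirrors the "larger total absolute deviations" of the least squares equations mentioned in the abstract.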

W. C. Conley

### Modeling the Reliability of Fault-Tolerant Software Systems

Software Reliability: With the advent of highly complex computer programs whose correct execution is essential to the proper functioning of a critical system, the concept of reliability has been extended to software. This is necessary for complex software because it is impossible to verify that it will execute correctly under all conceivable inputs. By testing a software product extensively and attempting to correct the errors that are discovered, confidence that it will execute correctly in a given situation is increased. The “time to failure” between corrections (measured, for example, in numbers of executions or CPU cycles) may be used to gauge the increasing reliability of the product. Indeed, contracts for the purchase of custom software frequently include specifications on the minimum allowable mean time to failure.

T. M. Gerig, J. R. Cook

### Computerized Exploratory Screening of Large-Dimensional Contingency Tables

This paper is concerned with programs and strategies for the initial analysis of large-dimensional contingency tables, i.e. contingency tables which include so many categorical variables that it is not technically possible to undertake a unified statistical analysis of the full table.

S. Kreiner

### A Fast Algorithm for Some Exploratory Methods in Categorical Data Problems

Two methods for categorical data problems, a discriminant analysis procedure and a procedure for the reduction of dimensionality, can each be summarized as an attempt to give a good forecast from an individual's categories on a subset of the variables: for the category of a fixed variable in the first case, and for the categories of the remaining variables in the second. The simplest algorithm checks all subsets that are to be considered and chooses the best one. This needs several re-orderings of the observed frequencies. In this paper an eager and fast algorithm is proposed that runs through the cells of the contingency table in descending order of their observed frequencies and puts them into a set as long as they can be correctly forecast together on the basis of one of the subsets of the variables considered. Such a maximal subset of cells defines a subset of the variables that is optimal in some cases and is at least ‘not very bad’ otherwise.

T. Rudas

### Easy to Generate Metrics for Use with Sampled Functions

Generalized PCA for use with sampled functions using metrics designed to “filter” known variation in the data so as to uncover the subtle kinds of variation is discussed. Examples are presented.

S. Winsberg, J. Kruskal

### Sequential Inference Processes in Computational Aspects

Since Wald’s stimulating work [11], a large body of literature on several kinds of sequential plans has appeared, although deriving their mathematical properties and evaluating them numerically remains difficult and complicated.

C. Asano, K. Kurihara, Z. Geng

### Variance Components as a Method for Routine Regression Analysis of Survey Data

We discuss a general modelling framework for variance component analysis of data from hierarchical structures. Areas of application include large scale surveys, small area statistics, longitudinal analysis, multivariate data and repeated measurements and experiments. The framework is also applicable for variance heterogeneity modelling.

N. Longford

### On the Assessment of Case Influence in Generalized Linear Models

This paper deals with likelihood distances as proposed by Cook and Weisberg (1982) as a general approach to the assessment of case influence. Approximate solutions are given in the framework of generalized linear models. An example demonstrates their implementation into GLIM.

G. U. H. Seeber

### Algorithmic Development in Variable Selection Procedures

A robust procedure for variable selection in the linear model, closely related to the all-possible-models approach, is described. The method combines the idea of α-acceptability with the principle of coherence and uses the class of γ-trimmed least squares estimators as the basis for evaluating the estimates.

J. Antoch

### Linear Models of Categorical Variables

Linear models of factors play an important role in current data analytic practice. The computational aspects of these models are fairly well established, given the numerous statistical packages that treat this class of models, but the literature on this subject is scarce. Yet the subject is non trivial and has several surprising and elegant results. This paper surveys the main results, including some generalizations.

D. Denteneer

### A Fast Algorithm for Curve Fitting

In this article I propose a new algorithm for nonparametric curve estimation using the Fast Fourier Transform. The proposed method is extremely useful when the curve must be calculated on a grid of points, for example in order to plot the estimate. Applications to other curve estimation methods are discussed.

A. A. Georgiev

### Generalized Multiplicative Models

An algorithm is described for the Maximum Likelihood fitting of models where $y_k$ has a distribution from the exponential family and $\mu_k = E(y_k)$ has the form $$\mu_k = \prod_{j=1}^{m} \mu_k^{(j)}, \quad \mu_k^{(j)} = h^{(j)}\left(\eta_k^{(j)}\right), \quad \eta_k^{(j)} = \sum_i \beta_i^{(j)} x_{ki}^{(j)}.$$ The proposed algorithm is easily implemented in GLIM for particular models. The method is illustrated by examples in time series analysis, analysis of survival data and probit analysis.
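As a small illustration of the model structure only (not of the fitting algorithm, which the abstract does not spell out), the mean of one observation can be evaluated as a product of factors, each factor being a function h applied to its own linear predictor. The function name `multiplicative_mean` and the example values are hypothetical.

```python
import math


def multiplicative_mean(betas, xs, inverse_links):
    """Evaluate the mean of one observation as a product of m factors.

    betas[j]         : coefficient list of factor j
    xs[j]            : covariate list of factor j for this observation
    inverse_links[j] : function h applied to the linear predictor of factor j
    """
    mu = 1.0
    for beta_j, x_j, h_j in zip(betas, xs, inverse_links):
        # Linear predictor of factor j: sum over i of beta_i * x_i.
        eta_j = sum(b * x for b, x in zip(beta_j, x_j))
        mu *= h_j(eta_j)
    return mu


# Two factors: an exponential factor and an identity factor (made-up values).
betas = [[0.0, 1.0], [2.0]]
xs = [[1.0, 0.0], [3.0]]
inverse_links = [math.exp, lambda eta: eta]

mu_k = multiplicative_mean(betas, xs, inverse_links)  # exp(0) * 6
```

With a single factor (m = 1) this reduces to the mean function of an ordinary generalized linear model, which is why such models fit naturally into a GLIM-style setting.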

M. Green

### A Convenient Way of Computing ML-Estimates: Use of Automatic Differentiation

A standard Pascal program system for the maximization of log-likelihood functions involving several regressor variables is presented. The user must supply a data matrix, e.g. the x- and y-variables of a linear model. Only FORTRAN-like characterizations of the densities must be available. Gradients and Hessians needed for quadratically convergent optimization algorithms are computed by automatic differentiation. There is a wide area of application, e.g. generalized linear models, quadratic logistic discriminant analysis, nonlinear curve fitting and regression models with time series errors.

C. Kredler, W. Kowarschick

### On the Numerical Solutions of Bounded Influence Regression Problems

Several iterative procedures for the numerical solution of the robust multiple regression problem have been proposed in Huber (1972), Huber and Dutter (1974) and Dutter (1977a). According to Dutter (1977b), the H-algorithm distinguished itself by its programming simplicity and rapid computation. The H-algorithm was adapted to the problem of bounded influence regression by Marazzi (1980) and introduced in ROBETH. A similar adaptation was also described by Samarov and Welsch (1982) and introduced in TROLL. Experience shows that when highly influential leverage points are present, the convergence of the H-algorithm is very poor and the reliability of the result is doubtful. In this paper, alternative procedures with better convergence are discussed. These procedures have been implemented in ROBSYS, a conversational package for robust statistical computing (see Marazzi and Randriamiharisoa (1985)).

A. Marazzi

### Testing of Statistical Algorithms and Programs with the Help of the Multivariate Sample, Having Described Values of Empirical Parameters

The construction of multivariate samples having given values of the sample marginal moments and a given sample correlation matrix is described. Such “exact samples” are useful for testing the validity and correctness of the algorithms and programs of multivariate statistical analysis. For one family of test distributions, the family with constant correlation matrix, the exact values of the parameters of classical multivariate procedures (regression, factor, component and canonical analyses) are given as functions of the given correlation and dimensionality.

E.-M. Tiit

### Analysis of Three-Way Data Matrices Based on Pairwise Relation Measures

The basic types of 3-way data matrices are described. The special case of one or more sets of qualitative characters observed on one or more groups of individuals is then examined. Several exploratory techniques of analysis are suggested, based on appropriately defined measures of relationship between individuals or between categories of the investigated characters.

R. Coppi

### Factor Analysis of Evolution and Cluster Methods on Trajectories

In this paper, we study the evolution of objects indexed by i (i∈I) and associated with the lines of a two-way contingency table, observed at several times. At each time t (t∈T), the j-th column of the slice {nijt, i∈J} is associated with the j-th category of one or several qualitative variables. The adaptation of the Factor Analysis of Evolutions (FAE) to such a cube of data, proposed in (5) and (6), produces factors, correlation circles and graphical displays of trajectories in factorial planes. Here, after a brief overview of FAE, we propose a hierarchical agglomerative clustering (HAC) method, which is complementary to FAE. Thus the results of the HAC can be represented by means of classes of trajectories in the factorial planes found independently using FAE. This allows us to see the agreement between the results of the two methods, and enhances the interpretation of both. In this way, the difficulty of interpreting a large number of trajectories in the factorial planes is overcome by using the HAC method. We apply this classification procedure to the example already analysed in (5) and (6).

A. Carlier

### Links Between Clustering and Assignment Procedures

In the situation in which a set of objects is summarized in terms of a smaller number of groups of similar objects, links are noted between two definitions of the word ‘classification’: obtaining groups of similar objects, and assigning objects to one of a set of existing groups. Some problems on which further work could be undertaken are described.

A. D. Gordon

### Identification of Linear Regression Models by a Clustering Algorithm

This paper is a contribution to the problem of representing clusters by a set of regression models. A new divisive clustering algorithm, referred to as CLUREG, is presented. The algorithm belongs to the class of exchange methods, is able to deal with large sets of units and is such that each cluster of units is “well represented” by a specific regression model. The quality of the representation is measured by the sum of squares of the errors of the fitted regression model. CLUREG first clusters the variables optimally by the k-median algorithm. Then, after an initial partition of the units based on the clusters of the variables, it reallocates the units by optimizing the overall sum of squares of errors in the clusters. The computational complexity is a linear function of the number of units and a nonlinear function of the number of variables.

B. Baldessari, A. Bellacicco

### Validity Tests in Cluster Analysis Using a Probabilistic Teacher Algorithm

It is well known that widely used clustering algorithms are related with the Gaussian mixture problem. Now, we have developed a probabilistic teacher algorithm for the mixture problem. In this paper, we develop validity tests in cluster analysis using the properties of this algorithm and we present some applications of these tests.

G. Celeux

### New Kinds of Graphical Representation in Clustering

Having chosen a dissimilarity index, hierarchical clustering is one of the most popular techniques used to represent clusters. In fact, the fit between the graphical representation of hierarchies (i.e. dendrograms) and the data is very good only when the following two conditions are satisfied:

1. The dissimilarity index satisfies the “ultrametric inequality”.
2. There exists an order which is “compatible” with the dissimilarity index.

This paper is devoted to new kinds of graphical representation when those conditions are not satisfied. When the first condition is not satisfied at all, we introduce a complementary notion of ultrametric called “ultramine”. When the second condition is not satisfied at all, a new dissimilarity index called “anti-pyramidal”, which is an extension of ultramine, is introduced. Graphical representations of those two notions are given. We give a measure which indicates how much the data are ultrametric or ultramine, and order compatible or not, for the chosen dissimilarity index. Each of the new graphical representations sheds new and interesting light on the data.
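The first condition is easy to test on a small dissimilarity matrix. The sketch below checks the standard ultrametric inequality, d(i,j) ≤ max(d(i,k), d(k,j)) for every triple of distinct objects; the function name and the two example matrices are hypothetical, and the check is not the measure of ultrametricity proposed in the paper.

```python
from itertools import permutations


def is_ultrametric(d, tol=1e-12):
    """Return True if d[i][j] <= max(d[i][k], d[k][j]) for all distinct i, j, k."""
    n = len(d)
    return all(
        d[i][j] <= max(d[i][k], d[k][j]) + tol
        for i, j, k in permutations(range(n), 3)
    )


# Objects 0 and 1 merge at level 1, then join object 2 at level 2.
d_tree = [
    [0, 1, 2],
    [1, 0, 2],
    [2, 2, 0],
]

# Three points on a line: d(0,2) = 3 exceeds max(d(0,1), d(1,2)) = 1.
d_line = [
    [0, 1, 3],
    [1, 0, 1],
    [3, 1, 0],
]

tree_ok = is_ultrametric(d_tree)
line_ok = is_ultrametric(d_line)
```

Here `d_tree` can be drawn exactly as a dendrogram, whereas `d_line` cannot, which is the kind of situation the alternative representations are meant to address.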

E. Diday

### Projection on an Acute Symmetrical Simplicial Closed Convex Cone and its Application to Star Graphs

Additive trees extend hierarchical representations, and a one-to-one correspondence has been set up between these trees and quadripolar semi-distances, i.e. semi-distances satisfying the four-point condition. Among these dissimilarities, star semi-distances are noteworthy. They satisfy $d(i,j) = a_i + a_j$ for all distinct units i and j, and yield a special representation, called a star graph.

After exploring some geometrical properties of star semi-distances, the least squares approximation of a dissimilarity by such a semi-distance is examined. It corresponds to a least squares problem with positivity constraints. This problem is shown to be related to the projection of a vector x on a very special cone, the closed convex hull of independent vectors which have the same acute angle.

It is proved that a coordinate of the solution is null when the associated coordinate of x is non-positive. A descent algorithm then follows, and at each step a very simple analytic expression yields the projection on the corresponding face. Moreover, when the coordinates of x are ranked in increasing order, the suitable face for the solution is obtained directly. This gives a second algorithm.

Finally, some specific properties of the least squares star semi-distance are discussed.
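A short numerical check shows why star semi-distances are a special case of quadripolar ones. For four distinct units, the four-point condition says that the two largest of the three sums d(i,j)+d(k,l), d(i,k)+d(j,l) and d(i,l)+d(j,k) are equal; for a star semi-distance all three sums are equal. The function names and example values below are hypothetical, and only quadruples of distinct units are tested.

```python
from itertools import combinations


def star_distance(a):
    """Build the star semi-distance d(i, j) = a[i] + a[j] for i != j, 0 on the diagonal."""
    n = len(a)
    return [[0.0 if i == j else a[i] + a[j] for j in range(n)] for i in range(n)]


def satisfies_four_point(d, tol=1e-9):
    """Check the four-point condition on all quadruples of distinct units."""
    n = len(d)
    for i, j, k, l in combinations(range(n), 4):
        sums = sorted([
            d[i][j] + d[k][l],
            d[i][k] + d[j][l],
            d[i][l] + d[j][k],
        ])
        # The two largest sums must coincide.
        if sums[2] - sums[1] > tol:
            return False
    return True


# Nonnegative "branch lengths" (made up) and the resulting star semi-distance.
d_star = star_distance([1.0, 2.0, 0.5, 3.0])

# A dissimilarity that violates the condition: one sum is 10, the others are 2.
d_bad = [
    [0, 5, 1, 1],
    [5, 0, 1, 1],
    [1, 1, 0, 5],
    [1, 1, 5, 0],
]

star_ok = satisfies_four_point(d_star)
bad_ok = satisfies_four_point(d_bad)
```

The least squares problem discussed in the abstract goes in the opposite direction: given an arbitrary dissimilarity such as `d_bad`, find nonnegative values a_i whose star semi-distance is closest to it.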

B. Fichet

### Design of a Command Language for Interactive or Batch Use in a Statistical Package

The paper describes a new command language for the statistical package Clustan which is especially suitable for interactive use from a computer terminal or PC workstation. It features immediate error correction, flexible help (and more help) facilities, defaults for unspecified parameters, verification before execution, simulation of execution and simultaneous storage of commands as a program. The interactive method of operation is fully compatible with batch operation by a Clustan program.

J. Henderson, D. Wishart

### Cross Association Measures and Optimal Clustering

The purpose of this short paper is to show how the concept of ‘Paired Comparisons’ allows us to linearize most of the cross association criteria on contingency tables in statistics. In the classical statistical approach to cross association for categorical variables (qualitative nominal variables), these measures derived from contingency tables amount to ‘at least’ quadratic mathematical formulations. The possibility of linearizing some of these criteria allows us to transform ‘classification problems’ into Linear Programming ones without any hypothesis fixing the cluster sizes. The ‘paired comparisons principle’ plays a prominent part in this paper. Let us mention here that the first mathematician who spoke about ‘paired comparisons’ was A. Condorcet in 1785, in order to solve the difficult problem of ‘Collective Choice’ in preference aggregation. We shall not discuss once more the axiomatics of Condorcet’s criterion, widely discussed in [15], nor the solution of the associated aggregation problems, for which details can be found in [14]. We shall focus only on the ‘central’ part of the ‘paired comparisons principle’ in a ‘broad sense’, thus extending this concept beyond similarity aggregation to a much larger theory called “Maximal Association”. In the paired comparisons approach, we compare the partitions induced by two categorical variables, denoted Y and C, pair of objects by pair of objects, through the quantities $y_{ij}$ and $c_{ij}$; starting from n objects, this requires data tables of size $n^2$. In the statistical approach, by contrast, one builds a contingency table crossing the two categorical variables Y and C. The entry $n_{uv}$ of this table is the number of objects having simultaneously the modality u of C and the modality v of Y. These tables are therefore of size p × q, where p is the number of modalities of C and q that of Y.
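The two representations can be compared on a toy example. The sketch below assumes the usual binary coding, in which the pairwise entry for objects i and j is 1 when they share a modality and 0 otherwise; that coding, the function names and the labels are assumptions for illustration, not details given in the abstract. Under this coding the sum of squared contingency cells, a quadratic expression in the cell counts, equals a sum of products of pairwise entries, which is linear in the entries of one table once the other is fixed.

```python
def same_class_matrix(labels):
    """n x n table with entry 1 when objects i and j share a modality, else 0."""
    return [[1 if a == b else 0 for b in labels] for a in labels]


def contingency_table(c_labels, y_labels):
    """Cell counts indexed by (modality of C, modality of Y)."""
    table = {}
    for u, v in zip(c_labels, y_labels):
        table[(u, v)] = table.get((u, v), 0) + 1
    return table


# Five objects described by two categorical variables (made-up labels).
c_labels = ["a", "a", "b", "b", "b"]
y_labels = ["x", "y", "y", "y", "x"]

n = len(c_labels)
c = same_class_matrix(c_labels)
y = same_class_matrix(y_labels)
table = contingency_table(c_labels, y_labels)

# Quadratic in the contingency cells ...
sum_of_squared_cells = sum(count * count for count in table.values())
# ... but a plain sum over the n * n pairwise entries.
sum_over_pairs = sum(c[i][j] * y[i][j] for i in range(n) for j in range(n))
```

Both quantities count the ordered pairs of objects (including each object paired with itself) that agree on C and on Y, so they coincide for any data; the price of the pairwise form is the n × n table size noted in the abstract.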

F. Marcotorchino

### Multivariate Data Analysis, Contributions and Shortcomings of Robustness in Practice

We advocate the use of weighted-mean vectors and weighted-covariance matrices in order to perform analyses with the ordinary linear algebra techniques. The weights are allocated in placing the emphasis either on the essential data part or on the outlying observations.

W. J. J. Rey

### On the Use of Bootstrap and Jackknife in Covariance Structure Analysis

Applications of bootstrap and jackknife for covariance structure analysis under violation of standard maximum likelihood assumptions (small sample size, analysis of correlation matrices, and nonnormality) are discussed. Procedures are illustrated for a factor analysis model, using LISREL. Facta and failures of the resampling methods are covered. For bootstrap and jackknife a computer program with graphical facilities is introduced.

A. Boomsma

### On a Class of Robust Methods for Multivariate Data Analysis

A class of robust methods for multivariate data analysis was proposed to enable processing of data that do not satisfy assumptions necessary for the application of classical data analysis methods. Canonical covariance analysis (Momirović, Dobrić and Karaman, 1983) was developed as the general method to analyse relationships between two sets of variables. A model of robust redundancy analysis (Prot, Bosnar and Momirović, 1983), as well as models of robust regression (Štalec and Momirović, 1983) and discriminant (Dobrić and Momirović, 1984) analysis, were derived as special cases from the canonical covariance model. In order to clarify the interpretation of the obtained results, relationships between robust methods and corresponding classical methods were described in our previous papers. A set of programs for all proposed methods was written in GENSTAT and SS languages, and the behaviour of the methods was examined on real data.

V. Dobrić

### Robust Recursive Estimation and Detection of Shifts in Regression

Robust recursive estimation provides considerable computational advantage over iterative robust regression estimation, especially for large and ordered (e.g., with time) data sets. The robust recursive estimates are less sensitive than recursive least squares to the outliers and structural shifts, and produce residuals which are more effective in constructing tests for detecting a shift. In this paper we consider a problem of detecting a shift in regression when it is masked by outliers, and summarize results of a simulation study comparing several tests and estimates of the change point.

E. Kuh, A. Samarov

### How Robust is One Way ANOVA with Respect to Within Group Correlation

The paper shows that One Way ANOVA is highly nonrobust with respect to positive within group correlation because the true level of significance α* of the procedure is much higher than the prescribed level α. Non robustness is also stressed by power considerations.

### An Interactive Graphic System for Designing and Accessing Statistical Data Bases

An integrated graphic computer system is presented that supports the whole life cycle of a statistical data base. The system provides design and documentation tools for creating and maintaining a statistical data base, and a user friendly interface for the interactive retrieval of statistical data. Two types of schemas are adopted: Entity-Relationship schemas for elementary data, and special schemas for aggregated data. For each schema, a diagrammatic representation is provided, so that the user can operate on schemas through the corresponding diagrams.

G. Di Battista, R. Tamassia

### Non-Standard Graphical Presentation

The use of general affine, projective, and curvilinear coordinates for graphical presentation is considered. In addition, geometric techniques other than point representation (general properties of algebraic manifolds, projection from three- or multidimensional spaces) are treated.

J. Gordesch

### Computer Graphics and Data Presentation, a First Step Toward a Cognitive and Ergonomic Analysis

After a brief review of some major papers on statistical graphs and tables, this paper sketches some general principles for choosing a display form and layout suitable for compact and yet faithful transformation of statistical information. An example is provided. It is argued that cognitive task analysis and controlled experimentation are required for understanding how people read (and sometimes misread) tables and graphs. Both improved software and a better training of students are necessary.

I. W. Molenaar

### Expert Systems and Data Analysis Package Management

The possible connections between expert systems and data analysis software are briefly developed, with a touch of pessimism. An example of such a connection is then presented. It concerns the semantic analysis of a command file used in a statistical software package on microcomputers.

J. Jida, J. Lemaire

### Developing Intelligent Software for Non-Linear Model Fitting as an Expert System

A prototype rational front-end for assisting users of the MLP program to input or generate data and formulate a model has been developed as an expert system using the shell EXPERT. Relationships between (a) MLP structure (b) selection of the task for the expert system (c) front-end design and (d) choice of programming tool are described.

C. Berzuini, G. Ross, C. Larizza

### Express — An Expert System Utilizing Standard Statistical Packages

EXPRESS is an expert system based on the idea that complex chains of statistical analyses can be simplified using a standard package for each set of intermediate computations. This principle is illustrated by two-sample testing for equal location parameters, relying on various BMDP programs. Distributional assumptions are decided on after preliminary runs of BMDP. The system each time generates control language for the package, and then extracts relevant information from the print files produced. It is argued that such a stepwise procedure, considering basic results before proceeding to more complex analyses, is close to the actual behaviour of most statisticians. However, this approach raises fundamental questions concerning overall statistical characteristics of the strategy built into the system. It is suggested that these characteristics may be assessed through simulation.

F. Carlsen, I. Heuch

### MUSE: An Expert System in Statistics

This paper describes an approach and some methods to be used in defining the field of applications and the structure of an expert system in statistics. These concepts were applied to develop the expert system MUSE (Multivariate Statistical Expertise). This expert system is intended to be implemented in an industrial environment. The main features and specifications of MUSE are also described in this document.

E. Dambroise, P. Massotte

### Building Expert Systems with the Help of Existing Statistical Software: An Example

This paper discusses the implementation of an expert system shell written completely in SAS. The features of the shell and its interface with the statistical procedures are described. The system provides a flexible tool for executing sequences of statistical analyses guided by a knowledge base in the form of rules.

P. L. Darius

### Statistical Data Validation and Expert Systems

Automated procedures for validating experimental data as they are recorded are described. We outline the development of these procedures as an expert system for data validation.

J. M. Dickson, M. Talbot

### Knowledge Base Supported Analysis of Longitudinal Data

It is the aim of the paper to describe an expert system supporting the user, who is assumed to be at least moderately experienced, in the analysis of longitudinal data. The system’s knowledge about typical statistical processing strategies for this kind of data is used interactively to derive an action plan suitable for a given situation. The basic design features of the system are discussed along with some considerations on the actual decision processes. A brief example and a few implementation issues are also presented.

K. A. Froeschl, W. Grossmann

### How to Assist an Inexperienced User in the Preliminary Analysis of Time Series: First Version of the ESTES Expert System

In this paper we describe the first version of a statistical expert system, called ESTES (Expert System for TimE Series), intended to assist in the preliminary analysis of time series. The ESTES system provides guidance for an inexperienced time series analyst in detecting the following essential properties of time series: trend, seasonality, level shifts and outliers. The system can utilize the user’s own knowledge of the time series being considered and can also give instructions to the user if he does not know, or is not sure, what a good way of dealing with that particular time series is.

P. Hietala

### Object-Oriented Data Representations for Statistical Data Analysis

We discuss the design and implementation of object-oriented datatypes for a sophisticated statistical analysis environment. The discussion draws on our experience with an experimental statistical analysis system, called DINDE. DINDE resides in the integrated programming environment of a Xerox Interlisp-D machine running LOOPS. The discussion begins with our implementation of arrays, matrices, and vectors as objects in this environment. We then discuss an additional set of objects that are based on statistical abstractions rather than mathematical ones and describe their implementation in the DINDE environment.

R. W. Oldford, S. C. Peters

### Algorithms for the Beta Distribution Function

The present paper provides a set of useful algorithms for evaluating the beta distribution function in easily programmed form. These algorithms are the most important of the currently available procedures. Each is expressed in recursive form and a discussion of the quality of each is included. Applications of these algorithms to the evaluation of t, F and noncentral F distribution functions are also included.

H. O. Posten

### Interactive User-Friendly Package for Design and Analysis of Experiments

An interactive package for design and analysis of experiments is being developed for use both by statisticians and by scientists, engineers and managers with a minimal knowledge of statistics. For the non-statistician user the package contains guidance, explanation, validation and interpretation facilities, which aim to represent the assistance of a statistical consultant. The present scope of the package includes aspects of simple comparative, factorial and response surface experiments.

A. Baines, D. T. Clithero

### NL: A Statistical Package for General Nonlinear Regression Problems

NL is a statistical package designed for nonlinear regression problems, taking into account the heteroscedasticity of variance, if any. The algorithm for the estimation of the regression parameters is adapted to the topic but it also presents possibilities for future extensions. NL is composed of independent routines, each of them devoted to a statistical task. In the present version, a host system is used to ensure the interface between the user and the routines.

S. Huet, A. Messéan

### Statistical Software for Personal Computers

In this paper we discuss eight selected statistical packages for personal computers. These packages are tested and the results are presented in a number of tables. We concentrate on packages for IBM PCs and compatibles.

W. J. Keller

### Recursive Techniques in Statistical Programming

Recursive algorithms in the language Pascal are given to display useful features in statistical computing procedures such as branch and bound searching and indexing multi-factor tables. Concepts of provability and efficiency of such algorithms are also discussed, and study of the principles is strongly recommended.

B. P. Murphy, J. S. Rohl, R. L. Cribb
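
As a small illustration of recursive indexing of a multi-factor table, here is a sketch in Python rather than the Pascal used in the paper. The function names `table_cells` and `flat_offset` and the row-major storage layout are assumptions made for this example, not the authors' code.

```python
def table_cells(levels, prefix=()):
    """Recursively yield the index tuple of every cell of a multi-factor table.

    `levels` lists the number of levels of each factor; the last factor
    varies fastest. Illustrative sketch only.
    """
    if not levels:
        yield prefix
        return
    for level in range(levels[0]):
        # Fix the level of the first factor and recurse on the remaining ones.
        yield from table_cells(levels[1:], prefix + (level,))


def flat_offset(cell, levels):
    """Recursively compute the row-major storage offset of a cell."""
    if not cell:
        return 0
    return flat_offset(cell[:-1], levels[:-1]) * levels[-1] + cell[-1]
```

With `levels = [2, 3]`, `table_cells` visits the six cells from `(0, 0)` to `(1, 2)`, and `flat_offset` maps them to the consecutive offsets 0 to 5. The same two functions work unchanged for any number of factors, which is the main attraction of the recursive formulation.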

### Design Criteria for a Flexible Statistical Language

Statistical packages that constrain their users to choosing from a fixed set of analyses restrict the types of investigation that scientists may undertake, or lead to approximate or inappropriate analyses. We describe methods and criteria for designing a flexible package that avoids such deficiencies, with illustrations from our involvement with Genstat, in particular from the design of the new version called Genstat 5. Use of comprehensive algorithms and the ability to include user-supplied source code are important, but the crucial feature is that the commands of the package should themselves constitute a programming language for defining statistical, and other, computations.

R. W. Payne, P. W. Lane

### Statistical Software SAM — Sensitivity Analysis in Multivariate Methods

A statistical software package, SAM, has been developed for performing sensitivity analysis in multivariate statistical methods. The main objectives are to detect influential observations and to evaluate the stability or reliability of the results. An outline of SAM is given with a numerical example.

T. Tarumi, Y. Tanaka

### Data Handling and Computational Methods in Clinical Trials

After outlining the complexity of chronic diseases and commenting briefly upon the levels of strength of evidence of the different studies commonly adopted in bio-medical research, the paper focuses on the problem of using databases for the assessment of therapies and on some typical issues arising in analyzing data collected in randomized clinical trials in cancer and in combining findings from trials carried out in different situations.

E. Marubini

### Database Assisted Conduct of Clinical Trials

In the last few years increasing use of the computer and of software in planning, conducting and analyzing clinical trials has been made. With regard to large multicenter studies it is now no longer conceivable to dispense with these tools. The practical importance attached to them by the trialists is shown by the agendas of recent conferences on clinical trials. During the 1985 meeting of the Society for Clinical Trials, two whole sessions were spent on discussing this subject alone (see SOC. CLIN. TRIALS, 1985). In books on clinical trials, however, this aspect is often omitted, and where it is treated at all the presentation lags behind the rapid development (see e.g. POCOCK (1983), LEE (1984)). A description of the tasks which can be facilitated by the computer and an explanation of how to make use of it are given in the papers of GILLESPIE & FISHER (1985) and VICTOR (1986).

N. Victor, R. Holle

### An APL-System for Analysing Multivariate Data

This paper introduces an interactive computer system based on the generalized multivariate analysis of variance model (GMANOVA). This approach provides comprehensive possibilities for processing multivariate data. Handling of missing data is an integrated part of the system and is done by using the EM algorithm, so missing data analyses can be carried out easily. The system also contains flexible possibilities for diagnostic checking of the model. The problem of influential measurements is considered. The influence analysis envisaged also serves as a means for comparing the robustness of various models to missing measurements and to different study designs. In addition to the usual tests of hypotheses within the GMANOVA model, the single curves and their residuals can also be analysed. Comprehensive summaries over individual curves can be produced. Graphical options for looking at the data and the family of growth curves, for example, are available. The programs are written in APL. Users familiar with APL can easily carry out further calculations, prepare further programs, intervene in any sequence of computations, and check the results immediately.

E. P. Liski, T. Nummi

### AMS — A Computer Program for Survival Analysis, General Linear Models and a Stochastic Survival Model

AMS is a general-purpose statistical computer program that draws on the theory of generalized linear models. It permits a discrete handling of given data and estimated vectors, making use of a simple language that is documented online.

A. Morabito

### A Multilingual Specification System for Statistical Software

An important development to improve the maintainability of software is the use of specification systems. In these systems a specification language is defined in which the design of software can be expressed. The specification is automatically checked for consistency and completeness. In this paper CONDUCTOR, a multilingual specification system for statistical software is discussed. The implementation of a bootstrap estimator is used as an example.

V. J. de Jong

### An Efficient Algorithm for Time Series Decomposition

The aim of this work is to present an efficient algorithm for time series decomposition. Following Akaike (1980), who uses a linear Bayesian model, it is possible to take advantage of the model structure to reduce the computational effort. Corradi and Scarani (1984) used regularization methods, whereas Ishiguro (1984) transformed the coefficient matrix to a banded matrix by suitable permutations. Both methods reduce the computational cost with respect to the original code (Akaike and Ishiguro, 1980). In this work we propose a new approach based on Cholesky factorization of a banded system. Following George (1981), we observed that it is possible to set up an iterative algorithm to find the solution. In typical cases it converges in a few steps. The proposed algorithm is very cheap compared with previous methods. It requires storage of order N, where N is the number of observations in the time series. The number of operations is of the same order. Some applications to economic series are presented.

C. Scarani
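
The order-N cost mentioned in the abstract is a general property of banded Cholesky factorization, which the following Python sketch illustrates. It is not the paper's iterative decomposition algorithm, only the generic building block: a symmetric positive definite matrix with half-bandwidth `p` is factorized and solved using storage proportional to N·p. The band storage layout and the function names `banded_cholesky` and `banded_solve` are assumptions made for this example.

```python
import math


def banded_cholesky(band, p):
    """Cholesky factor of a symmetric positive definite banded matrix.

    Assumed layout: band[i][d] holds A[i][i - d] for d = 0..p (lower band);
    entries that fall outside the matrix are ignored. The returned factor
    uses the same layout, so L[i][d] is the entry in row i, column i - d.
    """
    n = len(band)
    L = [[0.0] * (p + 1) for _ in range(n)]
    for i in range(n):
        for j in range(max(0, i - p), i + 1):
            s = band[i][i - j]
            for k in range(max(0, i - p), j):
                s -= L[i][i - k] * L[j][j - k]
            if i == j:
                L[i][0] = math.sqrt(s)
            else:
                L[i][i - j] = s / L[j][0]
    return L


def banded_solve(band, p, rhs):
    """Solve A x = rhs using the banded Cholesky factor of A."""
    n = len(band)
    L = banded_cholesky(band, p)
    y = [0.0] * n
    for i in range(n):  # forward substitution: L y = rhs
        s = rhs[i]
        for k in range(max(0, i - p), i):
            s -= L[i][i - k] * y[k]
        y[i] = s / L[i][0]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution: L^T x = y
        s = y[i]
        for k in range(i + 1, min(n, i + p + 1)):
            s -= L[k][k - i] * x[k]
        x[i] = s / L[i][0]
    return x
```

For a fixed bandwidth the loops touch at most `p + 1` entries per row, so both the storage and the operation count grow linearly with the number of observations.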

### Database Management and Statistical Data Analysis: The Need for Integration and for Becoming More Intelligent

Database management and statistical data analysis are often considered as two rather independent parts. In this paper we want to point out that database management should be integrated into statistical analysis systems and that difficulties arise when database management is done within a database system whereas statistical data analysis is done separately in a statistical analysis system. Within such a system it is also possible to obtain a more “intelligent” system which is able to support a user more efficiently.

R. Haux, K.-H. Jöckel

### Privacy, Informatics and Statistics

Statisticians have always been quite sensitive to the need for protection of privacy of information regarding both people and legal bodies (institutions, associations and commercial business). This attitude is, as a matter of fact, one of the key conditions to achieve cooperation and reliable data. Information collected through a statistical sample, a census, fiscal registrations, social security sources, etc. has a high risk of different kinds of improper use. Therefore the national statistical offices make every possible effort to render inaccessible individual data by protecting the original files and avoiding publication of statistical tables in which extreme values with a very low frequency appear.

G. Marbach

### Italian Statistical Data Dictionary System Prototype Developed with a 4th Generation Language

Statistical Information Systems are systems conceived to collect, structure and make available the statistical data needed for the correct and prompt investigation and the management of an organism, reducing its entropy level. Data Dictionaries handle data about the data of information systems; they represent a fundamental environment for their development and management; in statistical applications they play an even more important role and take on some peculiarities arising from the special nature and complexity of the data. 4th Generation Languages are interactive systems for developing and exploiting automated solution methods and information systems; they increase productivity by 1000%, requiring a learning period of a few days and offering direct visibility of data and procedures; this enhanced performance makes prototype analyses, incremental developments and case simulations feasible. A prototype of the Statistical Data Dictionary System for the Italian National Institute of Statistics has been built by exploiting the efficiency and the power of the MAPPER 4th Generation Language.

P. Costa, F. Marozza, F. Vinciguerra

### A Methodology for Conceptual Design of Statistical Databases

In statistical applications, data are described at different levels of aggregation, from elementary facts of reality to complex aggregations such as classifications, time series and indexes. This paper describes a methodology for the conceptual design of statistical databases that provides the designer with suitable strategies for defining such different levels of aggregation starting from user requirements, and for checking the completeness, coherence and minimality of the conceptual schema at the different levels.

G. Di Battista, G. Ferranti, C. Batini

### Towards Natural Integration of Database Management System and Statistical Software

The paper presents an integrated software architecture for a data processing system that consists of relational data management tools and selected statistical functions. A complete set of ReDS software facilities is available at the uniform level of the end-user interface. The statistical component of the system can be tuned easily to meet the changing requirements of the user.

P. J. Jasiński, B. Butlewski, S. Paradowski

### Easy-Link: An Expert Natural Language Interface to a Statistical Data Bank

The paper describes the general framework of the EASY-LINK system: an expert interface in Italian natural language, which allows an intelligent interrogation of the economic, territorial, statistical data bank “SUPERSTAT” developed by SARIN. After a discussion of the problems of misuse of statistical packages by naive users and of the challenge for Artificial Intelligence techniques in statistics, attention is focused on the system’s capabilities and architecture. Our approach is to increase the user’s gain by making the computer’s response more informative, and hence to reduce his uncertainty about his problem. The system consists of two modules, the natural language interface and the expert system on data bank enquiry, integrated into a single system where processing takes place in a cooperative way.

G. Lella, S. Pavan, P. Bucci

### Farandole: A RDBMS for Statistical Data Analysis

We propose an RDBMS (Relational Data Base Management System) for the management of statistical data. The main contribution of FARANDOLE is, on the one hand, to free statisticians from the problems of manipulating the physical organization of files and, on the other hand, to check the validity of the data automatically. An interface with well-known data analysis programs such as SAS, SPSS and MODULAD will enable FARANDOLE to include the functions of these programs within its own RDBMS functions. The kind of data used, as well as the operations executed on these data, differ from “classical” data. Therefore new RDBMS functions oriented towards data analysis are to be implemented. The data model of FARANDOLE is a relational data model that has been enhanced to better capture the needs of data analysis. With FARANDOLE it is possible to implement complex data structures and to store the results of data analysis methods. For data storage we use a particular file organization: transposed files. This kind of file organization considerably simplifies the specific operations made on statistical data and makes data compression easier. The need for data compression comes from the large number of missing and redundant data in the field of statistical data analysis. A specific language is developed to enable statisticians to declare a data scheme using their usual vocabulary and concepts.

M. Leonard, J. J. Snella, A. Abdeljaoued

### Model- and Method Banks (MMB) as a Supplement to Statistical Data Banks

Using large data bases and computer aided technologies (CA Analysis, CA Interpretation, ADA, EDA) creates the need for more sophisticated and more quickly accessible statistical models and methods. One possible solution is to organize them in the same way that proved successful with the data. This can be done in different variants, e.g. as a special part of existing data banks or as separate MMB, either on the meta level or containing the real models and methods. MMB are expected to ease access to models and methods in all phases of statistical analysis, helping the end user to analyze large data sets faster, more reliably and in more complex ways. They are computerized tools using modern information processing technologies. The paper describes the steps to organize MMB and their use, which could lead to knowledge bases as part of expert systems.

K. Neumann

### A Security Mechanism Based on Uniform Rounding to Prevent Tracker Techniques

In the last decade powerful snooping tools like the tracker have been developed, with which a questioner is able to deduce confidential information about individuals by processing statistical queries. In the first section of this paper we discuss a security mechanism based on Bernoulli experiments. Experimental results show that although this method has good statistical properties, the relative errors of the estimators are high. The idea of exclusion or multiple inclusion of individuals (belonging to the specified query set) together with uniform rounding leads to a security mechanism with small relative errors. However, the statistics are no longer unbiased, so small query set compromise and trackers are no longer possible.

H. Sonnberger

### A Data Analysis Course in a Computer Science Faculty

We discuss here the “what”, “how” and “why” of this course. First we examine the nature of Data Analysis, a very broad concept that is used in different senses. We present a general overview of the different techniques covered by this concept and of their relations and uses.

T. Aluja-Banet, M. Marti-Recober

### Teaching Computational Statistics on Workstations

A statistical computing laboratory of ten workstations with 32-bit architecture has recently been installed at the University of Lancaster. This article discusses the potential of these machines for teaching statistics.

B. Francis, J. Whittaker

### Teaching Regression Analysis Using MIMOSA — an Interactive Computer System for Matrix Calculus

This paper describes the use of MIMOSA in the teaching of regression analysis. MIMOSA is an interactive computer system for matrix calculations developed at the Computer Centre of the University of Tampere.

S. Puntanen, H. Kankaanpää

### Practical Use of Color Imaging in Automatic Classification

Some proposals are presented for utilizing color information in data analysis in computer graphics environments. The possibilities presented by using color in data analysis and the problems involved are presented first, then the systematization of color models in data analysis is dealt with. A proposal is submitted regarding the user interface software in color graphics system design. Examples of data analysis using color are provided showing color imaging patterns of data matrices, color plots of principal component scores, and so on.

N. Ohsumi

### Algorithm and Software of Multivariate Statistical Analysis of Heterogeneous Data

A new approach to the analysis of statistical data is proposed. The essence of the approach is the use of the class of logical decision functions of heterogeneous variables (Boolean, nominal, rank and quantitative ones) for data analysis. On the basis of this approach, methods for solving various problems of statistical analysis have been developed: discriminant and regression analysis, cluster analysis, and analysis of multivariate time series. For brevity, these problems will be called data analysis problems.

G. S. Lbov

### Backmatter
