## About this book

The papers assembled in this book were presented at the biennial symposium of the International Association for Statistical Computing in Neuchâtel, Switzerland, in August of 1992. This congress marked the tenth such meeting from its inception in 1974 at Vienna and maintained the tradition of providing a forum for the open discussion of progress made in computer oriented statistics and the dissemination of new ideas throughout the statistical community. It was gratifying to see how well the groups of theoretical statisticians, software developers and applied research workers were represented, whose mixing is an event made uniquely possible by this symposium.

While maintaining traditions, certain new features have been introduced at this conference: there were a larger number of invited speakers; there was more commercial sponsorship and exhibition space; and a larger body of proceedings has been published.

The structure of the proceedings follows a standard format: the papers have been grouped together according to a rough subject matter classification, and within topic follow an approximate alphabetical order. The papers are published in two volumes according to the emphasis of the topics: volume I gives a slight leaning towards statistics and modelling, while volume II is focussed more on computation; but this is certainly only a crude distinction and the volumes have to be thought of as the result of a single enterprise.

## Table of contents

### Issues in Computational Data Analysis

Computers have been widely used to analyze data since the 1950’s. Around 1970 a few experimental installations reached a high enough performance to permit exploratory, interactive analysis of non-trivial data sets. We are now reaching another plateau, where the scatterplot performance of inexpensive high-end PC’s approaches the limits of the human perceptual system. Our discussion concentrates on general computing aspects rather than on specific algorithms. With a view towards the more distant future, we should also begin to think about exploratory data analysis strategies, tools and techniques that go beyond those in present use and that extend to larger data sets.

Peter J. Huber

### Editorial Interface in Statistical Computing and Related Areas

A general environment for statistical computing and other related areas is described. The functions within this framework are based on an editorial interface. It permits the statistician to control all stages of the work by a general text editor. Even statistical data sets can be written in the edit field, which is a visible work sheet on the screen. For large data bases, special data files are provided. Information from other sources, like text files, is easily imported. The ideas presented are realized in the Survo system. Besides statistical computing and analysis, this system covers various functions in general data management, graphics, spreadsheet computing, matrix operations, text processing, and desktop publishing. The newest extension to the editorial interface is a special macro language. This language is used for making various teaching and expert applications.

S. Mustonen

### Inside ESIA: An Open and Self-Consistent Knowledge Based System

The European project ESIA includes a generic representation of statistical knowledge in a “fact-based” shell. This representation makes it possible to check the consistency of the knowledge entered by the different levels of users. A prototype was tested in January 1992 with the interface and the knowledge needed for the two lowest levels of users, that is to say the primary intended audience of ESIA.

H. Augendre, G. Hatabian

Dynamic graphics and linking of plots are features already realized in today’s statistical analysis systems. Hypertext systems also exist in many variations. But only a few systems provide statistical analyses and also make use of hypertext features. Using hypertext (or hypermedia), statistical analyses and especially statistical consulting can be made more efficient.

Axel Benner

### A Consultation System for Statistical Analysis on Hypertool

In this paper, we describe a statistical consultation system supporting non-statisticians. There are many kinds of statistical software, but most of them require knowledge of and experience with statistics. So the development of a consultation system with knowledge of data analysis is awaited by many users. The semantic network is one representation model of knowledge. This model is effective for hierarchical knowledge and easy to modify. The hypertool has the same structure as a semantic network. We treat a consultation system based on the hypertool. This system is extensible and has the advantage that statistical knowledge is easy to add and modify.

Atsuhiro Hayashi, Tomoyuki Tarumi

### Computer-Intensive Statistical Methods

We briefly review bootstrap and other resampling methods for statistical inference. Topics covered include: regression problems; complex dependence; efficient resampling methods; confidence limits; bootstrap hypothesis tests; iterative resampling; and bootstrap and other empirical likelihoods.

A. C. Davison, D. V. Hinkley
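
As a concrete illustration of the resampling idea and of bootstrap confidence limits, here is a minimal sketch of a percentile bootstrap interval. It is not taken from the paper: the function name `bootstrap_percentile_ci`, the data, and the choice of the plain percentile method are assumptions made for illustration.

```python
import random
import statistics


def bootstrap_percentile_ci(sample, stat=statistics.mean, n_boot=2000,
                            alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(sample).

    Hypothetical helper for illustration; uses the simplest (percentile)
    interval, not a second-order-correct method.
    """
    rng = random.Random(seed)
    n = len(sample)
    # Resample with replacement and recompute the statistic each time.
    replicates = sorted(
        stat([sample[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    # Read the interval limits off the sorted bootstrap replicates.
    lo_idx = int((alpha / 2) * n_boot)
    hi_idx = int((1 - alpha / 2) * n_boot) - 1
    return replicates[lo_idx], replicates[hi_idx]


# Made-up data, used only to exercise the function.
data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9, 3.1, 3.6]
lower, upper = bootstrap_percentile_ci(data)
```

The fixed seed makes the interval reproducible; in practice the number of replicates and the interval type would be chosen to suit the problem.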

### Exact Logistic Regression: Theory, Applications, Software

We provide an alternative to the maximum likelihood method for making inferences about the parameters of the logistic regression model. The method is based on appropriate permutational distributions of sufficient statistics. It is useful for analysing small or imbalanced binary data with covariates. It is also applicable to small-sample clustered binary data. We illustrate the method by analysing several biomedical data sets with LogXact, a new module of the StatXact software package.

Cyrus R. Mehta, Nitin R. Patel
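
To make the idea of a permutational distribution of sufficient statistics concrete, the following sketch computes an exact one-sided p-value for a single covariate by brute-force enumeration. It is an illustrative assumption, not the algorithm used in LogXact: the function name `exact_conditional_p_value` is hypothetical, and full enumeration is feasible only for very small samples.

```python
from itertools import combinations


def exact_conditional_p_value(x, y):
    """One-sided exact p-value for H0: beta = 0 in
    logit P(y = 1) = alpha + beta * x.

    Conditions on s = sum(y), the sufficient statistic for alpha, and uses
    t = sum(x_i * y_i), the sufficient statistic for beta.
    Brute-force sketch for tiny samples only.
    """
    n, s = len(y), sum(y)
    t_obs = sum(xi * yi for xi, yi in zip(x, y))
    # Under H0, every placement of the s successes among the n cases
    # is equally likely, so enumerate all of them.
    totals = [sum(x[i] for i in idx) for idx in combinations(range(n), s)]
    # Proportion of placements at least as extreme as the observed one.
    return sum(t >= t_obs for t in totals) / len(totals)


# Made-up, perfectly separated data: maximum likelihood has no finite
# estimate here, but the conditional test is still well defined.
x = [0, 0, 0, 1, 1, 1]
y = [0, 0, 0, 1, 1, 1]
p_value = exact_conditional_p_value(x, y)
```

With three successes among six cases there are 20 equally likely placements, and only the observed one reaches the observed value of t.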

### Optimal Choice for a Hodges-Lehmann Type Estimator in the Two-Sample Problem with Censoring

The Hodges-Lehmann estimator (modified to take into account right censoring) can be used to estimate the treatment effect in the two-sample problem with right censoring under a shift or scale change model assumption. In this work we investigate, with the help of the bootstrap, the optimal choice of calibration parameters that yield a shift estimator with minimum variance. A set of simulations is presented, covering a variety of underlying survival and censoring distributions. The simplicity of use of the bootstrap and the robustness of the shift estimator (under minimal assumptions) can give wide applicability to the estimation method discussed here.

Y. Bassiakos
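
For orientation, the sketch below shows the classical uncensored Hodges-Lehmann shift estimator together with a bootstrap estimate of its variance. The censoring modification and the calibration parameters studied in the paper are not specified in the abstract and are not reproduced here; the function names and data are hypothetical.

```python
import random
import statistics


def hodges_lehmann_shift(x, y):
    """Median of all pairwise differences y_j - x_i (no censoring)."""
    return statistics.median(yj - xi for xi in x for yj in y)


def bootstrap_variance(x, y, n_boot=500, seed=0):
    """Bootstrap estimate of the variance of the shift estimator.

    Each group is resampled with replacement independently.
    """
    rng = random.Random(seed)
    replicates = []
    for _ in range(n_boot):
        xb = rng.choices(x, k=len(x))
        yb = rng.choices(y, k=len(y))
        replicates.append(hodges_lehmann_shift(xb, yb))
    return statistics.variance(replicates)


# Made-up samples: the treated group is the control group shifted by 2.
control = [1.0, 2.0, 3.0]
treated = [3.0, 4.0, 5.0]
shift = hodges_lehmann_shift(control, treated)
shift_variance = bootstrap_variance(control, treated)
```

A bootstrap variance of this kind is the sort of quantity one could compare across candidate parameter settings when looking for a minimum-variance estimator.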

### Simulation in Extreme-Value Estimation by a Combined S-PLUS and C-Program

Extreme-value estimation is of great importance when one is interested in, e.g., protecting land against the sea by dikes. In Dekkers [1], asymptotic theory is studied for some estimates of the extreme-value index and of very large quantiles (outside the sample range), as well as for estimators of the right endpoint of a distribution function in case this endpoint is finite. In order to obtain some insight into the behaviour of these estimates for finite samples, some simulation studies are carried out and comparisons between the asymptotic theory and the simulation results are made. The statistical package S-PLUS [2,3] is used for the simulation and the further analysis in combination with a C-program. This combination is the main object of this paper, but first some theoretical results are given.

Arnold L. M. Dekkers

### The Usefulness of Exact Statistical Methods in Equal Employment Litigation

Statisticians giving testimony in the courts face a special problem in presenting their results, as courts tend to make intuitive judgements about the adequacy of the sample size without inquiring as to the statistical implications. In the equal employment context, this is compounded by the fact that when it accepted statistical tests, the U.S. Supreme Court adopted the large sample normal approximation to the binomial distribution (Castaneda vs. Partida, 1977). Although Baldus and Cole (1987) and Gastwirth (1988) have mentioned the arbitrariness of letting each judge accept or reject an analysis on the basis of the sample size, without any analysis of statistical concepts of power (Fienberg and Straf, 1982), courts have only recently begun to consider small samples. With the development of convenient computer packages, not only can issues involving the accuracy of large sample approximations be avoided, but the important concept of power (Goldstein, 1985; Gastwirth and Wang, 1987) can also be presented to courts. Hopefully, this will enable judges to reflect on the propriety of choosing a non-significant result when a test of power zero is used in preference to a significant result with a powerful test (Gastwirth and Wang, 1987) or of accepting an argument that significance would disappear had one or two more minorities been hired (Kadane, 1990) without considering the consequent reduction in power.

Joseph L. Gastwirth, Cyrus R. Mehta

### On Classifier Systems, Disintegrated Representation and Robust Data Mining

Let x be a boolean vector of predictors and y a scalar response associated with it. Consider the problem of learning the relationship between predictors and response on the basis of a sample of observed pairs (x,y). Various proposed models for automated inference in this case are discussed, and some general dimensions for comparison among them are laid out. These dimensions are inspired by basic characteristics of (inductive) human learning. For example, our ability to receive, encode and modify knowledge in an unlimited number of ways would require a highly flexible representation, while our ability to detect and not be misguided by exceptional phenomena would entail some kind of uncertainty assessment. Systems that emulate such features of human learning may inherit some of its intrinsic robustness.

Classifier systems (CSs) provide a rich framework for general computation and learning, which are intimately related to cognitive and statistical modelling [6,7]. These systems combine low-level building blocks, called classifiers, according to broad plausible heuristics, to form emerging high-level knowledge structures. Knowledge is indeed distributed among classifiers, and these are subject to mild representational constraints. This is an example of disintegrated representation. It is argued that this kind of representation provides a highly flexible approach to automated inference. A new prototype based on a simple CS architecture introduces disintegrated representation in the above data analysis context.

Jorge Muruzabal

### Variance Reduction for Bernouilli Response Variables

A method is suggested for reducing the sampling errors in simulations performed to estimate the expectation of a dichotomous variable. It is equivalent to the variance reduction technique known as control variates. The new estimator is unbiased. Some ways to estimate its variance (and to estimate the true amount of variance reduction) are suggested. A simulation study (a simulation within a simulation, requiring the use of supercomputing techniques) is presented in order to show the validity of this approach in the determination of the power of a test.

Esteban Vegas, Jordi Ocaña
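
The generic control variates technique mentioned in the abstract can be sketched as follows for a binary response. This is not the authors' estimator: the toy target, the binary control with known mean, and all function names are assumptions chosen for illustration. The coefficient is estimated from a separate pilot run so that it is fixed when the main run is made, which keeps the main estimate unbiased.

```python
import random


def simulate(n, rng):
    """Return n paired draws (y, c): y is the target indicator, c the control."""
    pairs = []
    for _ in range(n):
        u1, u2 = rng.random(), rng.random()
        y = 1 if u1 + u2 > 1.5 else 0   # toy target: E[y] = 0.125
        c = 1 if u1 > 0.5 else 0        # control: E[c] = 0.5 is known exactly
        pairs.append((y, c))
    return pairs


def mean(values):
    return sum(values) / len(values)


def pilot_coefficient(pairs):
    """Estimate b = cov(y, c) / var(c) from a pilot sample."""
    y_bar = mean([y for y, _ in pairs])
    c_bar = mean([c for _, c in pairs])
    cov = mean([(y - y_bar) * (c - c_bar) for y, c in pairs])
    var = mean([(c - c_bar) ** 2 for _, c in pairs])
    return cov / var


def control_variate_estimate(pairs, b, c_mean=0.5):
    """Correct the raw proportion using the control's known expectation."""
    y_bar = mean([y for y, _ in pairs])
    c_bar = mean([c for _, c in pairs])
    return y_bar - b * (c_bar - c_mean)


rng = random.Random(1)
b = pilot_coefficient(simulate(2000, rng))          # pilot run fixes b
estimate = control_variate_estimate(simulate(20000, rng), b)  # main run
```

The variance reduction depends on how strongly the control is correlated with the response; a weakly correlated control brings little gain.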

### Graphical Aids for Nonlinear Regression and Discriminant Analysis

In nonlinear regression and discriminant analysis applications, graphical displays can facilitate understanding and communication with subject area researchers. For nonlinear regression, Cook-Weisberg confidence curves, Wald intervals, and contour plots of the loss function in the parameter space provide information about the certainty of estimates and also about estimates of functions of parameters. For linear and quadratic discriminant analysis models, scatterplots bordered by box plots aid transformation selection as the data analyst uses a GUI to quickly transform plot scales. A lasso plot tool provides a link from points (cases) in displays to a data worksheet with all values. Within-group histograms help to detect outliers and to study the spread across groups. A failure of the equal covariance matrix assumption may be seen in within-group scatterplot matrices embellished with ellipses of concentration. Enhancements to canonical variable plots allow easy identification of misclassified cases.

MaryAnn Hill, Laszlo Engelman

### Intervention Analysis with SPSS/PC+ Trends

Intervention analysis is a technique used to examine the pattern of a time series before and after the occurrence of an event that changes the normal behaviour of the series. The goal is to find a quantitative assessment of the impact of this event on the series pattern.

K. Christof

### A New Windows-Based Statistical Analysis Environment

Modern techniques for data exploration and computer-intensive modelling require a new form of statistical computing environment. Advances in systems architecture and software engineering techniques are available to support an intuitive yet powerful interface to both routine and experimental statistical analyses. A layered windows-based system provides two levels of interaction. A statistical graphical interface is available, providing high quality graphical output and numerical computation on demand, and masking the usual input/output boundary. A statistical computation language supports the user interface. A novel series of GUI extensions allows the user to specify the analysis using graphical or command syntax modes.

G. C. FitzGerald, T. A. Hurley

### Aspects of the “User Friendliness” of Statistical Software: A Pleading for Software Adaptability

Statistical Software has proven to be an indispensable tool for statistical analysis. Increasingly the user interface of statistical software is a question of concern. The request for ‘user friendliness’ of statistical software is often discussed without reference to the type of user in mind. This paper addresses the current needs of statistical software users, and discusses to what extent those needs are met today. It will become apparent that these needs vary considerably according to the statistical, programming, or subject knowledge background of the potential user.

G. Held

### Statistics Software vs. Problem Solving

Despite the large offering of statistics packages available today, most data analysis is carried out by non-statisticians using minimal statistical concepts. This essay outlines how the shortcomings of statistics programs prevent their being integrated into the everyday data analysis process. It describes Datavision’s approach to applications development and how PC-ISP is used to solve the problem at hand, from the early experimental design all the way to the routine analysis system used by non-expert users.

Thomas M. Huber

### S-PLUS Version 3

S-PLUS is a very modern interactive language and system for graphical data analysis, statistical modeling and mathematical computing. This paper provides a brief overview of the significant new features and capabilities of Version 3 (both 3.0 and 3.1) of S-PLUS.

Stephen P. Kaluzny, R. Douglas Martin

### Generalizing GLMs in Genstat

Generalized linear models have become widely used and are available in several statistical packages. Recently there have been several extensions to this family of models. This paper describes three such extensions and how they can be implemented in the statistical package Genstat.

G. W. Morgan

### The New Simulation Packages: New Powers Bring New Users

We discuss certain new simulation programs here, not so much in terms of the problems they solve or the minutiae of their code, but in terms of the major feature that makes them attractive to a new class of users: non-programmers who find their visual representation of statistical information, real-world models and the performance monitoring process useful.

Brian P. Murphy

### Statistical Packages and Systems Evaluation

Statistical packages and systems evaluations are not simple matters. Deterministic evaluations are very important insofar as they may provide objective indexes, but they are rather difficult to apply. Many attempts to evaluate statistical program products, based on lists of criteria, have been made by statisticians. We propose some steps towards a new integrated design for evaluating statistical software.

A. Giusti

### An Environment for Montecarlo Simulation Studies (EMSS)

EMSS is an Integrated Development Environment intended to help the user write, execute and analyze programs to perform Monte Carlo Simulation Studies. Its operation may be exclusively menu-based or its libraries may be adapted by the user to complement his or her own programs. The program is written in OOP-Pascal and it is strongly Object-Oriented in the sense that it is based on a class of “Statistical Objects” which may be used, created or extended by the user.

Alex Sanchez, Jordi Ocaña, Carmen Ruíz de Villa

### Generation of Optimal Designs for Nonlinear Models when the Design Points are Incidental Parameters

This paper considers the problem of finding D-optimal designs for nonlinear exponential models when the design points are not known, and have to be approximated or estimated. The procedure in this paper not only deals with this problem, but also handles the problem that for nonlinear models the information of the parameters is a function of the values of these parameters. The results will be discussed and illustrated for item response theory models that are frequently used in psychometric research.

Martijn P. F. Berger

### I-Optimality Algorithm and Implementation

Exact designs of experiments are frequently sought that are optimal at producing predictive response-surface models. Until recently, however, there have not been any software systems capable of finding such designs on continuous spaces, except for very specific models. We present the algorithmic and implementation details of our software program, I-OPT™ [1], for finding exact, continuous-space designs that minimize the integrated expected variance of prediction over the region of interest (sometimes known as either the I- or IV-optimality criterion) for general quantic models.

Selden B. Crary, Ling Hoo, Mark Tennenhouse

### Use of a Symbolic Algebra Computer System to Investigate the Properties of Mixture Designs in Orthogonal Blocks

The symbolic algebra system MAPLE is used to investigate the properties of mixture design families, consisting of orthogonal block designs formed from pairs of Latin squares. In particular, D-optimal designs within the families are found using the algebraic forms of |X’X|, resulting from fitting the Scheffé quadratic model with design matrix X.

A. M. Dean, S. M. Lewis, P. Prescott, N. R. Draper

### A-Optimal Weighing Designs with n ≡ 3(mod4) and their Information Matrix Form

In constructing weighing (or factorial) designs, the problem of optimality arises. In this paper we study the form of the information matrices of A-optimal weighing designs $(n,k,s_{\mathrm{opt}})$ (abbreviated A-OWD) when the number of observations is $n \equiv 3 \pmod 4$. We are interested in critical values $k_o(n)$ for which the following statement is valid: “If $k \ge k_o(n)$ then $M_k^* \ne (n+1)\cdot I_k - J_k$, where $I_k$ is the identity matrix of order $k$ and $J_k$ is a $k \times k$ matrix with all its entries equal to 1”. The symbol $(*)$ on the information matrix $M_k$ denotes A-optimality.

It is known that for values $k \ge k_o$, the information matrix $M_k^*$ of an A-OWD $R^*_{n \times k}$ is a block diagonal matrix with $s_{\mathrm{opt}}$ blocks of one size $r$ or two contiguous sizes $r$ and $r+1$: $${{B}_{q}}=\left( n-3 \right)\cdot {{I}_{q}}+3\cdot {{J}_{q}},q=r,r+1$$. Some very simple formulae for $k_o(n)$ are proposed, for $n \le 203$, $n \equiv 3 \pmod 4$.

Nicolas Farmakis, Maria Georganta
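
The block form quoted in the abstract lends itself to a small worked computation. The sketch below builds a block $B_q = (n-3) I_q + 3 J_q$, inverts it in closed form, and sums the traces of the block inverses, which is the quantity an A-optimal design minimizes. The function names are hypothetical, and the number of blocks `s` is simply supplied by the caller; the paper's rule for choosing the optimal number of blocks is not reproduced here.

```python
from fractions import Fraction


def block_sizes(k, s):
    """Split k columns into s blocks of contiguous sizes r and r + 1."""
    r, extra = divmod(k, s)
    return [r + 1] * extra + [r] * (s - extra)


def block(n, q):
    """B_q = (n - 3) * I_q + 3 * J_q as a list of rows."""
    return [[Fraction((n - 3) + 3) if i == j else Fraction(3)
             for j in range(q)] for i in range(q)]


def block_inverse(n, q):
    """Closed-form inverse of a*I + 3*J with a = n - 3 (requires n > 3)."""
    a = Fraction(n - 3)
    off = -Fraction(3) / (a * (a + 3 * q))
    return [[1 / a + off if i == j else off for j in range(q)]
            for i in range(q)]


def a_criterion(n, k, s):
    """Trace of the inverse of the block diagonal information matrix."""
    total = Fraction(0)
    for q in block_sizes(k, s):
        inverse = block_inverse(n, q)
        total += sum(inverse[i][i] for i in range(q))
    return total


# Example with n = 7 (7 ≡ 3 mod 4), k = 2 columns in a single block.
value = a_criterion(7, 2, 1)
```

Exact rational arithmetic via `Fraction` avoids rounding when comparing the criterion across different choices of `s`.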

### Quasi-Sequential Procedures for the Calibration Problem

The calibration problem has been discussed widely, by many authors adopting different lines of thought: Classical, Bayesian, Structural. We treat it as a nonlinear experimental design problem and introduce a quasi-sequential procedure to overcome the poor initial knowledge about the parameters we want to estimate. A simulation study provides empirical evidence that significant improvements can be achieved.

C. P. Kitsos

### An Interactive Window-Based Environment for Experimental Design

Although current statistical packages are convenient and powerful tools for statistical analysis, their usefulness in the planning stage of the experiment is limited. In this paper we describe a system that assists the user in building up a design with many possibilities to examine its properties and change them.

M. Nys, P. Darius, M. Marasinghe

### The Use of Identification Keys and Diagnostic Tables in Statistical Work

Identification keys and diagnostic tables are tools for efficient deterministic identification using discrete-valued characteristics. Their potential use in statistical packages is described, together with the necessary algorithms and data structures.

R. W. Payne

### Construction of a Statistical KBFE for Experimental Design Using the Tools and Techniques Developed in the Focus Project (Esprit II num. 2620)

Commercial application of powerful statistical techniques will be possible only if they are designed to be “user-friendly” to managers and engineers. One way to achieve this goal is through the use of information technologies, in particular KBFEs (Knowledge Based Front Ends). One of the products of the basic research conducted in the FOCUS project is a collection of tools and techniques directed towards the development of such KBFEs, together with a general architecture that integrates all of them.

Several partners of the FOCUS consortium are using this architecture to build prototypes which are useful in many domains of quality improvement. The UPC (Universitat Politecnica de Catalunya) has developed three prototypes, all sharing the same philosophy: to make sophisticated and powerful statistical techniques easy to use. One of these prototypes, DOX (Design of Experiments), is described in this report. DOX is aimed at helping engineers in the design, analysis and interpretation of factorial and fractional factorial designs with factors at two levels. It uses a sequential strategy, with or without blocking, taking into account economic and technical restrictions. It also provides extensive help facilities to guide the user.

A. Prat, J. M. Catot, J. Lores, J. Galmes, A. Riba, K. Sanjeevan

### MRBP Tests and their Empirical Power Performance for Symmetric Kappa Distribution

Two MRBP rank test statistics are applied to simulated data from symmetric kappa populations with three different values of the parameter. Based on four moment approximations of their permutation distributions, empirical powers of these tests are computed and compared for 3 and 4 treatments and 80 blocks.

D. S. Tracy, K. A. Khan

### Optimal Experimental Designs in Regression: A Bootstrap Approach

This paper presents a new family of optimal design criteria for parameter estimation in nonlinear regression, based on minimization of expected volumes of bootstrap confidence regions that are at least second-order correct. This approach relies on the bootstrap use of previous fitting information, and is free of any probability distribution hypothesis for the errors (except that they are centered and i.i.d.), while improving the classic first-order asymptotic approximation of D-optimality.

J. P. Vila

### KEYFINDER — A Complete Toolkit for Generating Fractional-Replicate and Blocked Factorial Designs

KEYFINDER is a comprehensive, menu-driven, Prolog program for generating, randomizing and tabulating factorial designs in general situations. Its particular forte is the use of search procedures to generate fractional-replicate and blocked designs meeting the user’s detailed a priori specifications vis-a-vis design dimensions and aliasing/confounding properties. This paper gives a broad overview of the system and its facilities.

P. J. Zemroch

### ICM for Object Recognition

The Bayesian approach to image processing based on Markov random fields is adapted to image analysis problems such as object recognition and edge detection. Here the input is a grey-scale or binary image and the desired output is a graphical pattern in continuous space, such as a list of geometric objects or a line drawing. The natural prior models are Markov point processes and random sets. We develop analogues of Besag’s ICM algorithm and present relationships with existing techniques like the Hough transform and the erosion operator.

A. J. Baddeley, M. N. M. van Lieshout

### Bootstrapping Blurred and Noisy Data

We consider the use of the bootstrap within the context of the restoration of an unknown signal from a version which has been corrupted by blur and noise. We briefly discuss three issues, namely using the bootstrap to select the smoothing parameter, to perform an adaptive restoration and to construct an interval estimate of the unknown signal at one or several points. We discuss some empirical results.

Karen Chan, Jim Kay

### Multiresolution Reconstruction of a Corrupted Image Using the Wavelet Transform

We extend the work of Mallat on the use of wavelets in the multiresolution analysis of two-dimensional images to the situation where the unknown image has been corrupted by blur and noise. We introduce multiresolution versions of the iterated conditional modes and simulated annealing algorithms which allow the simultaneous estimation of the unknown image and also the location of discontinuities; the discontinuities may be modelled implicitly or explicitly. We compare our multiresolution algorithms with fixed resolution versions and illustrate their superiority in terms of both quality of reconstruction and computational speed using synthetic images.

Bing Cheng, Jim Kay

### Are Artificial Neural Networks a Dangerous Adversary for Statistical Data Analysis?

This paper makes a comparison between Artificial Neural Networks (ANN) and Statistical Data Analysis (SDA), and more precisely between the “statistical process” and the “neuronal process”. ANN and SDA must co-exist, so we need to know what the conditions of this future common life will be: competition? complementarity? redundancy?

Gérard Hatabian

### Canonical Correlation Analysis Using a Neural Network

We introduce an artificial neural network which performs a canonical correlation analysis. Our approach is to develop a stochastic algorithm which converges to the stationarity equations for the determination of the canonical variables and the canonical correlations. Although the intrinsic algorithm is not local in the sense that the computation at a particular unit involves the values of distant units in the network, it is possible to employ a simple recursive form of communication between neighbouring nodes in order to achieve local computation. Some non-linear possibilities are discussed briefly.

Jim Kay

### Artificial Neural Networks as Alternatives to Statistical Quality Control Charts in Manufacturing Processes

The concept of quality has reentered the vocabulary of American business. The perception, whether founded in reality or not, that American products are inferior to their foreign counterparts, has contributed to the competitive disadvantage now faced by many American firms.

David Kopcso, Leo Pipino, William Rybolt

### Analytical Analysis of Decentralised Controlled General Data Communication Networks

A new approach is considered for analysing general data communication networks under decentralised control. This approach solves the network analytically, based on the principle of maximum entropy. Two user classes are considered. All traffic is generally distributed with known first two moments. The performance measures of the users can be obtained by using the maximum entropy analysis of the two-user-class G/G/1 finite capacity queue with partial and joint restricted arrival process.

A. T. Othman, K. R. Ku-Mahamud, M. H. Selamat

### Meta Data

#### Frontmatter

Properties of microdata can assist in the process of statistical model building, providing information additional to that arising from background theory or observed distributions. Macrodata has storage and manipulation problems different from those of microdata. Metadata describes the properties of both microdata and macrodata and the relationships between them.

D. J. Hand

### Semi-Filled Shells and New Technology of the Subject-Oriented Statistical Expert Systems Construction

The main idea of the proposed technology for constructing Subject-Oriented Statistical Expert Systems (SOSES) is to create every new SOSES not from a “zero” point, but from a particular universal non-empty shell (further referred to as a “semi-filled” shell, to distinguish it from the ordinary shells used for expert systems). This special shell has to be filled, first, with the mathematical tools commonly used in all different kinds of SOSES, and, second, with the part of the technology of its adjustment to the subject field that does not vary with the field itself.

The implementation of this idea is based on a system for the extraction of statistical knowledge from an experienced applied statistician, and the formalization and representation of this expertise in a corresponding knowledge base that includes: (α) the most common types of applied statistical problems and methods for their solution; (β) a methodology for identifying the real problem under consideration (from the analyzed application field) in terms of the type (or types) of statistical problems it belongs to; (γ) recommendations on how to determine the optimal technological chain of processing modules of the system for a real problem (according to its passport); (δ) recommendations on how to interpret intermediate and final results of computations; (ε) promptings on how to avoid the most typical “traps” occurring during applied statistical analysis. The methodology is to a considerable extent based on methods for classification and pattern recognition, as well as a special concept of the “passport of the problem”.

The proposed ideas were first sketched in [1] and developed in [2]. Related problems of intellectualization of statistical software were discussed in [3].

This paper presents some results of the Canadian-Russian SBIT-project funded by Prof. S.Wamer, York University, North York, Canada, and of the project funded by Russian-American JV DIALOGUE.

S. Aivazian

### Conceptual Models for the Choice of a Statistical Analysis Method; A Comparison Between Experts and Non-Experts

Based on research on the conceptual model of expert statisticians for the choice of a statistical analysis method, a computerized statistical support system has been designed to assist making this choice. In order to better guarantee the usability of this support system, a study is designed to determine the differences between conceptual models of experts and non-experts for this task. The process of choosing an analysis method is studied by observing and questioning intended users. The resulting non-experts’ model is used to design a usable support system.

Peter Diesveld, Gerda M. van den Berg

### Toward a Formalised Meta-Data Concept

Meta-data, or information about data, are considered useful or even essential for statistical data processing. We believe that a formalised meta-data concept would be extremely useful for communication of data and for (partially) automated processing, and that it is a prerequisite for reasoning about statistical analysis. This paper presents an approach to formalising meta-data. It is based on analysis methods from software engineering. One may perform a functional analysis using hierarchical data flow diagramming to obtain a rough outline and structuring of what needs to be included in the meta-data concept. Subsequent object analysis using a data modelling technique similar to Entity-Relationship diagramming provides an unambiguous and almost formal definition of the meta-data concept.

Ed. de Feber, Paul de Greef

### Semantic Metadata: Query Processing and Data Aggregation

This paper outlines an approach to modelling semantic meta-data such that statistical query processing is supported effectively. The model covers two types of queries: checking the availability of data, and deriving formally specified target tables of aggregate data.

K. A. Froeschl

### A System for Production and Analysis of Statistical Reports

Specialists in surveys that involve large-scale data collection, such as panels and, specifically, household surveys, are convinced that the existing procedures for producing statistics are too costly and too time-consuming (over three years of production time). The tools presented here are part of a wider project, CASIP, “A complete automated system for information processing in family budget research” (DOSES-B1), whose main aim is to allow the automatic production of the statistical tables usually produced by statistical offices within a few hours after data collection is finished [Sari92].

Carina Gibert, Manuel Marti-Recober

### MetaInformation — Production, Transfer, Consumption, Maintenance in Official Statistics

In recent years, new techniques and technologies have changed not only the production process in the Statistical Offices but also the design of surveys, the dissemination of much more diversified statistical products and, last but not least, the consumption of statistics. Characteristic of these changes is, among other things, the extended use of meta-data and meta-information. They are both a result of the application of new procedures and a powerful tool for the permanent adaptation of official statistics to new needs. This is reflected both in practical experience and in R & D, with new tendencies in meta-data handling.

K. Neumann

### An Experimental System for Navigating Statistical Meta-Information — The Meta-Stat Navigator

In research organizations handling statistical information, the volume of stored information resources, including research results, materials, and software, is increasing to the point that conventional separate databases and information management systems have become insufficient to deal with the amount. Increasing diversification in the media used these days interferes with the rapid retrieval and use of the information needed by users. A new system that realizes a presentation environment based on new concepts is needed to inform potential users of the value and effectiveness of using the vast amount of diverse data.

Noboru Ohsumi

### Data and Meta-Data in the Blaise System

Handling data and meta-data is a cumbersome and time-consuming part of survey processing. Version 3 of the Blaise system will offer features to facilitate this process. An extended version of the Blaise language serves as the heart of the process. Tools enable the user to manipulate the data and the meta-data in an easy and straightforward way throughout the survey process.

Maarten Schuerhoff

### Sample Size Estimation for Several Trend Tests in the k-Sample Problem

Sample size estimations for trend tests on proportions and continuous endpoints (both all-pair and many-to-one) are described. Examples of estimated sample sizes were computed for these trend tests using a PC-based program and are presented graphically. Moreover, using the concept of asymptotic efficiency, the sample sizes were compared under asymptotic and finite situations.

L. Hothorn

### A Generic Schema for Survey Processing at IBGE

The INSTITUTO BRASILEIRO DE GEOGRAFIA E ESTATÍSTICA (IBGE) is responsible for all kinds of censuses (economic and demographic) and for several continuous (monthly and annual) surveys. The enormous responsibility of disseminating the available data to society takes much time and sometimes a considerable amount of money. IBGE coordinates the National Statistics System, which covers data collection and dissemination for social, industrial, trade, services, and agricultural statistics.

Aluizio Pimentel Guedes, Mauro Sergio dos Santos Cabral

### A Tool for the Automatic Generation of Data Editing and Imputation Application for Surveys Processing

In this paper we describe an Automatic Generator of Data Editing and Imputation Applications called CRIPTA, developed at IBGE. It goes a long way toward changing how the computer is used: it gives more powerful assistance to the subject-matter specialist, changes the role of the specialist (end user), and provides powerful automatic tools that are not tailor-made to perform the editing and imputation functions, thereby advancing automation.

Reina Marta Hanono, Dulce Maria Rocha Barbosa

### Tools for Data Entry

This paper deals with the input of data, mainly for statistical applications. The importance of correct data for analysis is beyond dispute. It is also well known that many things can go wrong between the collection of the data and the analysis, so checking the data is a very important topic. It is stressed that checking and correcting the data should be done during data input and not during the analysis stage. Another interesting aspect is the possibility of automatically inserting values when a certain condition is met.

D. M. van der Sluis
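
The two ideas in this abstract, checking values at input time and automatically inserting a value when a condition is met, can be illustrated with a minimal sketch. It is not taken from the paper: the field names (`age`, `employment`), the age limits and the fill-in rule are all hypothetical and chosen only for illustration.

```python
def check_record(record):
    """Check one hypothetical survey record at entry time.

    Returns a (cleaned_record, errors) pair. All field names and rules
    here are illustrative assumptions, not the paper's own.
    """
    cleaned = dict(record)  # work on a copy so the caller's input is untouched
    errors = []

    age = cleaned.get("age")
    if not isinstance(age, int) or not 0 <= age <= 120:
        # Range check performed during input, not at the analysis stage.
        errors.append("age must be an integer between 0 and 120")
    elif age < 15 and cleaned.get("employment") is None:
        # Automatic insertion: the condition (a young respondent with no
        # employment answer) triggers a default value.
        cleaned["employment"] = "not applicable"

    return cleaned, errors


cleaned, errors = check_record({"age": 9, "employment": None})
print(cleaned, errors)
```

Returning the error list alongside the cleaned record lets an entry screen reject or flag the record immediately, while it can still be corrected at the source.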

### StEM: A Deductive Query Processor for Statistical Databases

Statistical data are complex data structures that traditional data models cannot adequately represent. This complexity carries over directly to statistical data manipulations, forcing the user to specify detailed algorithms at great effort. A high-level language would be desirable, in which the user specifies only the starting data and the logical structure of the result. Such conceptual interaction requires an automatic mapping that transforms each high-level operator into a number of elementary operations.

This paper describes a statistical data model that is suitable for representing the complexity of statistical data structures and for processing by high-level operators. This is followed by the description of a knowledge-based mechanism for deducing the elementary operations underlying a high-level operator. Finally, a deductive query processor, StEM, is demonstrated, which makes complex data manipulations transparent by using both the above data model and the deduction mechanism.

Carla Basili, Leonardo Meo-Evoli

### VIEWS: A Portable Statistical Database System

This paper describes a statistical database system primarily intended for data dissemination purposes. The database is meant to be made physically available to users in the form of a high-density diskette, a removable hard disk, or an optical disk. To this end, the carrier contains all necessary data plus the retrieval system. The database provides information equivalent to that available from printed publications, but improves on them in portability, cost, flexibility, transferability, and graphic capabilities.

A. L. Dekker

### Database Management for Notifiable Diseases at the Federal Health Office of Germany

An application of a hierarchical database structure in epidemiology, which has been developed at the Federal Health Office of Germany and in which techniques of data aggregation have been used primarily, is described in Chapter 2.

Joachim Eichberg

### Sampling Accounting Populations: A Comparison of Monetary Unit Sampling and Sieve Sampling in Substantive Auditing

A population of debtors in an Irish company is audited and the results are analyzed to establish the error rates and error patterns. Audit populations are created based on the error patterns found in the sample audit. Simulation studies are carried out, using monetary unit sampling and sieve sampling, to examine the comparative performance of an estimator of total error amount in monetary unit sampling and sieve sampling. The empirical results indicate that sieve sampling is more efficient than simple random sampling for most sample sizes in all of the populations studied and is more efficient than systematic sampling for the larger sample sizes.

Jane M. Horgan
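
For readers unfamiliar with the two selection schemes compared in this abstract, the following is a minimal sketch of how each is commonly defined. It is not the author's simulation code: the book values, the sampling interval and the function names are assumptions made for illustration, and monetary unit sampling is shown only in its systematic variant.

```python
import random


def systematic_mus(book_values, interval, rng):
    """Systematic monetary unit sampling (illustrative sketch).

    Every `interval`-th monetary unit after a random start is selected;
    the function returns the indices of the items containing a selected unit.
    """
    next_hit = rng.random() * interval  # random start in [0, interval)
    cumulative = 0.0
    selected = []
    for i, value in enumerate(book_values):
        cumulative += value
        if next_hit < cumulative:
            selected.append(i)
            # Skip any further hits that fall inside the same item.
            while next_hit < cumulative:
                next_hit += interval
    return selected


def sieve_sample(book_values, interval, rng):
    """Sieve sampling (illustrative sketch).

    Each item gets its own uniform draw on [0, interval) and is selected
    when its book value exceeds that draw, so the sample size is random.
    """
    return [i for i, value in enumerate(book_values)
            if rng.random() * interval < value]


# Hypothetical debtor balances and sampling interval.
balances = [50, 1200, 30, 700, 10, 2500]
rng = random.Random(0)
print(systematic_mus(balances, 1000, rng))
print(sieve_sample(balances, 1000, rng))
```

Under both schemes an item whose book value is at least the sampling interval is certain to be selected. The schemes differ in that systematic selection ties all items to one random start, whereas sieve sampling treats each item independently.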

### Backmatter
