
About this book

The 13th Symposium on the Interface continued this series after a one-year pause. The objective of these symposia is to provide a forum for the interchange of ideas of common concern to computer scientists and statisticians. The sessions of the 13th Symposium were held in the Pittsburgh Hilton Hotel, Gateway Center, Pittsburgh. Following established custom, the 13th Symposium had organized workshops on various topics of interest to participants. The workshop format allowed the invited speakers to present their material variously as formal talks, tutorial sessions and open discussion. The Symposium schedule was also the customary one. Registration opened in the late afternoon of March 11, 1981 and continued during the opening mixer held that evening. The formal opening of the Symposium was on the morning of March 12. The opening remarks were followed by Bradley Efron's address "Statistical Theory and the Computer." The rest of the daily schedule was three concurrent workshops in the morning and three in the afternoon, with contributed poster sessions during the noon break. Additionally, there were several commercial displays and guided tours of Carnegie-Mellon University's Computer Center, Computer Science research facilities, and Robotics Institute.



Keynote Address


Statistical Theory and the Computer

Everyone here knows that the modern computer has profoundly changed statistical practice. The effect upon statistical theory is less obvious. Typical data analyses still rely, in the main, upon ideas developed fifty years ago. Inevitably though, new technical capabilities inspire new ideas. Efron, 1979B, describes a variety of current theoretical topics which depend upon the existence of cheap and fast computation: the jackknife, the bootstrap, cross-validation, robust estimation, the EM algorithm, and Cox’s likelihood function for censored data.

Bradley Efron, Gail Gong

Automated Edit and Imputation


Developing an Edit System for Industry Statistics

All survey data, in particular economic data, must be validated for consistency and reasonableness. The various fields, such as value of shipments, salary and wages, total employment, etc., are compared against one another to determine if one or more of them have aberrant values. These comparisons are typically expressed as so-called ratio edits and balance tests. For example, historical evidence indicates that the ratio of salary and wages to total number of employees in a particular industry usually lies between two prescribed bounds. Balance tests verify that a total equals the sum of its parts. When a data record fails one or more edits, at least one response item is subject to adjustment, and the revised record should pass all edits. An edit system being developed at the Census Bureau has as its core a mathematically based procedure to locate the minimal number of fields to impute and then to make imputations in the selected fields. Subject-matter expertise will be called upon to enhance the performance of the core edit, especially in the area of imputation. Among the factors that will be brought to bear are patterns of response error, relative reliability of the fields, and varying reliability of the edits.

Brian Greenberg
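The ratio edits and balance tests described in the abstract above can be illustrated with a minimal Python sketch. The field names, bounds, and the classes `RatioEdit` and `BalanceEdit` are hypothetical and chosen only for illustration; this is not the Census Bureau system, and it only detects failed edits. It does not attempt the harder step of locating the minimal set of fields to impute.

```python
from dataclasses import dataclass


@dataclass
class RatioEdit:
    """Passes when lower <= numerator / denominator <= upper."""
    numerator: str
    denominator: str
    lower: float
    upper: float

    def passes(self, record):
        denom = record[self.denominator]
        if denom == 0:
            return False  # ratio undefined, so treat as an edit failure
        return self.lower <= record[self.numerator] / denom <= self.upper


@dataclass
class BalanceEdit:
    """Passes when the total equals the sum of its parts (within tolerance)."""
    total: str
    parts: tuple
    tolerance: float = 0.0

    def passes(self, record):
        parts_sum = sum(record[p] for p in self.parts)
        return abs(record[self.total] - parts_sum) <= self.tolerance


def failed_edits(record, edits):
    """Return the edits that the record fails."""
    return [e for e in edits if not e.passes(record)]


# Hypothetical edits: bounds and field names are assumptions for illustration.
edits = [
    RatioEdit("wages", "employees", lower=5.0, upper=60.0),
    BalanceEdit("employees", ("production_workers", "other_employees")),
]

good = {"wages": 1200.0, "employees": 50,
        "production_workers": 35, "other_employees": 15}
bad = dict(good, employees=5)  # a likely keying error in one field
```

Here `failed_edits(good, edits)` is empty, while the single altered field in `bad` causes both edits to fail, which is the situation in which a localization procedure would have to decide which field to adjust.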

Design of Experiments to Investigate Joint Distributions in Microanalytic Simulations

The heart of a microsimulation model is the microdata on which it operates. Often, a single survey is insufficient to meet a model’s informational needs and two separate survey files are merged. This paper focuses on the statistical problems accompanying the file merge process. Empirical evidence of data distortions induced by some procedures is presented as is an experimental design to investigate the impact of various merging methods on microdata statistics.

Richard S. Barr

Do Statistical Packages Have a Future?


The Effect of Personal Computers on Statistical Practice

Statistical computer packages for large computers, both in batch and interactive environments, have focused on traditional statistical analyses and the computations associated with them. With few exceptions, statistical software development has emphasized numerical computations and data management, and few statistical procedures have been designed specifically for use on computing machines. Personal computers offer new and fundamentally different computing tools around which new methods for exploratory data analysis can be developed. These methods make essential use of the personal machine’s capabilities for data presentation and display.

Ronald A. Thisted

Statistical Software Design and Scientific Breakthrough: Some Reflections on the Future

Data analysis software has played a major role in the recent evolution of certain types of scientific disciplines which are characterized by weak theory and intercorrelated independent variables. The evolution of these fields of inquiry has depended as much upon data analysis packages for their progress as astronomy has upon the telescope, cellular biology the microscope, and particle physics the accelerator. Three new developments in the capabilities and organization of these software packages are pending or will emerge in the foreseeable future, and are discussed in terms of their potential impact on accelerated scientific discovery in the fields of inquiry that such software packages serve. They are: research-oriented graphics, true conversational analysis, and voice controlled software. These developments may help produce a revolution in scientific insight in a number of disparate fields.

Norman H. Nie

Some Thoughts on Expert Software

Current successes in making computers available for data analysis will intensify the challenge to make them more useful. Cheaper, smaller and more reliable hardware will make computers available to many new users with data to analyse: few of them will have much statistical training or routine access to professional statisticians. We have a moral obligation to guide such users to good, helpful and defensible analysis. The best of current statistical systems allow users to produce many statistical summaries, graphical displays and other aids to data analysis. We have done relatively well in providing the mechanics of the analysis. Can we now go on to the strategy? This paper examines expert software for data analysis; that is, software which tries to perform some of the functions of a statistician consulting with a client in the analysis of data. Is such software needed? Is it possible? What should it do? What should it not do? How should it be organized? Some tentative answers are proposed, with the hope of stimulating discussion and research. A comparison with expert software for other applications is made. A recent experiment at Bell Laboratories is used as an example.

John M. Chambers

Fourier Transforms in Applied Statistics


How Fast is the Fourier Transform?

The average running time for several FFT algorithms is analyzed. The Cooley-Tukey algorithm is shown to require about n^1.61 operations. The chirp algorithm always works in O(n log n) operations. Examples are given to show that padding by zeros to the nearest power of 2 can lead to real distortions.

Persi Diaconis
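One way zero padding can alter a spectrum is sketched below in Python. This is an illustration constructed for this listing, not one of the examples from the paper: a cosine that completes exactly one cycle in 12 samples is transformed once at its natural length and once after padding to 16 points. A direct O(n^2) transform is used so that the sketch needs only the standard library; the helper names `dft` and `active_bins` are assumptions.

```python
import cmath
import math


def dft(x):
    """Direct O(n^2) discrete Fourier transform; adequate for tiny inputs."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def active_bins(x, tol=1e-9):
    """Indices of DFT coefficients whose magnitude exceeds tol."""
    return [k for k, c in enumerate(dft(x)) if abs(c) > tol]


n = 12
signal = [math.cos(2 * math.pi * t / n) for t in range(n)]  # one full cycle
padded = signal + [0.0] * (16 - n)  # pad with zeros to the next power of 2

bins_exact = active_bins(signal)    # energy sits in two conjugate bins only
bins_padded = active_bins(padded)   # energy leaks into many bins
```

At length 12 the cosine occupies only bins 1 and 11. After padding, its frequency no longer falls on a grid point of the 16-point transform, so the energy spreads across many coefficients.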

Polynomial Time Algorithms for Obtaining Permutation Distributions

Polynomial time algorithms are presented for finding the permutation distribution of any statistic which is a linear combination of some function of either the original observations or the ranks. The algorithms require polynomial time as opposed to complete enumeration algorithms which require exponential time. This savings is effected by first calculating and then inverting the characteristic function of the statistic.

Marcello Pagano, David L. Tritchler

Efficient Estimation for the Stable Laws

This paper is concerned with Fourier procedures in inference which admit arbitrarily high asymptotic efficiency. The problem of estimation for the stable laws is treated by two different approaches. The first involves FFT inversion of the characteristic function. A detailed discussion is given of truncation and discretization effects with reference to the special structure of the stable densities. Some further results are also given concerning a second approach based on the empirical characteristic function (ecf). Finally we sketch an application of this method to testing for independence, and also present a stationary version of the ecf.

Andrey Feuerverger, Philip McDunnough

Algorithms and Statistics


Applications of Statistics to Applied Algorithm Design

The field of Applied Algorithm Design is concerned with applying the results and techniques of Analysis of Algorithms to the real problems faced by practitioners of computing. In this paper we will study the applications of probability and statistics to that endeavor from two viewpoints. First, we will study a general methodology for building efficient programs that employs the tools of data analysis and statistical inference, probabilistic analysis of algorithms, and simulation. Second, we will see how these techniques are used in a detailed study of an application involving the Traveling Salesman Problem, and in a brief overview of several other applications.

Jon Louis Bentley

Algorithms with Random Input

Randomness arises in connection with algorithms and their analysis in at least two different ways. Some algorithms (sometimes called coin-flipping algorithms) provide their own randomness, perhaps through the use of random number generators. Sometimes, though, it is useful to analyze the performance of a deterministic algorithm under some assumption about the distribution of inputs. We briefly survey some work which gives a perspective on such problems. Next we discuss some of the techniques which are useful when carrying out this type of analysis. Finally, we briefly discuss the problem of choosing an appropriate distribution.

George S. Lueker

Recent Results on the Average Time Behavior of Some Algorithms in Computational Geometry

We give a brief inexhaustive survey of recent results that can be helpful in the average time analysis of algorithms in computational geometry. Most fast average time algorithms use one of three principles: bucketing, divide-and-conquer (merging), or quick elimination (throw-away). To illustrate the different points, the convex hull problem is taken as our prototype problem. We also discuss searching, sorting, finding the Voronoi diagram and the minimal spanning tree, identifying the set of maximal vectors, and determining the diameter of a set and the minimum covering sphere.

Luc Devroye

Pattern Recognition


Applications of Pattern Recognition Methods to Hydrologic Time Series

The paper surveys hydrologic studies by the speaker and others in which pattern recognition concepts play a prominent role. Data sets gathered from measurements of riverflow and water table pressure heads motivate relatively delicate statistical questions. Techniques such as cluster analysis, feature extraction, and non-parametric regression are ingredients of the state-of-the-art solutions to these questions.

Sidney Yakowitz

Recent Advances in Bump Hunting

For speeding up the algorithm for the method of maximum penalized likelihood it is tempting to try to make use of the Fast Fourier Transform (FFT). This can be done by circularizing the data and making the x axis discrete. For circular data the circularization is of course unnecessary. Other methods are discussed and some comparisons are made. For multivariate data one could use a multidimensional FFT by putting the data on a torus. Apart from circularization or toroidalization one can speed up the estimation of the hyperparameter by repeatedly doubling or otherwise increasing the fineness of the grid. Roughness penalties of the form $$\beta \int \left\{ \left[ f^{\xi} \right]'' \right\}^{2} \, dx,$$ where f(x) is the density function, are also considered, where ξ is not necessarily ½ or 1, and corresponding algorithms are suggested. For the scattering data considered in previous work we have compared the results for ξ = ½ and ξ = 1 and find that the estimates of the bumps are almost the same but two of the former bumps have each been split into a pair.

I. J. Good, M. L. Deaton

Pattern Recognition in the Context of an Asbestos Cancer Threshold Study

Five distinct stages of a pattern recognition problem are discussed in the context of a study of drinking water asbestos health effects. A series of nonparametric regression procedures is used to examine the possibility of a threshold in the relationship between asbestos in drinking water and cancer. Evidence is presented which suggests that rather than there being a threshold, the dose response curve E(Y|X) is linear. It is also shown that by choosing a large value (in comparison to earlier studies) for the 125 location parameter of the log observed-to-expected response variable, the overall significance level of the asbestos in drinking water-cancer association is greatly increased and the resolution of the threshold pattern recognition procedure is greatly enhanced.

Michael E. Tarter

Volume Testing of Statistical Programs


Volume Testing of Statistical/Database Software

By volume testing is meant assessing the ability of systems to manipulate data with large values for the width, length, and depth dimensions. The depth dimension is used as a measure of the complexity of the data relationships in a nonplanar data collection. Several carefully designed problems for complex data manipulation by statistical and database systems are presented. These problems are referenced in subsequent papers in this volume on complex data manipulation capabilities of the major statistical/database systems in use today.

Robert F. Teitel

Scientific Information Retrieval (SIR/DBMS)

SIR/DBMS is a database management system that has been geared to the unique needs of the research community. The data definition commands in SIR/DBMS are patterned after the well-known statistical package SPSS, and SIR/DBMS interfaces directly (through the creation of system files) to SPSS, BMDP and any other system that can read SPSS system files, such as SAS and P-STAT. SIR/DBMS can easily handle complex hierarchical and network data structures.

Barry N. Robinson

Volume Testing of Statistical Software -- The Statistical Analysis System (SAS)

Solutions for two complex file management problems are proposed using the Statistical Analysis System (SAS). SAS is an integrated system for data management, statistical analysis, data reduction and summarization, color graphics, and report writing. Several fundamental concepts of SAS are reviewed and four methods of solution are suggested. Detailed descriptions of each of the problem solutions are presented, including the input/output volume at each stage (a reasonable performance metric for comparing similar systems). Comparisons among the four methods are discussed, and program listings for each solution are included.

Arnold W. Bragg

Solving Complex Database Problems in P-STAT

At the 13th Interface, six different packages submitted their solutions to four problems designed by Robert F. Teitel — two each for two different data bases. P-STAT proved to be very well suited for handling these problems. The tasks involving the first data base required only a single pass through the system data file even in the standard version of P-STAT. The problems involving the second data base were also easy to do in standard P-STAT even though they required several passes of the file. However, these problems could also be done in a single pass when the file was built using P-STAT’s new P-RADE database enhancement.

Roald Buhler, Shirrell Buhler

Volume Testing of SPSS

Four complex file-handling problems proposed by Robert Teitel are posed in an accompanying paper. A solution to one using the SPSS Batch System is presented and the associated costs are detailed. Control language to solve all four problems using the SPSS-X system, the follow-on system to SPSS presently under development, is also presented, along with estimates of the input/output volume required.

Jonathan B. Fry

Volume Testing of Statistical Systems

This paper outlines the steps taken to solve data management and analysis problems based on two complex files. One file, TRIPS, contained four groups of variables in a three-level hierarchical structure resembling that of the 1972 National Travel Survey. The second file, PERSONS, contained records in which link variables defined relationships to other records in the file. Both problems were solved using the current implementation of OSIRIS at the University of Michigan. The solution to the problems based on the TRIPS dataset involved use of OSIRIS structured file creation and retrieval procedures. The solutions to the problems based on the PERSONS dataset were based on the use of OSIRIS sort and merge procedures. OSIRIS instructions and the cross-tabulations that resulted are presented for all four problem segments.

Pauline R. Nagara, Michael A. Nolte

Random Number Generation


In Search of Correlation in Multiplicative Congruential Generators with Modulus 2^31 - 1

This paper describes an empirical search for correlation in sample sequences produced by 16 multiplicative congruential random number generators with modulus 2^31 - 1. Each generator has a distinct multiplier. One multiplier is in common use in the LLRANDOM and IMSL random generation packages as well as in APL and SIMPL/1. A second is used in SIMSCRIPT II. Six multipliers were taken from a recent study that showed them to have the best spectral and lattice test properties among 50 multipliers considered. The last eight multipliers had the poorest spectral and lattice test properties for 2-tuples among the 50. A well known poor generator, RANDU, with modulus 2^31, was also tested to provide a benchmark for evaluating the empirical testing procedure. A comprehensive analysis based on test statistics derived from cumulative periodograms computed for each multiplier for each of 512 independent replications of 16384 observations each showed evidence of excess high frequency variation in two multipliers and excess midrange frequency variation in three others, including RANDU. Also evidence exists for a bimodal spectral density function for yet another multiplier. An examination of the test results showed that the empirical evidence of a departure from independence did not significantly favor the eight poorest multipliers. This observation is in agreement with a similar observation made by the authors in an earlier study of these multipliers that principally concentrated on their distributional properties in one, two and three dimensions. This consistency raises some doubt as to how one should interpret the results of the spectral and lattice tests for a multiplier. Also, the three multipliers considered superior in the earlier study maintain that position in the current study.

George S. Fishman, Louis R. Moore
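For readers unfamiliar with the generators studied, a multiplicative congruential generator can be sketched in a few lines of Python. The abstract does not list the sixteen multipliers tested, so the default multiplier 16807 below is an assumption: it is a widely cited choice for modulus 2^31 - 1. RANDU's multiplier 65539 with modulus 2^31 is its standard definition. The function names are illustrative only.

```python
from itertools import islice

M31 = 2**31 - 1  # the prime modulus named in the abstract


def lehmer(seed, multiplier=16807, modulus=M31):
    """Yield successive states of x <- multiplier * x mod modulus.

    The default multiplier is an assumed, commonly cited value; the
    abstract does not say which multipliers were examined.
    """
    x = seed
    while True:
        x = (multiplier * x) % modulus
        yield x


def randu(seed):
    """RANDU: multiplier 65539 = 2**16 + 3, modulus 2**31."""
    return lehmer(seed, multiplier=65539, modulus=2**31)


def take(gen, n):
    """Return the first n values of a generator as a list."""
    return list(islice(gen, n))


def randu_is_linearly_dependent(xs):
    """Check x[k+2] == 6*x[k+1] - 9*x[k] (mod 2**31) for every triple."""
    return all(
        (6 * xs[k + 1] - 9 * xs[k]) % 2**31 == xs[k + 2]
        for k in range(len(xs) - 2)
    )
```

Because 65539^2 is congruent to 6 * 65539 - 9 modulo 2^31, every three consecutive RANDU values satisfy an exact linear relation, which is one reason it serves as a known poor benchmark. The helper `randu_is_linearly_dependent(take(randu(1), 50))` confirms this relation on a short run.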

Portability Considerations for Random Number Generators

With increasing use being made of computer-generated random samples, questions concerning the quality of the random number generators become increasingly important. Various studies have been made of the quality of basic generators and some generators have been identified as fairly reliable. As with any computer software, however, it is necessary to know more than that the method or “program” is good. It is also necessary to know that the program will perform well in each computer-compiler environment in which it is to be executed. Portability of a program within a class of environments means that the program will run correctly in each environment. A portable program for random number generation that has been found to produce high quality results in one computer-compiler environment will be just as reliable in other environments, and hence portability allows for efficiency in testing of random number generators. Another desirable aspect of portability is that Monte Carlo studies performed with portable random number generators are reproducible elsewhere, and results of studies by one researcher are more easily extensible by another researcher. Various computer-compiler characteristics relevant to the question of portability of U(0,1) pseudorandom generators and the necessary programming considerations are discussed.

James E. Gentle

Understanding Time Series Analysis


Some Recent Developments in Spectrum and Harmonic Analysis

In estimating the spectrum of a stationary time series from a finite sample of the process two problems have traditionally been dominant: first, what algorithm should be used so that the resulting estimate is not severely biased; and second, how should one “smooth” the estimate so that the results are consistent and statistically significant. Within the class of spectrum estimation procedures that have been found successful in the various engineering problems considered, bias control is achieved by iterative model formation and prewhitening combined with robust procedures (the “robust filter”), while “smoothing” is done by an adaptive nonlinear method. Recently a method has been found which, by using a “local” principal components expansion to estimate the spectrum, provides new solutions to both the bias and smoothing problems and also permits a unification of the differences between windowed and unwindowed philosophies. This estimate, which is an approximate solution of an integral equation, consists of a weighted average of a series of direct spectrum estimates made using discrete prolate spheroidal sequences as orthogonal data windows.

David J. Thomson

On Some Numerical Properties of ARMA Parameter Estimation Procedures

This paper reviews the algorithms used by statisticians for obtaining efficient estimators of the parameters of a univariate autoregressive moving average (ARMA) time series. The connection of the estimation problem with the problem of prediction is investigated with particular emphasis on the Kalman filter and modified Cholesky decomposition algorithms. A result from prediction theory is given which provides a significant reduction in the computations needed in Ansley’s (1979) estimation procedure. Finally it is pointed out that there are many useful facts in the literature of control theory that need to be investigated by statisticians interested in estimation and prediction problems in linear time series models.

H. Joseph Newton

Time Series Recursions and Self-Tuning Control

Recursive estimates are estimates (of parameters in a time series model) that are computed in a sequential fashion (i.e. updated quickly as new observations become available). The uses of such “real” time parameter estimators include real-time forecasting and self-tuning control. Here it is shown how “real” time parameter estimators can be constructed for time series models; also an heuristic discussion of their convergence behavior is given. The analysis and synthesis of self-tuning controllers is also discussed.

Victor Solo
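The idea of a recursive estimate, updated as each observation arrives, can be illustrated with a minimal Python sketch of scalar recursive least squares for a first-order autoregression y[t] = phi * y[t-1] + noise. This is a textbook-style illustration under stated assumptions, not the estimators constructed in the paper; the function name `recursive_ar1` and the initial value `p0` are hypothetical choices.

```python
def recursive_ar1(series, p0=1e6):
    """Recursively estimate phi in y[t] = phi * y[t-1] + noise.

    p0 is an assumed large initial value expressing weak prior knowledge
    of phi. Returns the estimate after each new observation.
    """
    phi, p = 0.0, p0
    estimates = []
    for prev, cur in zip(series, series[1:]):
        # Shrink p as information about phi accumulates.
        p = p / (1.0 + prev * prev * p)
        # Correct the estimate in proportion to the one-step prediction error.
        phi = phi + p * prev * (cur - prev * phi)
        estimates.append(phi)
    return estimates


# Noise-free example: y[t] = 0.5 ** t follows the model with phi = 0.5.
series = [0.5 ** t for t in range(10)]
estimates = recursive_ar1(series)
```

Each update costs a fixed number of arithmetic operations regardless of how much data has been seen, which is what makes such schemes usable for real-time forecasting. With a starting estimate of zero, the final value equals the ridge-type batch solution sum(x*y) / (sum(x*x) + 1/p0).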

Measurement and Evaluation of Software


On Measurement and Evaluation of Software: A View from the Chair

The purpose of this session was to identify problem areas in the measurement and evaluation of software that require close collaboration between computer professionals and statisticians.

Herman P. Friedman

Can Statistical Methods Help Solve Problems in Software Measurement?

The study of software metrics involves the creation and analysis of quantitative indices of merit which can be assigned to software, either existing or proposed. These measurements of software provide informational aids to be used in making software lifecycle decisions. A panel commissioned to analyze and evaluate the problems in the emerging field of software metrics has recently issued a report. An overview of the panel’s findings, including how statisticians might be of help in solving some problems in software measurement, is presented.

Frederick G. Sayward

Measuring the Performance of Computer Software — A Dual Role for Statisticians

In describing and modelling the properties of any system, science traditionally uses quantitative measures. The emerging science of software is no exception. Statisticians, traditionally the arbiters of evidential inference in science, can play an important role in defining quantitative measures (metrics) of software performance and the equally important quantitative measures of the computational difficulty of problems to be solved. Computer Science, in embracing empirical research in software, must now consult the discipline of experimental design. When the software is statistical in application, statisticians are also in the role of beneficiaries of this scientific study of software. As users they should insist that evaluations of performance measure not only the usual completion times of workloads on certain machines, but also the accuracy of computed solutions and the usefulness of the output. This paper describes a classification system for statistical software which is based on quantitative measures drawn from the “life-cycle” of a complete statistical analysis: file building, editing, data display, exploration, model building.

Ivor Francis

Software Metrics: Paradigms and Processes

Computer science and software engineering have no precise, well understood, standardized and accepted metrics. In this paper problems in obtaining software metrics will be discussed, and the relationship to statistics will be emphasized. Important advances in software metrics could come about through the synergistic relationship between statistics and software engineering. A significant challenge to software metrics researchers is to stay within the traditional scientific paradigm of hypothesis, evaluation, criticism and review in the face of intense demands for software metrics.

Marvin Denicoff, Robert Grafton

A Proposal for Structural Models of Software Systems

A software system is a collection of information. A structural model for software systems relates the information content of the system to perhaps abstract metrics characterizing the software system. It is proposed that one basis for structural models of software systems is the data models of data base management technology. This paper describes software systems from the viewpoint of defining a structural model as a schema in a data model. A proposed representation for some aspects of software systems is sketched.

J. C. Browne

Software Metrics: A Key to Improved Software Development Management

This paper describes some of the potential for applying software metrics to the management of the software development process. It also considers some of the practical difficulties one typically faces in evolving and validating a software metric. One difficulty is the collection of baseline data in the real world of software production in which controlled experiments typically are not possible. The results of some recent quantitative ‘metrics’ investigations are presented and their practical implications for software estimation and control are cited. These investigations are thought to be representative of the process of evaluating software data not obtained under ‘controlled’ conditions such as is typically the situation in the natural science laboratory.

J. E. Gaffney

Orthogonalization: An Alternative to Sweep


Orthogonalization-Triangularization Methods in Statistical Computations

Procedures for reducing a data matrix to triangular form using orthogonal transformations (e.g., Householder, Givens) are presented, and through examples these procedures are compared to procedures operating on the normal equations. We show how an analysis of variance can be constructed from the triangular reduction of the data matrix. Procedures for calculating sums of squares, degrees of freedom, and expected mean squares are presented. These procedures apply even with mixed models and missing data. It is demonstrated that all statistics needed for inference on linear combinations of parameters of a linear model may be calculated from the triangular reduction of the data matrix. Also included is a test for estimability. We also demonstrate that if the computations are done properly some inference is warranted even when the X matrix is ill-conditioned.

Del T. Scott, G. Rex Bryce, David M. Allen
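A minimal Python sketch of triangular reduction by Givens rotations is given below, assuming a full-rank design matrix. It processes the augmented rows [X | y] one at a time, then reads the least squares coefficients and the residual sum of squares from the triangular factor without ever forming the normal equations. The function names are hypothetical, and this illustrates the general technique rather than the authors' procedures.

```python
import math


def givens_triangularize(rows, ncols):
    """Rotate data rows, one at a time, into an upper-triangular array."""
    R = [[0.0] * ncols for _ in range(ncols)]
    for row in rows:
        v = [float(a) for a in row]
        for j in range(ncols):
            if v[j] == 0.0:
                continue  # nothing to eliminate in this column
            r = math.hypot(R[j][j], v[j])
            c, s = R[j][j] / r, v[j] / r
            # Apply the rotation to row j of R and the incoming row v.
            for k in range(j, ncols):
                rjk, vk = R[j][k], v[k]
                R[j][k] = c * rjk + s * vk
                v[k] = -s * rjk + c * vk
    return R


def least_squares(X, y):
    """Return (coefficients, residual sum of squares); assumes full rank."""
    p = len(X[0])
    R = givens_triangularize([list(x) + [yi] for x, yi in zip(X, y)], p + 1)
    beta = [0.0] * p
    for i in reversed(range(p)):  # back-substitution on the triangle
        acc = R[i][p] - sum(R[i][k] * beta[k] for k in range(i + 1, p))
        beta[i] = acc / R[i][i]
    rss = R[p][p] ** 2  # last diagonal element carries the residual norm
    return beta, rss


X = [[1, 0], [1, 1], [1, 2], [1, 3]]  # intercept and one regressor
y = [0, 1, 1, 2]
beta, rss = least_squares(X, y)
```

For this small data set the fitted intercept and slope are 0.1 and 0.6, and the residual sum of squares is 0.2. Because rows are absorbed one at a time, the full data matrix never needs to be held in memory.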

Research Data Base Management


Simple Query Language Requirements for Research Data Management

Increasingly, researchers are managing large and complex data sets with existing packaged software systems designed for use in the research environment. For example, varying degrees of data management capabilities are found in BMDP, P-STAT, SAS, SIR, and SPSS. These systems provide data input, statistical analysis, displays, tables, reports, etc. primarily through the use of built-in procedures, or a high-level retrieval language. Also, with the exception of P-STAT and SIR, these systems were primarily designed for batch usage.

Gary D. Anderson

Data Editing on Large Data Sets

The process of analyzing large data sets often includes an early exploratory stage, first to develop a basic understanding of the data and its interrelationships, and second to prepare and clean up the data for hypothesis formulation and testing. This preliminary phase of the data analysis process usually requires facilities found in research data management systems, text editors, graphics packages, and statistics packages. Also this process usually requires the analyst to write special programs to clean up and prepare the data for analysis. This paper describes a technique now implemented as a single computational tool, a data editor, which combines a cross-section of facilities from the above, with emphasis on research data manipulation and subsetting. The data editor provides an environment to explore and manipulate data sets with particular attention to the implications of large data sets. It utilizes a relational data model and a self-describing binary data format which allows data transportability to other data analysis packages. Some impacts of editing large data sets will be discussed. A technique for manipulating portions or subsets of large data sets without physical replication is introduced. Also an experimental command structure and operating environment are presented.

James J. Thomas, Robert A. Burnett, Janice R. Lewis

Graphical Methods and Their Software


Census Bureau Statistical Graphics

For some time, the U. S. Bureau of the Census has published statistical data in graphical as well as tabular formats. These graphical displays include barcharts, piecharts, line graphs, time series plots and univariate and bivariate statistical maps. Such publication graphics are provided both monochromatically and in color. Recently, the Census Bureau has initiated a research program to investigate the application of computer graphics to statistical data analysis. Examples of such analytical graphics are regression and time series plots, scatterplots used in outlier analysis, line graphs depicting rate of change overlaid on barcharts depicting level or value of one or more variables, and color statistical maps. This paper describes the computer graphics hardware and software capabilities of the Census Bureau, experience in computerized statistical graphics, and research for employing computer graphics as an analytical tool in statistical data analysis.

Lawrence H. Cox

Mosaics for Contingency Tables

A contingency table specifies the joint distribution of a number of discrete variables. The numbers in a contingency table are represented by rectangles of areas proportional to the numbers, with shape and position chosen to expose deviations from independence models. The collection of rectangles for the contingency table is called a mosaic. Mosaics of various types are given for contingency tables of two and more variables.

J. A. Hartigan, B. Kleiner

The Use of Kinematic Displays to Represent High Dimensional Data

Traditional data presentations deriving from pencil-and-paper techniques are inherently 2-dimensional, while the human visual system effortlessly deals with several more dimensions in an integrated fashion. In particular, moving pictures do impart a strong subjective 3-d effect. We describe a data manipulation and display system designed to tap this human ability for the purposes of statistical data analysis and discuss the main issues, design problems and solutions chosen. An experimental version of our system has been operating on a VAX-computer at Harvard since the fall of 1980. (A videotape illustrating some possible applications was shown at the conference.)

David Donoho, Peter J. Huber, Hans-Mathis Thoma

Contributed Papers


Order Statistics and an Experiment in Software Design

Not long ago, most introductory undergraduate statistics courses were taught without the use of computers. But now, it is common practice to introduce computing as part of such courses. Conversely, with current trends toward teaching techniques of structured programming and requiring ample documentation, it is useful to introduce elementary statistics into an undergraduate’s first programming course. As an example, we examine a programming assignment, given to the members of a class studying FORTRAN, to develop methods of analyzing order statistics of random samples, exceedances thereof, and related waiting times. The students, who were required to work in teams and to perform alphanumeric manipulations, were thereby compelled to develop well structured programs and to provide usable documentation.

R. S. Wenocur

Further Approximation to the Distributions of Some Transformations to the Sample Correlation Coefficient

In this article the first eleven moments of r are derived. They are used to examine the distributions of some of the familiar transformations of r under normal assumptions. Tables provided compare μ2, skewness (β1), and kurtosis (β2) with μ2*, β1* and β2* studied by Subrahmaniam and Gajjar (1980). These results provide further evidence of the usefulness of this work.

N. N. Mikhail, Beverly A. Prescott, L. P. Lester

Using Linear Programming to Find Approximate Solutions to the Fields to Impute Problem for Industry Data

Sande has suggested a mathematical programming formulation of the fields to impute problem (FTIP) for continuous data. This formulation seeks to find a minimum weighted sum of fields that would need to be changed to yield an acceptable record by solving a mixed integer programming problem known as the fixed charge problem. While this formulation can be, and has been, solved to find an optimal solution to the FTIP, this approach can be expensive in terms of solution time. In this paper, we demonstrate the use of a heuristic procedure to find an approximately optimal solution to FTIP. This procedure uses the SWIFT algorithm developed by Walker in conjunction with a judicious choice of dummy variable costs to arrive at an approximate solution based on a linear programming solution. We will show that this solution is optimal in many cases. We will also discuss the use of the special structure of FTIP to arrive at an optimal solution to the LP problem.

Patrick G. McKeown, Joanne R. Schaffer

Using Computer-Binned Data for Density Estimation

With real time microcomputer monitoring systems or with large data bases, data may be recorded as bin counts to satisfy computer memory constraints and to reduce computational burdens. If the data represent a random sample, then a natural question to ask is whether such binned data may successfully be used for density estimation. Here we consider three density procedures: the histogram, parametric models determined by a few moments, and the nonparametric kernel density estimator of Parzen and Rosenblatt. For the histogram, we show that computer-binning causes no problem as long as the binning is sufficiently smaller than the data-based bin width $$ 3.5\,\sigma n^{-1/3} $$. Another result is that some binning of data appears to provide marginal improvement in the integrated mean squared error of the corresponding kernel estimate. Some examples are given to illustrate the theoretical and visual effects of using binned data.

David W. Scott
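
The data-based bin width quoted in the abstract above can be illustrated with a minimal sketch. It assumes that σ is estimated by the sample standard deviation; the function name `scott_bin_width` and the variable `machine_bin` are hypothetical and are not taken from the paper.

```python
import statistics


def scott_bin_width(sample):
    """Data-based histogram bin width 3.5 * sigma * n**(-1/3).

    Assumption: sigma is estimated by the sample standard deviation.
    """
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations to estimate sigma")
    sigma = statistics.stdev(sample)
    return 3.5 * sigma * n ** (-1.0 / 3.0)


if __name__ == "__main__":
    data = [0, 1, 2, 3, 4, 5, 6, 7]
    h = scott_bin_width(data)
    machine_bin = 0.25  # hypothetical width of the bins used for recording
    # The abstract's condition is that the recording bins be much narrower than h.
    print(f"data-based width: {h:.4f}, ratio machine/data: {machine_bin / h:.4f}")
```

A small ratio of recording-bin width to data-based width corresponds to the situation the abstract describes as causing no problem for the histogram.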

On the Nonconsistency of Maximum Likelihood Nonparametric Density Estimators

One criterion proposed in the literature for selecting the smoothing parameter(s) in Rosenblatt-Parzen nonparametric constant kernel estimators of a probability density function is a leave-out-one-at-a-time nonparametric maximum likelihood method. Empirical work with this estimator in the univariate case showed that it worked quite well for short tailed distributions. However, it drastically oversmoothed for long tailed distributions. In this paper it is shown that this nonparametric maximum likelihood method will not select consistent estimates of the density for long tailed distributions such as the double exponential and Cauchy distributions. A remedy which was found for estimating long tailed distributions was to apply the nonparametric maximum likelihood procedure to a variable kernel class of estimators. This paper considers one data set, which is a pseudo-random sample of size 100 from a Cauchy distribution, to illustrate the problem with the leave-out-one-at-a-time nonparametric maximum likelihood method and to illustrate a remedy to this problem via a variable kernel class of estimators.

Eugene F. Schuster, Gavin G. Gregory

Computer Program for Krishnaiah’s Finite Intersection Tests for Multiple Comparisons of Mean Vectors

The program FIT performs Krishnaiah’s finite intersection test procedure on the mean vectors from k multivariate populations. The test procedure is valid under the following assumptions: a) the k populations are distributed as multivariate normal, b) the covariance matrices of the k populations are equal. We can perform two-sided or one-sided tests. The common covariance matrix, Σ = (σij), may be unknown or known. When Σ is unknown, the test statistics are distributed as multivariate F or multivariate t for the two-sided test or the one-sided test respectively. In the case when Σ is known, the test statistics are distributed as multivariate chi-square or multivariate normal for the two-sided test or the one-sided test respectively. The program FIT computes suitable bounds on the required percentage points of these distributions.

C. M. Cox, C. Fang, R. M. Boudreau

Approximating the Log of the Normal Cumulative

Approximation formulas are obtained for ln Φ, where Φ is the cumulative distribution function for the normal distribution.

John F. Monahan

A kth Nearest Neighbour Clustering Procedure

Due to the lack of development in the probabilistic and statistical aspects of clustering research, clustering procedures are often regarded as heuristics generating artificial clusters from a given set of sample data. In this paper, a clustering procedure that is useful for drawing statistical inference about the underlying population from a random sample is developed. It is based on the uniformly consistent kth nearest neighbour density estimate, and is applicable to both case-by-variable data matrices and case-by-case dissimilarity matrices. The proposed clustering procedure is shown to be asymptotically consistent for high-density clusters in several dimensions, and its small-sample behavior is illustrated by empirical examples.

M. Anthony Wong, Tom Lane

Interactive Statistical Graphics: Breaking Away

The proper role of the computer in data analysis is one of increasing the ability of the analyst to extract information. This requires a high degree of user control within an interactive environment, not only over data manipulation and computations but also over graphical display. An interactive statistical graphics system such as that described in this paper allows the user to break away from constraints imposed by most statistical packages.

Neil W. Polhemus, Bernard Markowicz

Interactive Graphical Analysis for Multivariate Data

A Fortran program, called CLUSTER, has been implemented which interactively assists the statistician in exploring multivariate data, displaying projected data on a Tektronix 4010, Tektronix 4027 or Hewlett-Packard 2648 graphics terminal. The program is designed to be portable, and isolates device dependent display code so as to simplify the addition of device drivers for other graphics terminals. Both keyboard and cursor input are supported, and emphasis has been placed on a good human interface.

Robert B. Stephenson, John C. Beatty, Jane F. Gentleman

SLANG, a Statistical Language for Descriptive Time Series Analysis

SLANG is a language designed to provide easy access to a statistical database for users who have little or no programming experience. This language, which can operate in both interactive and batch mode, allows retrieval and descriptive analyses of time series. For more complex analyses, the retrieval capability of SLANG can also be used as a bridge between a statistical database and those commercially available statistical analysis packages which support only sequential input. Design highlights presented in the paper are: a syntax which allows only a small number of statement types, simple data types and a library of functions which users can extend to increase the power of the language. User experience and efficiency considerations are also discussed. SLANG currently operates under IBM’s MVS on a database of international statistics managed by ADABAS.

M. Nicolai, R. Cheng

On the Parameter Estimation in Queueing Theory

Two estimators are introduced for estimating the number m of servers for the multiserver queueing system M/M/m with infinite waiting room. One estimator is the maximum likelihood estimator and the other is more accurate for estimating large values of m. The system is simulated in order to compare the two estimators numerically.

Jacob E. Samaan, Derrick S. Tracy

Statistical Computation with a Microcomputer

The development of microcomputer systems in recent years has given many individuals the capability for statistical computation that would have been previously impossible. The proper role of microcomputers and their advantages and limitations are discussed. The MICROSTAT system is described as an example of the capability of a microcomputer statistical package. Sample printouts illustrate the degree of computational accuracy that can be achieved with an 8-bit microprocessor.

J. Burdeane Orris

On the Exact Distribution of Geary’s U-Statistic and its Application to Least Squares Regression

In this paper the first four moments of the U-statistic are derived. A two-moment graduation based on them shows that neither a beta distribution nor a scalar multiple chi-square distribution is a good fit to the actual distribution of the U-statistic. On the other hand, a two-moment graduation using the exact four moments of the U²-statistic as a scalar multiple chi-square distribution is a very good approximation to the actual distribution of the U²-statistic. A comparison of our results with those of Gastwirth and Selwyn (1980, 139) is given. From the computations of β1 and β2 for the fitted scalar multiple chi-square and for the actual distribution, we can recommend the fitted scalar multiple chi-square distribution for any statistical tests in practical situations.

N. N. Mikhail, L. P. Lester

Exposure to the Risk of an Accident: The Canadian Department of Transport National Driving Survey and Data Analysis System, 1978–79

This paper is the first of a series reporting on the methodology and results of a comprehensive twelve-month, nationwide survey conducted in Canada during 1978–1979. There were approximately 22,700 households sampled using a 7-day driver trip diary recording instrument. The surveyed information consists of 3 dependent variables and 59 main independent variables classified into 5 different record types. A data analysis system was designed to provide for maximum flexibility through the implementation of three sub-systems. Part I discusses objectives, design and methodological features for both the survey and data analysis system. Subsequent parts will focus on further system enhancements, i.e. linkage with the Canadian traffic accident data base and implementation of a detailed linear modelling system for statistical analyses. This part presents exposure information for various driver, vehicle and trip variables and examines relative risk ratios that are a function of accidents/fatalities and exposure (travel distance or travel time). These “exposure-sensitive” measurements can be used to study the diverse groups of independent variables surveyed, to identify significant variations, and to provide a basis for the implementation of effective traffic safety countermeasures.

Delbert E. Stewart

CONCOR: An Edit and Automatic Correction Package

For over ten years the International Statistical Programs Center (ISPC) of the U.S. Bureau of the Census has been involved in the development and dissemination of generalized computer software products for use by statistical organizations in developing countries. In response to critical needs for improvement in data processing capabilities during the 1970 World Census Program, ISPC developed a general cross-tabulation system (CENTS), which has been continually enhanced over the years and is presently installed at over 90 computer centers worldwide.

Robert R. Bair

BGRAPH: A Program for Biplot Multivariate Graphics

BGRAPH (Tsianco, 1980) is an interactive conversational program to perform biplot multivariate graphics. The program generates two- and three-dimensional biplot displays based on the singular value decomposition (SVD) of a matrix and the resulting rank 2 or 3 approximations. Three-dimensional displays may be orthogonal projections, perspective projections, stereograms or anaglyphs. Other capabilities of the program include subset selection, selective labeling of points, rotation of axes, construction of ellipsoids of concentration, MANOVA biplots and plot storage.

Michael C. Tsianco, K. Ruben Gabriel, Charles L. Odoroff, Sandra Plumb

MONCOR--A Program to Compute Concordant and other Monotone Correlations

The new interactive FORTRAN program MONCOR is described. MONCOR computes the concordant monotone correlation, discordant monotone correlation, isoconcordant monotone correlation, isodiscordant monotone correlation and their associated monotone variables. Data input can be finite discrete bivariate probability mass functions or ordinal contingency tables, both of which must be given in matrix form. The well-known British Mobility data are used to illustrate the input and output options available in MONCOR.

George Kimeldorf, Jerrold H. May, Allan R. Sampson

Computer Offerings for Statistical Graphics -- An Overview

CRT display devices, special plotters and other graphics output devices which communicate with small or large host computers provide analysts an opportunity to obtain automatically high quality graphical displays of data and results of statistical evaluations. This paper gives an overview of programs and subroutine libraries for statistical graphics. Different offerings are identified and their technical features compared. The purpose is to provide a working guide to those who wish to use statistical graphics.

Patricia M. Caporal, Gerald J. Hahn

Computing Percentiles of Large Data Sets

We describe an algorithm for finding percentiles of large data sets (those having 100,000 or more points). This algorithm does not involve sorting the entire data set. Instead, we sample the data and obtain a guess for the percentile. Then, using the guess we extract a subset of the original data through which we search for the true percentile.

Jo Ann Howell
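
The sample-then-search idea in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's algorithm: the function name `percentile_by_sampling`, the `sample_size` and `margin` parameters, the rank convention (the percentile is the value of rank ceil(p·n)), and the fallback to a full sort when the bracket misses are all hypothetical choices made for the example.

```python
import math
import random


def percentile_by_sampling(data, p, sample_size=1000, margin=0.05, seed=0):
    """Return the order statistic of rank ceil(p * n) without sorting all of data.

    data is a non-empty list of numbers and p lies in (0, 1].
    """
    n = len(data)
    if n == 0:
        raise ValueError("data must be non-empty")
    k = max(1, math.ceil(p * n))  # 1-based rank of the target value

    # Step 1: sort a random sample and bracket the guess for the percentile.
    rng = random.Random(seed)
    m = min(sample_size, n)
    sample = sorted(rng.sample(data, m))
    lo_idx = math.floor((p - margin) * m)
    hi_idx = math.ceil((p + margin) * m)
    lo = sample[lo_idx] if lo_idx >= 0 else -math.inf
    hi = sample[hi_idx] if hi_idx < m else math.inf

    # Step 2: one pass over the data, counting values below the bracket
    # and keeping only the values that fall inside it.
    below = 0
    subset = []
    for x in data:
        if x < lo:
            below += 1
        elif x <= hi:
            subset.append(x)

    # Step 3: search the (small) subset for the true percentile.
    j = k - below
    if 1 <= j <= len(subset):
        subset.sort()
        return subset[j - 1]

    # Assumed fallback: the bracket missed, so sort everything.
    return sorted(data)[k - 1]


if __name__ == "__main__":
    rng = random.Random(1)
    values = [rng.random() for _ in range(100_000)]
    print(percentile_by_sampling(values, 0.9))
```

With a bracket of five percentage points on either side, the subset that is actually sorted holds roughly a tenth of the data, which is where the saving over a full sort comes from in this sketch.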

A Self-Describing Data File Structure for Large Data Sets

A major goal of the Analysis of Large Data Sets (ALDS) research project at Pacific Northwest Laboratory (PNL) is to provide efficient data organization, storage, and access capabilities for statistical applications involving large amounts of data. As part of the effort to achieve this goal, a self-describing binary (SDB) data file structure has been designed and implemented together with a set of data manipulation functions and supporting SDB data access routines. Logical and physical data descriptors are stored in SDB files preceding the data values. SDB files thus provide a common data representation for interfacing diverse software components. This paper describes the data descriptors and data structures permitted by the file design. Data buffering, file segmentation and a segment overflow handler are also discussed.

Robert A. Burnett

Nonlinear Estimation Using a Microcomputer

The estimation of nonlinear models, possibly involving constraints, can be carried out quite easily using contemporary computers. The estimation process is illustrated using both real-world and artificial problems, the largest problem involving no less than 1250 nonlinear parameters. The formulation of nonlinear estimation problems is presented and various algorithms are suggested for their solution. The particular numerical methods suitable for microcomputer environments are sketched. A discussion of the role of scaling is given. Performance figures are presented for various problems using a North Star Horizon computer and the Radio Shack/Sharp Pocket Computer. It is noted that the microcomputer was able to solve a 41 parameter econometric problem in relatively little time after a service bureau budget had been exhausted in seeking parameter estimates without success.

John C. Nash

Statistical Procedures for Low Dose Extrapolation of Quantal Response Toxicity Data

Maximum likelihood procedures for fitting the probit, logit, Weibull and gamma multi-hit dose response models with independent, additive or mixed independent/additive background to quantal assay toxicity data are reviewed. In addition to parameter estimation, the use of the above models for low dose extrapolation is indicated with both point estimates and lower confidence limits on the “safe” dose discussed. A computer program implementing these procedures is described and two sets of toxicity data are analyzed to illustrate its use.

John Kovar, Daniel Krewski

An Economic Design of x̄-Charts with Warning Limits to Control Non-Normal Process Means

In this paper, we develop an expected cost model for a production process under the surveillance of an x-chart with warning limits for controlling the non-normal process mean. The economic design of control charts involves the optimal determination of the design parameters that minimize the expected total cost of monitoring the quality of the process output. The design parameters of a general control chart with warning limits are the sample size, the sampling interval, the action limit coefficient, the warning limit coefficient, and the critical run length. To develop the expected loss-cost function, expressions for the average run lengths, when the process is in control, and when the process is out of control are derived. A direct search technique is employed to obtain the optimal values of the design parameters. The effects of non-normality parameters on the loss-cost function and on the design parameters are discussed using a numerical example.

M. A. Rahim, R. S. Lashkari

Prior Probabilities, Maximal Posterior, and Minimal Field Error Localization

One of the significant difficulties with automatic edit and imputation is that any attempt to rigorously justify the methods used must confront the problem of the “error model”: the observed record x is the true record y plus an error vector ε, $$ x = y + \varepsilon $$.

G. E. Liepins, D. J. Pack

