
1994 | Book


The Statistical Analysis of Categorical Data

Author: Professor Dr. Erling B. Andersen

Publisher: Springer Berlin Heidelberg

Included in: Professional Book Archive

About this book

The aim of this book is to give an up-to-date account of the most commonly used statistical models for categorical data. The emphasis is on the connection between theory and applications to real data sets. The book only covers models for categorical data. Various models for mixed continuous and categorical data are thus excluded. The book is written as a textbook, although many methods and results are quite recent. This should imply that the book can be used for a graduate course in categorical data analysis. With this aim in mind, chapters 3 to 12 are concluded with a set of exercises. In many cases, the data sets are those which were not included in the examples of the book, although they at one point in time were regarded as potential candidates for an example. A certain amount of general knowledge of statistical theory is necessary to fully benefit from the book. A summary of the basic statistical concepts deemed necessary prerequisites is given in chapter 2. The mathematical level is only moderately high, but the account in chapter 3 of basic properties of exponential families and the parametric multinomial distribution is made as mathematically precise as possible without going into mathematical details and leaving out most proofs.

Frontmatter

1. Categorical Data

Abstract

This book is about categorical data, i.e. data which can only take a finite or countable number of values.

Erling B. Andersen

2. Preliminaries

Abstract

In this chapter a short review is given of some basic elements of statistical theory which are necessary prerequisites for the theory and methods developed in subsequent chapters.

Erling B. Andersen

3. Statistical Inference

Abstract

The majority of interesting models for categorical data are log-linear models. A family of log-linear models is often referred to as an exponential family.

Erling B. Andersen

4. Two-way Contingency Tables

Abstract

A two-way contingency table is a number of observed counts set up in a matrix with I rows and J columns. Data are thus given as a matrix

$$X = \begin{bmatrix} x_{11} & \cdots & x_{1J} \\ \vdots & & \vdots \\ x_{I1} & \cdots & x_{IJ} \end{bmatrix}$$

The statistical model for such data depends on the way the data are collected. A great variety of tables can, however, be treated by three closely connected statistical models. Let the random variables corresponding to the contingency table be X₁₁,…,X_IJ. Then in the first model the X_ij’s are assumed to be independent with

$${X_{ij}} \sim Ps({\lambda _{ij}}),$$

i.e. X_ij is Poisson distributed with parameter λ_ij. The likelihood function for this model is

$$f(x_{11},\ldots,x_{IJ} \mid \lambda_{11},\ldots,\lambda_{IJ}) = \prod_{i=1}^{I} \prod_{j=1}^{J} \frac{\lambda_{ij}^{x_{ij}}}{x_{ij}!} e^{-\lambda_{ij}}$$

(4.1)

The log-likelihood is accordingly given by

$$\ln L(\lambda_{11},\ldots,\lambda_{IJ}) = \sum_{i} \sum_{j} x_{ij} \ln \lambda_{ij} - \sum_{i} \sum_{j} \ln x_{ij}! - \sum_{i} \sum_{j} \lambda_{ij}.$$

(4.2)

The model is thus an IJ-dimensional log-linear model with canonical parameters ln λ₁₁,...,ln λ_IJ and sufficient statistics T_ij=X_ij, i=1,...,I, j=1,...,J.
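
As a minimal sketch of equation (4.2), the log-likelihood can be evaluated directly for a small table. The function name `poisson_log_likelihood` and the 2 × 3 counts are hypothetical, chosen only for illustration; `math.lgamma(x + 1)` stands in for ln x!.

```python
import math


def poisson_log_likelihood(x, lam):
    """Evaluate (4.2): sum of x_ij*ln(lam_ij) - ln(x_ij!) - lam_ij over all cells."""
    total = 0.0
    for x_row, lam_row in zip(x, lam):
        for x_ij, lam_ij in zip(x_row, lam_row):
            # math.lgamma(x + 1) equals ln(x!) for a non-negative integer x.
            total += x_ij * math.log(lam_ij) - math.lgamma(x_ij + 1) - lam_ij
    return total


# Hypothetical 2 x 3 table of counts (illustrative only, not data from the book).
counts = [[12, 7, 3], [5, 9, 14]]

# Unrestricted model: with all counts positive, the likelihood is maximized
# at lam_ij = x_ij.
ll_saturated = poisson_log_likelihood(counts, counts)

# A restricted alternative: one common mean for every cell.
common_mean = sum(sum(row) for row in counts) / 6
ll_common = poisson_log_likelihood(counts, [[common_mean] * 3 for _ in counts])

print(f"saturated: {ll_saturated:.3f}, common mean: {ll_common:.3f}")
```

Comparing the two values shows how much log-likelihood is lost by restricting the IJ free parameters to a single one.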

Erling B. Andersen

5. Three-way Contingency Tables

Abstract

Consider a three-way contingency table {x_ijk, i=1,...,I, j=1,...,J, k=1,...,K}. As a model for such data, it may be assumed that the x’s are observed values of random variables X_ijk, i=1,...,I, j=1,...,J, k=1,...,K with a multinomial distribution

$${X_{111}},...,{X_{IJK}} \sim M(n,{p_{111}},...,{p_{IJK}}).$$

(5.1)
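
The multinomial model (5.1) can be illustrated with a short sketch that evaluates the log-probability of a table. The function name `multinomial_log_pmf` and the 2 × 2 × 2 counts are hypothetical; the estimates x_ijk / n are the standard maximum likelihood estimates when the p's are unrestricted.

```python
import math


def multinomial_log_pmf(counts, probs):
    """Log-probability of the cell counts under M(n, p_111, ..., p_IJK)."""
    n = sum(counts)
    # ln of the multinomial coefficient n! / (x_111! ... x_IJK!)
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(x + 1) for x in counts)
    # Cells with x = 0 contribute nothing to the sum of x * ln(p).
    return log_coef + sum(x * math.log(p) for x, p in zip(counts, probs) if x > 0)


# Hypothetical 2 x 2 x 2 table, indexed as table[i][j][k] (illustrative only).
table = [[[10, 4], [6, 8]], [[3, 9], [7, 5]]]
flat_counts = [x for layer in table for row in layer for x in row]
n = sum(flat_counts)

# Without restrictions on the p's, the maximum likelihood estimates are x_ijk / n.
p_hat = [x / n for x in flat_counts]
log_prob = multinomial_log_pmf(flat_counts, p_hat)

print(f"n = {n}, log-probability at p_hat = {log_prob:.3f}")
```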

Erling B. Andersen

6. Multi-dimensional Contingency Tables

Abstract

In chapter 5, log-linear models for three-dimensional tables were treated in great detail. Hence we shall not, for higher order tables, go into details with the parameterizations of the models or with the exact expressions for test quantities and their distributions. Besides, for higher order tables the mathematical expressions quickly become large and cumbersome to write down.

Erling B. Andersen

7. Incomplete Tables, Separability and Collapsibility

Abstract

An observed contingency table is incomplete if it contains zeros in certain cells. Such zeros are of two types, random zeros and structural zeros. A cell has a random zero if the observed value in the cell is zero, but the expected value is positive. A cell has a structural zero if the expected number is zero, i.e. if it is known a priori that the cell will contain a zero. Random or structural zeros do not impair the log-linear structure of a given model. They mean, however, that certain log-linear parameters cannot be estimated.
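
The distinction between the two types of zeros can be sketched as a simple classification rule. The function `classify_cells` and both small tables are hypothetical and serve only to restate the definitions above in executable form.

```python
def classify_cells(observed, expected):
    """Label each cell as 'structural zero', 'random zero' or 'positive'."""
    labels = []
    for obs_row, exp_row in zip(observed, expected):
        row_labels = []
        for obs, exp in zip(obs_row, exp_row):
            if exp == 0:
                # Expected number zero: the cell is known a priori to be empty.
                if obs != 0:
                    raise ValueError("positive count in a cell with expected value 0")
                row_labels.append("structural zero")
            elif obs == 0:
                # Observed zero although the expected value is positive.
                row_labels.append("random zero")
            else:
                row_labels.append("positive")
        labels.append(row_labels)
    return labels


# Hypothetical observed counts and expected values (illustrative only).
observed = [[0, 5], [3, 0]]
expected = [[0.0, 5.2], [2.8, 1.1]]
labels = classify_cells(observed, expected)
print(labels)
```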

Erling B. Andersen

8. The Logit Model

Abstract

In chapters 4, 5 and 6 the categorical variables appeared in the model in a symmetrical way. In many situations, for example in examples 6.1 and 6.2 in chapter 6, one of the variables is of special interest. For the survival data in example 6.1, survival is the variable of special interest, and the problem is to study if the other three variables have influenced the chance of survival. Variable B in example 6.1 may, therefore, be called a response variable and variables A, C and D explanatory variables. This terminology is the same as the one used in regression analysis, and when survival is regarded as a response variable the data in example 6.1 can in fact be analysed by a regression model. In example 6.2 the position of the collision on the truck can be regarded as a response variable. We are here primarily interested in the effect of explanatory variable A, i.e. the introduction of the safety measure in November 1971, but have to take into account that the other explanatory variables, i.e. whether the truck was parked or not and what the light conditions were, may be of importance for the location of the collision. When the response variable is binary and the explanatory variables are categorical, the appropriate regression model is known as the logit model. More precisely, the assumptions for a logit model are:

(a) The response variable is binary.

(b) The contingency table formed by the response variable and the explanatory variables can be described by a log-linear model.
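
To illustrate the scale on which a logit model works, the sketch below computes empirical logits for a binary response at each level of one categorical explanatory variable. It assumes the usual definition of the logit as the log-odds ln(p / (1 − p)); the level names and counts are hypothetical.

```python
import math


def logit(p):
    """Log-odds ln(p / (1 - p)) of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


# Hypothetical counts of a binary response at each level of one categorical
# explanatory variable: level -> (number with response 1, number with response 0).
counts = {"level 1": (30, 10), "level 2": (20, 20), "level 3": (8, 32)}

empirical_logits = {}
for level, (ones, zeros) in counts.items():
    p = ones / (ones + zeros)  # observed proportion with response 1
    empirical_logits[level] = logit(p)

for level, value in empirical_logits.items():
    print(f"{level}: {value:+.3f}")
```

A positive logit means response 1 is more frequent than response 0 at that level, a negative logit the reverse, and zero means the two are equally frequent.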

Erling B. Andersen

9. Logistic Regression Analysis

Abstract

In chapter 8 the connection to log-linear models for contingency tables was stressed. The direct connection to regression analysis for continuous response variables will now be brought more clearly into focus. Assume as before that the response variable is binary and that it is observed together with p explanatory variables. For n cases the data will then consist of n vectors

$$\left( y_{\nu}, z_{1\nu}, \ldots, z_{p\nu} \right), \quad \nu = 1, \ldots, n$$

of jointly observed values of the response variable and the explanatory variables.
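
A minimal sketch of how such data enter a logistic regression, assuming the standard logistic form for the response probability. The abstract itself does not state the model formula, so the parameter names `beta0` and `beta`, the function names and the five data vectors are all hypothetical.

```python
import math


def response_probability(beta0, beta, z):
    """P(y = 1 | z) under the standard logistic form (assumed here)."""
    eta = beta0 + sum(b * z_j for b, z_j in zip(beta, z))
    return 1.0 / (1.0 + math.exp(-eta))


def log_likelihood(beta0, beta, data):
    """Sum over cases of y*ln(pi) + (1 - y)*ln(1 - pi)."""
    total = 0.0
    for y, *z in data:
        pi = response_probability(beta0, beta, z)
        total += y * math.log(pi) + (1 - y) * math.log(1.0 - pi)
    return total


# Hypothetical data: n = 5 cases (y, z_1, z_2) with p = 2 explanatory variables.
data = [(1, 0.5, 1.0), (0, -1.2, 0.0), (1, 2.0, 1.0), (0, 0.3, 0.0), (1, 1.1, 0.0)]

ll_null = log_likelihood(0.0, [0.0, 0.0], data)  # every probability equals 0.5
ll_trial = log_likelihood(-0.5, [1.0, 0.5], data)  # arbitrary trial parameters

print(f"null: {ll_null:.3f}, trial: {ll_trial:.3f}")
```

The sketch only evaluates the log-likelihood at given parameter values; maximizing it requires an iterative procedure, which is omitted here.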

Erling B. Andersen

10. Models for the Interactions

Abstract

If the statistical analysis of a contingency table is based on one of the log-linear models in chapters 5, 6 and 7, a number of natural models are easily overlooked. Many useful models can thus be expressed as structures in the log-linear interaction parameters. In this chapter a number of such models are discussed. Many of these models can be viewed as attempts to describe the non-zero interactions by a simple structure if the analysis of the data by a log-linear model has failed to give a satisfactory fit. If e.g. the independence hypothesis for a two-way table has been rejected, a residual analysis will often reveal a certain structure in the two-factor interactions.
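
The residual analysis mentioned above can be sketched for a two-way table as follows. Expected counts under independence are the products of row and column totals divided by n. Pearson residuals are used here as one common choice, and the book's own residuals may be defined differently. The function name and the 3 × 3 table are hypothetical.

```python
import math


def independence_fit(table):
    """Expected counts x_i. * x_.j / n and Pearson residuals (x - e) / sqrt(e)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    residuals = [
        [(x - e) / math.sqrt(e) for x, e in zip(x_row, e_row)]
        for x_row, e_row in zip(table, expected)
    ]
    return expected, residuals


# Hypothetical 3 x 3 table with an excess of counts on the diagonal (illustrative only).
table = [[20, 5, 5], [5, 20, 5], [5, 5, 20]]
expected, residuals = independence_fit(table)

for row in residuals:
    print([round(r, 2) for r in row])
```

In this made-up table the residuals are positive on the diagonal and negative elsewhere, which is the kind of pattern that would suggest a simple structure for the two-factor interactions.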

Erling B. Andersen

11. Correspondence Analysis

Abstract

A statistical technique which is closely related to the models discussed in chapter 10 was developed in France in the 1970s. This technique, known in the English-speaking world as correspondence analysis, was introduced by Benzecri (1973) as l’Analyse de Correspondance. Many authors have argued that correspondence analysis was not developed in France by Benzecri. This claim is correct in the sense that the technique is closely related to many other forms of statistical analysis, which go far back in time. Some of these connections are discussed in section 11.3 below, where references will also be given. Whatever those connections are, the name correspondence analysis and its popularity in France and neighbouring countries are certainly due to Benzecri and his students.

Erling B. Andersen

12. Latent Structure Analysis

Abstract

The structure of a log-linear model can be described on an association diagram by the lines connecting the points. Especially for higher order contingency tables the structure on an association diagram can be very complicated, implying a complicated interpretation of the model. Adding to the interpretation problems for a multi-dimensional contingency table is the fact that the decision to exclude or include a given interaction in the model can be based on conflicting significance levels depending on the order in which the statistical tests are carried out. These decisions are thus based on the intuition and experience of the data analyst rather than on objective criteria. Hence a good deal of arbitrariness is often involved when a model is selected to describe the data. We recall for example from several of the examples in the previous chapters that the log-linear model often gave an adequate description of the data judged by a direct test of the model against the saturated model, while among the sequence of successive tests leading to the model, there were cases of significant levels.

Erling B. Andersen

13. Computer Programs

Abstract

For most of the models in this book it is necessary to use computer programs to execute the statistical computations.

Erling B. Andersen

Backmatter

Title: The Statistical Analysis of Categorical Data
Author: Professor Dr. Erling B. Andersen
Publisher: Springer Berlin Heidelberg
Electronic ISBN: 978-3-642-78817-8
Print ISBN: 978-3-642-78819-2
DOI: https://doi.org/10.1007/978-3-642-78817-8
