About this book

In many areas of science a basic task is to assess the influence of several factors on a quantity of interest. If this quantity is binary, logistic regression models provide a powerful tool for this purpose. This monograph presents an account of the use of logistic regression in the case where missing values in the variables prevent the use of standard techniques. Such situations occur frequently across a wide range of statistical applications.
The emphasis of this book is on methods related to the classical maximum likelihood principle. The author reviews the essentials of logistic regression and discusses the variety of mechanisms which might cause missing values, while the rest of the book covers the methods which may be used to deal with missing values and their effectiveness. Researchers across a range of disciplines and graduate students in statistics and biostatistics will find this a readable account.

Table of Contents

Frontmatter

Logistic Regression With Two Categorical Covariates

1. Introduction

Abstract
In many scientific areas a basic task is to assess the simultaneous influence of several factors on a quantity of interest. Regression models therefore provide a powerful framework, and estimation of the effects in such models is a well-established field of statistics. In general this estimation is based on measurements of the factors (covariates) and the quantity of interest (outcome variable) for a set of units. However, in practice often not all covariates can be measured for all units, i.e., some of the units show a missing value in one or several covariates. The reasons can be very different and depend mainly on the type of the measurement procedure for a single covariate and the type of the data collection procedure. Some examples illustrate this:
  • If the covariate values are collected by a questionnaire or interview, non-response is a typical source of missing values. It may be due to a true lack of knowledge, for example if a person is asked about certain diseases during their childhood, or to an intentional refusal. The latter is especially to be expected for embarrassing questions on topics like alcohol consumption, drug abuse, sexual activities, or income.
  • In retrospective studies covariate values are often collected on the basis of documents like hospital records. Incompleteness of the documents causes missing values.
  • In clinical trials biochemical parameters are often used as covariates. The measurement of these parameters often requires a certain amount of blood, urine or tissue, which may not be available.
  • In prospective clinical trials the recruitment of patients can last several years. Meanwhile scientific progress may reveal new influential factors, which may prompt the decision to add the measurement of such a covariate to the data collection procedure. For patients recruited before this decision the value of this covariate is missing.
  • If the measurement of a covariate is very expensive, one may restrict the measurement to a subset of all units.
  • Even in a well-planned and well-conducted study small accidents can happen. A test tube may break, a case report form may be lost in the mail, an examination may be forgotten, the inaccuracy of an instrument may be detected too late, etc. Each accident may cause a missing value.
Werner Vach

2. The Complete Data Case

Abstract
Let Y be a binary outcome variable, X1 a covariate with categories 1,…, J and X2 a covariate with categories 1,…, K. In a logistic model we assume
$$P\left ( Y =1|X_{1}=j,X_{2}=k\right )=\Lambda \left ( \beta_{0} +\beta_{1j} +\beta_{2k} \right )=\mu_{jk} \left ( \beta \right )\;\;\;(2.1)$$
with parameter restrictions β11 = 0 and β21 = 0. Here Λ(x) := 1/(1 + e^{−x}) denotes the logistic function. We consider the covariates as random variables, and parametrize their joint distribution by
$$P\left ( X_{1}=j\right )=:\tau_{j} \;\;\textup{and}\;\; P\left ( X_{2}=k|X_{1}=j\right )=:\pi_{k|j}.$$
Werner Vach
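As an illustration of model (2.1), the following minimal sketch evaluates the cell probabilities μjk(β) = Λ(β0 + β1j + β2k) under the restrictions β11 = 0 and β21 = 0. The function names and the numerical parameter values are assumptions chosen for this example; they are not taken from the book.

```python
import math


def logistic(x):
    """Logistic function Lambda(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def cell_probabilities(beta0, beta1, beta2):
    """Return mu[j][k] = Lambda(beta0 + beta1[j] + beta2[k]).

    beta1 and beta2 are lists whose position 0 belongs to category 1,
    so the restrictions beta_11 = 0 and beta_21 = 0 mean that the
    first entry of each list must be zero.
    """
    if beta1[0] != 0 or beta2[0] != 0:
        raise ValueError("reference categories must have coefficient 0")
    return [[logistic(beta0 + b1 + b2) for b2 in beta2] for b1 in beta1]


# Made-up parameter values with J = 2 and K = 3 categories
mu = cell_probabilities(beta0=-1.0, beta1=[0.0, 0.5], beta2=[0.0, 1.0, -0.5])
```

Each entry of `mu` is the modelled probability P(Y = 1 | X1 = j, X2 = k); the entry for j = 1, k = 1 depends on β0 alone because both reference coefficients are zero.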

3. Missing Value Mechanisms

Abstract
We now assume that we have missing values in the second covariate. The observability of X2 is indicated by a random variable
$$O_{2}:=\begin{cases} 1 & \text{if}\;\;X_{2}\;\;\text{is observable}\\ 0 & \text{if}\;\;X_{2}\;\;\text{is unobservable} \end{cases}$$
and instead of X2 we observe the random variable
$$Z_{2}:=\begin{cases} X_{2} \;\;\textup{if}\;\;O_{2}=1\\ K+1 \;\;\textup{if}\;\;O_{2}=0 \end{cases}$$
with an additional category for missing values. Instead of K + 1 we also use the symbol “?” for this category in the sequel.
Werner Vach
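The definition of Z2 translates directly into code. The sketch below is hedged: the rule for `observe` follows the definition above, whereas `draw_o2` assumes, purely for illustration, that the probability of observing X2 depends only on Y and X1 through a table `q`. That table, its values and all function names are hypothetical.

```python
import random

MISSING = "?"  # stands for the additional category K + 1


def observe(x2, o2):
    """Z2 := X2 if O2 = 1, and the missing category if O2 = 0."""
    return x2 if o2 == 1 else MISSING


def draw_o2(y, x1, q, rng):
    """Draw O2 with P(O2 = 1 | Y = y, X1 = x1) = q[(y, x1)].

    Assumption for this sketch: observability depends only on the
    always-observed variables Y and X1, not on X2 itself.
    """
    return 1 if rng.random() < q[(y, x1)] else 0


# Made-up observation probabilities for Y in {0, 1} and X1 in {1, 2}
rng = random.Random(0)
q = {(0, 1): 0.9, (0, 2): 0.6, (1, 1): 0.8, (1, 2): 0.5}
y, x1, x2 = 1, 2, 3
z2 = observe(x2, draw_o2(y, x1, q, rng))
```

Depending on the draw, `z2` is either the true value of X2 or the symbol "?".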

4. Estimation Methods

Abstract
Methods to estimate the regression parameters of a logistic model in the presence of missing values in the covariates can be divided into two classes. The first class contains ad hoc methods, which try to manipulate the missing values in a simple manner in order to obtain an artificially completed table without missing values. The widespread use of these methods relates mainly to the fact that we can use standard statistical software for the analysis of the completed table. But a major drawback of these methods is their poor statistical properties. For many missing value mechanisms, even if they satisfy the MAR assumption, the methods may yield inconsistent estimates, or they use the information of the incomplete observations in a very inefficient manner. Moreover, the validity of variance estimates and confidence intervals is in doubt, because the process of manipulating the table is neglected. For the second class of methods, consistency is implied by the estimation principle, and estimates of asymptotic variance are available. The drawback of these methods is the increased effort for the implementation, because standard statistical software cannot be used directly.
Werner Vach
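To make the idea of an "artificially completed table" concrete, here is a minimal sketch of two simple manipulations named elsewhere in this listing, Complete Case Analysis and Additional Category. The record layout (y, x1, z2), the function names and the data are assumptions made for this example, and the book's own definitions may differ in detail.

```python
MISSING = "?"  # symbol used for the missing-value category of Z2


def complete_case(records):
    """Complete Case Analysis: keep only units whose Z2 is observed."""
    return [(y, x1, z2) for (y, x1, z2) in records if z2 != MISSING]


def additional_category(records, K):
    """Additional Category: recode a missing Z2 as the extra category K + 1."""
    return [(y, x1, K + 1 if z2 == MISSING else z2) for (y, x1, z2) in records]


# Made-up units (y, x1, z2) with K = 2 categories for X2
records = [(1, 1, 2), (0, 2, MISSING), (1, 2, 1), (0, 1, MISSING)]
cc = complete_case(records)
ac = additional_category(records, K=2)
```

Either result contains no missing values and could be passed to standard logistic regression software, which is the convenience the abstract refers to. The manipulation step itself is what the resulting variance estimates ignore.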

5. Quantitative Comparisons: Asymptotic Results

Abstract
So far the evaluation of the methods has been performed on a qualitative level: We have three consistent methods (ML Estimation, PML Estimation and Filling) and four asymptotically biased methods (Conditional and Unconditional Probability Imputation, Additional Category and Omission of Covariate). The bias of Complete Case Analysis is of another quality, because it depends only on the structure of the missing rates. With the exception of MDXY mechanisms, Complete Case Analysis yields consistent estimates for the regression parameters β1j and β2k, and hence it can be compared to the three consistent methods mentioned above.
Werner Vach

6. Quantitative Comparisons: Results of Finite Sample Size Simulation Studies

Abstract
The investigations of the last chapter were based on asymptotic arguments. It remains to show that the results of the comparisons are transferable to the finite sample size. Moreover, the properties of the methods themselves have been examined so far only asymptotically, and the estimation of variance is also based on asymptotic results.
Werner Vach

7. Examples

Abstract
So far our investigations have pointed out a lot of strengths and weaknesses of the considered methods. We want to illustrate some of these features by comparing the results of different methods applied to the same data set. We start with some artificially generated data sets, where we know the missing value mechanism, and then we present an example with a real data set.
Werner Vach

8. Sensitivity Analysis

Abstract
ML Estimation, PML Estimation and Filling are all based on the MAR assumption. In many applications this assumption is highly questionable. In general we have no chance to check the validity of the MAR assumption based on the observed contingency table: missing values in X2 cause 2 × J new cells in our contingency table, but the MAR assumption already introduces 2 × J new parameters (qij), i = 0, 1; j = 1,…, J; hence additional parameters are not identifiable.
Werner Vach
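The counting argument in this abstract can be written out explicitly. As a sketch, assume that the observed data are the cell counts of the table for (Y, X1, Z2), where Z2 has the K + 1 categories defined in Chapter 3, and that qij refers to outcome category i and X1 category j (an interpretation not spelled out in the abstract):

$$\underbrace{2\cdot J\cdot (K+1)}_{\text{cells including the category ``?''}}-\underbrace{2\cdot J\cdot K}_{\text{cells of the complete table}}=2\cdot J=\#\left\{ q_{ij}:\; i=0,1;\; j=1,\ldots ,J\right\}$$

Under this reading the 2 × J additional cells are exactly matched by the 2 × J parameters qij, so the observed table leaves no room to identify further parameters that would describe a departure from MAR.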

Generalizations

9. General Regression Models with Missing Values in One of Two Covariates

Abstract
So far we have considered the logistic regression model with two categorical covariates and missing values in one covariate. The considered techniques to handle missing values can also be applied in other regression models, at least if the covariates (and for some methods the outcome variable, too) are categorical. We will now discuss some generalizations to the case of covariates measured on an arbitrary scale, and this discussion will be done within the scope of rather general regression models. We do not consider the special case of Gaussian distributed errors, for which a lot of suggestions have been made. We refer to the excellent review of Little (1992). Note, however, that many of the methods described there depend on the assumption of a joint multivariate normal distribution of all variables.
Werner Vach

10. Generalizations for More Than Two Covariates

Abstract
We now consider the case where we have not 2, but p > 2 covariates X1,…, Xp. We do not consider Mean/Probability Imputation and Additional Category, because the basic criticism presented in Chapters 4 and 9 of course remains valid. Complete Case Analysis is also not considered in detail, but we should mention that its robustness against violations of the MAR assumption of course remains valid, too. Thus we restrict our attention in this chapter to ML Estimation, Semiparametric ML Estimation, and Estimation of the Score Function.
Werner Vach

11. Missing Values and Subsampling

Abstract
A special type of missing values in the covariates is due to sampling schemes in which some covariates are measured only for a subsample. As this subsample is planned in advance, the missing value mechanism satisfies the MAR assumption. In this chapter we consider the application of ML estimation, semiparametric ML estimation and estimation of the score function in this setting. Modifications of these approaches are necessary if the subsample is a validation sample with precise measurements of a covariate for which a surrogate covariate is measured for all subjects. Finally we consider subsampling of the nonresponders and the sampling of additional variables to avoid a violation of the MAR assumption.
Werner Vach

12. Further Examples

Abstract
The following examples illustrate some methodological issues in the application of methods to handle missing values in the covariates. In the first example the application of ML estimation allows a substantial gain in precision relative to a Complete Case Analysis, but doubts about the validity of the MAR assumption require a sensitivity analysis. In the second example a Complete Case Analysis gives a somewhat dubious result, and we have to examine whether this result is a methodological artifact.
Werner Vach

13. Discussion

Abstract
Missing values in the covariates are a challenge in the daily work of statisticians. The standard approach of most statistical software packages is the Complete Case Analysis. Most statisticians follow this suggestion although they feel that there must be more efficient ways. Hence many readers of this book will look for a list recommending better methods. Based on the investigations of this book such an attempt will be made in the next two sections. However, there does not (yet) exist a general standard technique to handle missing values in the covariates, but we are able to distinguish some (new) methods with good and some (old) methods with bad statistical properties. The difficulties presented in giving general recommendations also highlight some important issues for future research. Additional topics are summarized in the third section, and the book finishes with an important remark.
Werner Vach

Backmatter
