
About this book

Data fusion, or statistical file matching, techniques merge data sets from different survey samples to solve the problem that arises when no single file contains all the variables of interest. Media agencies merge television and purchasing data; statistical offices match tax information with income surveys. Many traditional applications are known, but information about these procedures is often difficult to obtain. The author proposes the use of multiple imputation (MI) techniques with informative prior distributions to overcome the conditional independence assumption. By means of MI, the sensitivity of the unconditional association of the variables not jointly observed can be displayed. An application of the alternative approaches to real-world data concludes the book.

Table of contents

Frontmatter

1. Introduction

Abstract
It seems that statistical matching splits the field of statistics in two. On the one hand, theoretical and practical statisticians who are sceptical about the power of matching techniques criticize and reject statistical matching. This is reported, e.g., by Moriarity and Scheuren (2001), Judkins (1998), Gabler (1997), Bennike (1987), Rodgers (1984), Woodbury (1983), and Sims (1972a and b). On the other hand, renowned statistical offices such as Statistics Canada, as well as market research companies, especially in Europe, have done or are still doing statistical matching, which in Europe is typically called data fusion. From time to time, reports are published stating that data from different sources have been matched successfully. Positive experiences with statistical matching have been published in a wide variety of journals or as internal reports or working papers, e.g., by Aluja-Banet and Thio (2001), Wendt (1976, 1986, 2000), Kovacevic and Liu (1994), Liu and Kovacevic (1996, 1997, 1998), Roberts (1994), Baker (1990), Baker et al. (1989), Antoine (1987), Antoine and Santini (1987), Scheler and Wiegand (1987), Wiegand (1986), Okner (1974), Ruggles and Ruggles (1974), Ruggles et al. (1977), and Okner (1972a and b).
Susanne Rässler

2. Frequentist Theory of Statistical Matching

Abstract
The objective of statistical matching techniques is to generate a new data set that allows more flexible analysis than either single data set alone. In particular, the associations between variables never jointly observed are specified in such a completed data set. In this chapter we examine whether a statistically matched file may be analyzed as if it were a single sample.
Susanne Rässler

3. Practical Applications of Statistical Matching

Abstract
Much of the literature describing the traditional approaches and techniques used in practice consists of working papers and technical or internal reports. Often they are difficult to obtain, if available at all. Most of the reports or articles published are less theoretical; details about the final matching algorithms are often best explained in private talks or at conferences. No comprehensive work addressing new and recently used matching techniques is available. In this chapter we summarize and record the history of statistical matching techniques and briefly explain some of the first solutions. Different techniques that had, and still have, great importance for practical applications are then discussed in more detail. Often our information is based on unpublished reports supplied by experts practicing statistical matching. Thus we hope to fill a gap in the literature and explain what is often left to the reader’s imagination.
Susanne Rässler

4. Alternative Approaches to Statistical Matching

Abstract
Throughout this chapter we consider the matching problem as a problem of “file concatenation”, a term coined by Rubin (1986); see Figure 4.1. First the two files A and B are concatenated, and then the missing values of each part are multiply imputed to reflect uncertainty about the missing data and the unknown association of the variables never jointly observed. Thus our task is again to impute the missing data of X in file A and the missing data of Y in file B. In file A, U_obs denotes the variables Z and Y; in file B, it denotes Z and X. Basically, this is a classical imputation problem.
Susanne Rässler
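
The file concatenation idea from the Chapter 4 abstract can be illustrated with a small sketch. It is not the procedure developed in the book: the function names (`concatenate_files`, `impute_once`, `multiply_impute`) are hypothetical, there is a single common variable Z, and each imputation uses a bootstrap-based stochastic regression on Z alone, which implicitly assumes conditional independence of X and Y given Z.

```python
import numpy as np


def concatenate_files(z_a, y_a, z_b, x_b):
    """Stack file A (Z, Y observed) on file B (Z, X observed).

    Columns of the result are Z, X, Y; cells never observed are NaN.
    """
    n_a, n_b = len(z_a), len(z_b)
    z = np.concatenate([z_a, z_b])
    x = np.concatenate([np.full(n_a, np.nan), x_b])
    y = np.concatenate([y_a, np.full(n_b, np.nan)])
    return np.column_stack([z, x, y])


def impute_once(data, rng):
    """Fill X in file A and Y in file B by stochastic regression on Z.

    Assumption (illustrative only): a linear model with normal errors,
    refitted on a bootstrap sample of the donors so that parameter
    uncertainty is roughly reflected across imputations.
    """
    out = data.copy()
    z = out[:, 0]
    for col in (1, 2):  # column 1 is X, column 2 is Y
        observed = ~np.isnan(data[:, col])
        missing = ~observed
        donors = rng.choice(np.flatnonzero(observed), size=observed.sum(), replace=True)
        design = np.column_stack([np.ones(donors.size), z[donors]])
        beta, *_ = np.linalg.lstsq(design, data[donors, col], rcond=None)
        residuals = data[donors, col] - design @ beta
        sigma = residuals.std(ddof=2)
        noise = rng.normal(0.0, sigma, missing.sum())
        out[missing, col] = beta[0] + beta[1] * z[missing] + noise
    return out


def multiply_impute(data, m=5, seed=0):
    """Return m completed copies of the concatenated file."""
    rng = np.random.default_rng(seed)
    return [impute_once(data, rng) for _ in range(m)]
```

Because every imputation here conditions on Z only, the completed files reproduce conditional independence between X and Y; an informative prior on the conditional association would be needed to move away from that assumption.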

5. Empirical Evaluation of Alternative Approaches

Abstract
In this chapter we consider a typical European statistical matching task where multiple categorical, continuous, and so-called semicontinuous variables concerning media and television behavior have been recorded in separate surveys in addition to the usual demographic and socioeconomic information. Our goal here is less to analyze or describe the relationship among variables in a meaningful way than to find out whether the procedures for imputing missing values preserve important features of marginal and joint distributions. We want to investigate the performance of the alternative matching techniques proposed in Chapter 4 when applied to real data sets, which typically do not follow simplifying assumptions. The validity of a matching technique is measured according to the four levels we introduced in Chapter 2. The data are provided by the television behavior panel run by the largest German market research company. This data set is matched regularly with the purchasing behavior panel; the procedure actually applied by GfK is described in Section 3.3.5.
Susanne Rässler

6. Synopsis and Outlook

Abstract
In Chapter 1 we point out that the statistical matching task may be viewed as a problem of nonresponse; more precisely, the missing information is regarded as missing at random because the missingness is induced by the study design of the separate samples. The missing data are due to unasked questions, and the missingness mechanism is regarded as ignorable, which in principle makes conventional multiple imputation techniques an obvious choice. However, in contrast to traditional missingness patterns, statistical matching is characterized by its identification problem. The association of the variables never jointly observed is unidentifiable and cannot be estimated by means of likelihood inference. Therefore prior information has to be embedded in the estimation process. Statistically matched files tend to display conditional independence between the variables observed only in separate files. We show that the validity of the traditional matching techniques, in terms of preserving the true association of the variables never jointly observed, depends on the explanatory power of the common variables. Following an approach published by Rubin (1987), we propose the use of multiple imputation techniques with informative prior distributions to overcome the conditional independence assumption. By means of MI, the sensitivity of the unconditional association of the (specific) variables not jointly observed can be displayed. In other words, different prior settings of conditional associations allow us to show the extent to which unconditional associations are determined by the common variables.
Susanne Rässler
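
The last point of the Chapter 6 abstract can be made concrete with a small worked example. The sketch below is an illustration, not the book's method: it uses the standard partial-correlation identity for three standardized variables with a single common variable Z, namely rho_XY = rho_XZ * rho_YZ + rho_XY|Z * sqrt((1 - rho_XZ^2) * (1 - rho_YZ^2)). The function names and the grid of prior values are hypothetical.

```python
import math


def unconditional_corr(rho_xz, rho_yz, rho_xy_given_z):
    """Correlation of X and Y implied by a chosen conditional correlation given Z."""
    scale = math.sqrt((1 - rho_xz ** 2) * (1 - rho_yz ** 2))
    return rho_xz * rho_yz + rho_xy_given_z * scale


def sensitivity(rho_xz, rho_yz, priors=(-0.9, -0.5, 0.0, 0.5, 0.9)):
    """Map each assumed conditional correlation to the implied unconditional one."""
    return {p: unconditional_corr(rho_xz, rho_yz, p) for p in priors}


def sensitivity_width(rho_xz, rho_yz, priors=(-0.9, -0.5, 0.0, 0.5, 0.9)):
    """Spread of the implied unconditional correlation over the prior grid."""
    values = sensitivity(rho_xz, rho_yz, priors).values()
    return max(values) - min(values)
```

A prior value of 0.0 corresponds to conditional independence. Comparing `sensitivity_width(0.99, 0.99)` with `sensitivity_width(0.2, 0.2)` shows the dependence on explanatory power: when Z explains X and Y well, the choice of prior barely moves the unconditional correlation, whereas with weak common variables the result is driven almost entirely by the prior.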

Backmatter
