
2023 | Book

Visualization and Imputation of Missing Values

With Applications in R


About this Book

This book explores visualization and imputation techniques for missing values and presents practical applications using the statistical software R. It explains the concepts behind common imputation methods, with a focus on visualization, the description of data problems, and practical solutions in R, including modern methods of robust imputation, imputation based on deep learning, and imputation for complex data. By describing the advantages, disadvantages, and pitfalls of each method, the book gives a clear picture of which imputation methods are applicable for a given data set.

The material covered includes the pre-analysis of data; visualization of missing values in incomplete data; single and multiple imputation; deductive imputation and outlier replacement; model-based methods, including methods based on robust estimates; non-linear methods such as tree-based and deep learning methods; imputation of compositional data; evaluation of imputation quality, from visual diagnostics to precision measures, coverage rates, and prediction performance; and a description of different model- and design-based simulation designs for this evaluation. The book also features a topic-focused introduction to R, and R code is provided in each chapter to explain the practical application of the described methodology.

Addressed to researchers, practitioners and students who work with incomplete data, the book offers an introduction to the subject as well as a discussion of recent developments in the field. It is suitable for beginners to the topic and advanced readers alike.

Table of Contents

Frontmatter
Chapter 1. Topic-Focused Introduction to R and Data Sets Used
Abstract
The theoretical concepts explained in the book are illustrated by examples, which make use of the statistical software environment R. In this chapter, a short introduction to some functionalities of R is given. This introduction does not replace a general introduction to R, but it provides the background that is necessary to understand the examples and the R code in the book. First, the available software tools are briefly discussed before the focused introduction to R is given by also introducing the package VIM, which is used throughout the book. Finally, some interesting data sets are introduced. Most of them are used for demonstration purposes and exercises in this book.
Matthias Templ
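As a taste of the workflow introduced in this chapter, a minimal sketch of loading VIM and inspecting one of the book's data sets (the mammal sleep data shipped with VIM; assumes the package has been installed from CRAN):

```r
# Load VIM and the 'sleep' data set that ships with it;
# run install.packages("VIM") first if the package is not yet available.
library(VIM)
data(sleep, package = "VIM")

str(sleep)              # variable types and dimensions
colSums(is.na(sleep))   # number of missing values per variable
```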
Chapter 2. Distribution, Pre-analysis of Missing Values and Data Quality
Abstract
One should always be clear about the reasons for missing values and their structure. This chapter discusses the issues of why and how missing values occur in data sets, from clinical trials and associated censoring, to questionnaires with different types of non-response, to missing values due to measurement failure. In addition, the different types of mechanisms for missing values are described. Furthermore, different measurement scales of variables as well as the measurement of distances between observations with values of different scales are discussed. The case of outliers is also dealt with in this chapter. Dealing with outliers is a central topic in this book, as many data sets contain (masked) outliers and these can strongly influence the non-robust imputation methods. Subsequently, a distinction is made between univariate and multivariate imputation procedures and the idea of deterministic rule-based imputation is introduced.
Matthias Templ
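A base-R sketch (toy data, no extra packages) of the kind of pre-analysis described here: counting missing values per variable and inspecting the joint missingness pattern:

```r
# Toy data frame with missing values in two variables
d <- data.frame(
  age    = c(25, NA, 41, 33, NA, 58),
  income = c(NA, 2100, 3400, NA, 1800, 5200)
)

colSums(is.na(d))          # NAs per variable
mean(!complete.cases(d))   # share of rows with at least one NA
table(age_NA = is.na(d$age), income_NA = is.na(d$income))  # joint pattern
```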
Chapter 3. Detection of the Missing Values Mechanism with Tests and Models
Abstract
This chapter is aimed at interested readers who want to learn why statistical tests may not be suitable for rejecting or not rejecting the null hypothesis of missing completely at random. Even though these methods are of limited use, we focus on them for once to critically discuss their role. For readers who wish to focus on practical issues and more useful methods for assessing the structure of missing values in a data set, we refer to the visualization tools in Chap. 5.
Matthias Templ
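To make the discussion concrete, a toy base-R sketch of the simplest such test: comparing another variable between observations where a value is missing and where it is observed. A significant difference would speak against MCAR, while, as the chapter stresses, a non-significant one does not prove it:

```r
# Simulated toy data: 'income' is fully observed, 'age' is missing
# completely at random for about 20% of the observations.
set.seed(1)
income <- rnorm(200, mean = 3000, sd = 500)
age_missing <- runif(200) < 0.2   # MCAR missingness indicator for 'age'

tt <- t.test(income[age_missing], income[!age_missing])
tt$p.value   # under MCAR, no systematic difference is expected
```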
Chapter 4. Visualization of Missing Values
Abstract
The main objective of this chapter is to highlight the importance of exploring missing values using visualization methods and to present a collection of such visualization techniques for missing values in incomplete data. With visualization methods for missing values, we want to learn relationships between variables and explore the data set even in the presence of missing values. Most importantly, these methods are also useful to learn about the structure of the missing values, which influences the choice of imputation methods.
Matthias Templ
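Two VIM visualizations in this spirit, sketched under the assumption that the VIM package is installed:

```r
library(VIM)
data(sleep, package = "VIM")

# Frequencies and combinations of missing values across variables
aggr(sleep, numbers = TRUE)

# Bivariate scatterplot with the missing values shown in the margins
marginplot(sleep[, c("Sleep", "Dream")])
```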
Chapter 5. General Considerations on Univariate Methods: Single and Multiple Imputation
Abstract
Missing or invalid values clearly affect the quality of data analysis, model results, and classification performance. However, the methods and principles for imputing data vary widely.
This chapter focuses on general considerations for imputing missing values. At the outset, complete case analysis, where observations with missing values are deleted before the analysis, and univariate imputation methods are criticized. They are discussed once here and are afterwards only used as a (worst-case) benchmark. Central to this chapter is the introduction to multiple imputation, a principle for properly estimating the variance of estimators, and the types of randomness introduced during imputation. The concept of multiple imputation is often used in simple production environments, but rarely when the production of statistics is complex, for example, in official statistics. Finally, the difference between joint modeling approaches and fully conditional modeling is explained.
Matthias Templ
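A minimal multiple-imputation sketch with the mice package (assumed installed), using the nhanes example data shipped with mice: impute m = 5 times, fit a model in each completed data set, and pool the results with Rubin's rules:

```r
library(mice)
data(nhanes)   # small example data set with missing values

imp <- mice(nhanes, m = 5, seed = 1, printFlag = FALSE)  # 5 imputations
fit <- with(imp, lm(chl ~ age + bmi))  # analyze each completed data set
summary(pool(fit))                     # combine via Rubin's rules
```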
Chapter 6. Deductive Imputation and Outlier Replacement
Abstract
Some attempts to define erroneous values are “rule-based” approaches—identification by expertly developed data-related processing rules, followed by deletion and imputation. Note that these rules—although efficient and important in many situations—are strictly deterministic and ignore the probabilistic component when working with samples from a population.
The second part of this chapter deals with the replacement of outliers. After identifying potential outliers, they can be replaced by more reasonable values. It is pointed out that often robust imputation methods are more suitable than outlier replacement.
Matthias Templ
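A base-R sketch of the outlier-replacement idea using the 1.5*IQR boxplot rule; the median replacement here is deliberately crude, and as noted above, robust imputation methods are often more suitable:

```r
x <- c(10, 12, 11, 13, 12, 95)   # 95 is an obvious outlier
q <- quantile(x, c(0.25, 0.75))
fence <- 1.5 * (q[2] - q[1])
out <- x < q[1] - fence | x > q[2] + fence   # 1.5*IQR boxplot rule

x[out] <- NA                             # treat outliers as missing
x[is.na(x)] <- median(x, na.rm = TRUE)   # crude median replacement
x
```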
Chapter 7. Imputation Without a Formal Statistical Model
Abstract
Model-based methods can produce poor results if the models are misspecified. This is especially often the case when models are selected automatically, as with many model-based imputation methods. Imputation methods that do not rely on a statistical model are then often the preferred choice; moreover, they are frequently used because of their simplicity and good general performance. In this chapter, hot-deck methods, k-nearest neighbor methods, and methods that rely on covariance estimates, such as principal component imputation, are presented.
Matthias Templ
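A sketch of two such model-free methods as implemented in VIM (assumed installed); both append indicator columns marking which values were imputed:

```r
library(VIM)
data(sleep, package = "VIM")

sleep_knn <- kNN(sleep, k = 5)   # k-nearest-neighbor (donor) imputation
sleep_hd  <- hotdeck(sleep)      # hot-deck imputation

anyNA(sleep_knn[, names(sleep)])   # FALSE: all missing values imputed
```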
Chapter 8. Model-Based Methods
Abstract
Model-based methods are often used to impute missing values. If the model assumptions are satisfied, these types of methods are often superior to model-free methods. In this chapter, linear models are discussed, while the following chapters focus on nonlinear methods.
First, we introduce linear regression based on (classical) ordinary least squares (OLS) at a very basic level. OLS has some nice mathematical properties, but this type of method is strongly influenced by outliers. Robust methods give roughly the same results in the case of a multivariate normal distribution, but also give reliable results when the data contain artifacts and/or outliers.
Therefore, after an introduction to common concepts and implementations based on mice, we focus on the robust imputation methods available in the R package VIM: they give roughly the same results for elliptically symmetric (e.g., multivariate normally) distributed data, but behave better in practice when obvious or masked outliers are present.
Matthias Templ
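A sketch of the robust model-based imputation referred to here, via VIM's irmi() (iterative robust model-based imputation; VIM assumed installed):

```r
library(VIM)
data(sleep, package = "VIM")

# Iteratively imputes each variable by a regression on the others;
# robust = TRUE switches to robust regression estimators.
sleep_rob <- irmi(sleep, robust = TRUE)
anyNA(sleep_rob)   # FALSE after imputation
```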
Chapter 9. Nonlinear Methods
Abstract
So far, we have exclusively considered model-free methods and linear models for regression, since (1) the theory is simpler, (2) (imputation) models are easier to interpret, and (3) for small n and/or large p, linear models are often the only way to avoid overfitting the data.
What is new in this chapter is the consideration of nonlinearities between variables, that is, those that do not disappear by transforming variables, including quadratic terms, adding new features, or specifying interactions between predictors. Starting with tree-based methods such as imputation with random forests or XGBoost, we also introduce GAMs and show how ANNs can be used to impute missing values.
Matthias Templ
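One way to sketch tree-based imputation is mice with random-forest imputation models (method = "rf"; mice and its randomForest backend assumed installed):

```r
library(mice)
data(nhanes)

# Each incomplete variable is imputed by a random forest fitted
# on the other variables.
imp_rf <- mice(nhanes, method = "rf", m = 5, seed = 1, printFlag = FALSE)
head(complete(imp_rf, 1))   # first completed data set
```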
Chapter 10. Methods for Compositional Data
Abstract
Suppose you work with parts of a whole (possibly including missing values), for example, measurements in parts per million of chemical elements in a sample, chemical concentrations of elements in general, time use per day, expenditures, wages or income split into components, or any data where the rows add up to a constant. Then you should get familiar with compositional data analysis, log-ratio analysis, and the corresponding imputation methods.
Suppose, in addition, that your data set includes zeros, so that you cannot directly apply log-ratio techniques, since this would result in a division by zero. The concentration of some chemical elements may have been too low to measure (rounded zeros), and you are not sure whether you should replace these concentrations with a positive constant. Then you should get familiar with rounded zeros and their treatment using compositional zero-replacement methods.
This chapter attempts to provide answers to these questions and discusses compositional data analysis methods for imputing missing values and rounded zeros.
Before we discuss imputation methods for compositional data, we will introduce compositional data as well as log-ratio analysis.
Matthias Templ
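As a first taste of log-ratio analysis, a base-R sketch of the centered log-ratio (clr) transformation, clr(x) = log(x / g(x)) with g(x) the geometric mean; it is only defined for strictly positive parts, which is exactly why zeros need special treatment:

```r
# Centered log-ratio coordinates of a single composition
clr <- function(x) log(x) - mean(log(x))

comp <- c(0.2, 0.3, 0.5)   # one composition (parts summing to 1)
z <- clr(comp)
z
sum(z)                     # clr coordinates always sum to zero
```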
Chapter 11. Evaluation of the Quality of Imputation
Abstract
The aim of imputation is to complete a dataset and draw statistically valid conclusions from the imputed data. As we have seen in the previous chapters, there are a variety of imputation methods, and the choice of methods is nontrivial and data-dependent. To assess how well the data are imputed, we need evaluation criteria. First, we tend to check the imputed values with visual aids. This ranges from simple graphs to biplots and tours. For numerical quantification, precision measures and estimator-based evaluation measures (such as bias, MSE, or coverage rate) can be used to assess the quality of the imputation.
Matthias Templ
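A base-R sketch of estimator-based evaluation: impute artificially removed values and compute bias and MSE against the known truth (mean imputation is used here only as a simple, easily criticized baseline):

```r
set.seed(1)
truth <- rnorm(100, mean = 10)   # complete 'true' data
x <- truth
x[sample(100, 20)] <- NA         # remove 20 values completely at random

x_imp <- x
x_imp[is.na(x_imp)] <- mean(x, na.rm = TRUE)   # simple mean imputation

bias <- mean(x_imp) - mean(truth)     # bias of the mean estimator
mse  <- mean((x_imp - truth)^2)       # mean squared error vs. the truth
c(bias = bias, mse = mse)
```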
Chapter 12. Simulation of Data for Simulation Studies
Abstract
Simulation studies can be used to assess some of the properties of imputation methods. There are different types of simulations that can be performed: imputation of missing values in real data, model-based simulation studies where data are simulated from a model, and design-based simulation studies that consider complex situations with complex samples.
In general, simulation studies in the literature are kept too simple, for example, by simulating from a normal distribution or by using a simple regression model. The aim of this chapter is to provide a comprehensive overview of simulation studies and to focus on realistic, real-world simulation designs. This includes complex data generation as well as the consideration of outliers that influence the imputation methods.
Matthias Templ
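A deliberately simple model-based simulation sketch in base R, of exactly the kind criticized above as too simple, but useful to show the mechanics: simulate complete data, set values missing, impute, and record the bias of an estimator over many repetitions:

```r
set.seed(1)
bias <- replicate(200, {
  y <- rnorm(50, mean = 5)        # simulate complete data from a model
  y_miss <- y
  y_miss[runif(50) < 0.3] <- NA   # ~30% missing completely at random
  y_imp <- y_miss
  y_imp[is.na(y_imp)] <- median(y_miss, na.rm = TRUE)  # impute
  mean(y_imp) - mean(y)           # bias of the mean in this run
})
mean(bias)   # close to zero under MCAR for this estimator
```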
Backmatter
Metadata
Title
Visualization and Imputation of Missing Values
Written by
Matthias Templ
Copyright Year
2023
Electronic ISBN
978-3-031-30073-8
Print ISBN
978-3-031-30072-1
DOI
https://doi.org/10.1007/978-3-031-30073-8
