
2019 | Book

Advanced R Statistical Programming and Data Models

Analysis, Machine Learning, and Visualization


About this book

Carry out a variety of advanced statistical analyses including generalized additive models, mixed effects models, multiple imputation, machine learning, and missing data techniques using R. Each chapter starts with conceptual background information about the techniques, includes multiple examples using R to achieve results, and concludes with a case study.
Written by Matt Wiley and Joshua F. Wiley, Advanced R Statistical Programming and Data Models shows you how to conduct data analysis using the popular R language. You’ll delve into the preconditions or hypotheses for various statistical tests and techniques and work through concrete examples using R for a variety of these next-level analytics. This is a must-have guide and reference on using and programming with the R language.
What You’ll Learn
Conduct advanced analyses in R including: generalized linear models, generalized additive models, mixed effects models, machine learning, and parallel processing
Carry out regression modeling in R, including data visualization, linear and advanced regression, additive models, and survival/time-to-event analysis
Handle machine learning in R, including parallel processing, dimension reduction, and feature selection and classification
Address missing data using multiple imputation in R
Work on factor analysis, generalized linear mixed models, and modeling intraindividual variability
Who This Book Is For
Working professionals, researchers, or students who are familiar with R and basic statistical techniques such as linear regression and who want to learn how to use R to perform more advanced analytics. In particular, researchers and data analysts in the social sciences may benefit from these techniques. Additionally, analysts who need parallel processing to speed up analytics are given proven code to reduce time to results.

Table of Contents

Frontmatter
Chapter 1. Univariate Data Visualization
Abstract
Most statistical models discussed in the rest of the book make assumptions about the data and the best model to use for them. As data analysts, we often must specify the distribution that we assume the data come from. Anomalous values, also called extreme values or outliers, may also have undue influence on the results from many statistical models. In this chapter, we examine visual and graphical approaches to exploring the distributions and anomalous values for one variable at a time (i.e., univariate). The goal of this chapter is not specifically to create beautiful or publication quality graphs nor to show results, but rather to use graphs to understand the distribution of a variable and identify anomalous values. This chapter focuses on univariate data visualization; the next chapter will apply some of the same concepts to multivariate distributions and cover how to assess the relations between variables.

                library(checkpoint)
                checkpoint("2018-09-28", R.version = "3.5.1",
                  project = book_directory,
                  checkpointLocation = checkpoint_directory,
                  scanForPackages = FALSE,
                  scan.rnw.with.knitr = TRUE, use.knitr = TRUE)
               
                library(knitr)
                library(ggplot2)
                library(cowplot)
                library(MASS)
                library(JWileymisc)
                library(data.table)
               
                options(width = 70, digits = 2)
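
As a flavor of what the chapter covers, the following is a minimal sketch (not one of the book's own examples) of univariate exploration using ggplot2 and R's built-in mtcars data: a histogram, a density plot, and a boxplot of miles per gallon to examine its distribution and flag potential anomalous values.

                library(ggplot2)

                # histogram to inspect the shape of a single variable
                ggplot(mtcars, aes(x = mpg)) +
                  geom_histogram(bins = 10)

                # density plot for a smoothed view of the same distribution
                ggplot(mtcars, aes(x = mpg)) +
                  geom_density()

                # boxplot to flag potential anomalous (extreme) values
                ggplot(mtcars, aes(y = mpg)) +
                  geom_boxplot()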
              
Matt Wiley, Joshua F. Wiley
Chapter 2. Multivariate Data Visualization
Abstract
The previous chapter covered methods for univariate data visualization. This chapter continues that theme, but moves from visualizing single variables to visualizing multiple variables at a time. In addition to examining distributions and anomalous values as in the previous chapter, we also cover how to visualize the relations between variables. Visualizing relations between variables can help particularly for more traditional statistical models where the data analyst must specify the functional form (e.g., linear, quadratic, etc.). In later chapters we will also cover machine learning models that employ algorithms to learn the functional form in data without the analyst needing to specify it.

                library(checkpoint)
                checkpoint("2018-09-28", R.version = "3.5.1",
                  project = book_directory,
                  checkpointLocation = checkpoint_directory,
                  scanForPackages = FALSE,
                  scan.rnw.with.knitr = TRUE, use.knitr = TRUE)
               
                library(knitr)
                library(ggplot2)
                library(cowplot)
                library(MASS)
                library(mvtnorm)
                library(mgcv)
                library(quantreg)
                library(JWileymisc)
                library(data.table)
               
                options(width = 70, digits = 2)
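
As a brief illustration of visualizing relations between variables (a minimal sketch using R's built-in mtcars data, not an example from the book), a scatterplot with both a loess smoother and a linear fit overlaid helps in judging whether a bivariate relation looks linear or nonlinear.

                library(ggplot2)

                # scatterplot with a loess smoother and a linear fit overlaid,
                # to help judge the functional form of the relation
                ggplot(mtcars, aes(x = wt, y = mpg)) +
                  geom_point() +
                  geom_smooth(method = "loess", se = FALSE) +
                  geom_smooth(method = "lm", se = FALSE, linetype = "dashed")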
              
Matt Wiley, Joshua F. Wiley
Chapter 3. GLM 1
Abstract
Generalized linear models (GLMs) are a broad class of models; regression and analysis of variance (ANOVA) are other terms or analyses that are often used to refer to GLMs. This chapter uses a number of packages, shown as follows. We run our setup code to load these and to make data tables print in a neat fashion.

                library(checkpoint)
                checkpoint("2018-09-28", R.version = "3.5.1",
                  project = book_directory,
                  checkpointLocation = checkpoint_directory,
                  scanForPackages = FALSE,
                  scan.rnw.with.knitr = TRUE, use.knitr = TRUE)
               
                library(knitr)
                library(data.table)
                library(ggplot2)
                library(visreg)
                library(ez)
                library(emmeans)
                library(rms)
                library(ipw)
                library(JWileymisc)
                library(RcppEigen)
                library(texreg)
               
                options(
                  width = 70,
                  stringsAsFactors = FALSE,
                  datatable.print.nrows = 20,
                  datatable.print.topn = 3,
                  digits = 2)
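
As a taste of the chapter's topic, here is a minimal sketch (using R's built-in mtcars data rather than the book's examples) showing that an ordinary linear regression is simply a GLM with a Gaussian family and identity link.

                # linear regression fit as a Gaussian GLM with an identity link
                m <- glm(mpg ~ wt + hp, data = mtcars, family = gaussian())
                summary(m)

                # the same model via lm(); the coefficients are identical
                coef(lm(mpg ~ wt + hp, data = mtcars))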
              
Matt Wiley, Joshua F. Wiley
Chapter 4. GLM 2
Abstract
Generalized linear models (GLMs) also can accommodate outcomes that are not continuous and normally distributed. Indeed, one of the great advantages of GLMs is that they provide a unified framework for understanding regression models applied to variables assumed to come from a variety of distributions. For this chapter, we will lean heavily on one excellent R package, VGAM, which provides utilities for vector generalized linear models (VGLMs) and vector generalized additive models (VGAMs) [125]. VGLMs and VGAMs are an even more flexible class of models in which there may be multiple responses. Beyond offering the flexibility of multiple parameters, the VGAM package implements over 20 link functions and well over 50 different models/assumed distributions. We will only scratch the surface of the VGAM package's capabilities in this chapter, but its great flexibility means that we will not need to introduce many different packages nor many different functions. If you would like to learn about VGLMs and VGAMs in far greater depth, we recommend an excellent book by the author of the VGAM package [125].

                library(checkpoint)
                checkpoint("2018-09-28", R.version = "3.5.1",
                  project = book_directory,
                  checkpointLocation = checkpoint_directory,
                  scanForPackages = FALSE,
                  scan.rnw.with.knitr = TRUE, use.knitr = TRUE)
               
                library(knitr)
                library(data.table)
                library(ggplot2)
                library(ggthemes)
                library(scales)
                library(viridis)
                library(VGAM)
                library(ipw)
                library(JWileymisc)
                library(xtable)
                library(texreg)
               
                options(
                  width = 70,
                  stringsAsFactors = FALSE,
                  datatable.print.nrows = 20,
                  datatable.print.topn = 3,
                  digits = 2)
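
For a flavor of modeling a non-normal outcome, the following is a minimal sketch (using R's built-in mtcars data, not the book's own data) of a logistic regression fit with VGAM's vglm() and, for comparison, base R's glm().

                library(VGAM)

                # logistic regression for a binary outcome (transmission type) via vglm()
                m1 <- vglm(am ~ wt, family = binomialff, data = mtcars)
                summary(m1)

                # the base-R equivalent for comparison; coefficients should match
                m2 <- glm(am ~ wt, family = binomial(), data = mtcars)
                coef(m2)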
              
Matt Wiley, Joshua F. Wiley
Chapter 5. GAMs
Abstract
Generalized additive models (GAMs) are extensions of the generalized linear models (GLMs) we discussed in earlier chapters. Like GLMs, GAMs accommodate outcomes that are either continuous or discrete. However, unlike GLMs, which are fully parametric models, GAMs are semi-parametric models. GAMs allow a mix of parametric and nonparametric associations between outcome and predictors. For this chapter, we will lean heavily on one excellent R package, VGAM, which provides utilities for vector generalized linear models (VGLMs) and vector generalized additive models (VGAMs) [125]. VGAMs are an even more flexible class of models than GAMs in which there may be multiple responses. Beyond offering the flexibility of multiple parameters, the VGAM package implements over 20 link functions and well over 50 different models/assumed distributions. We will only scratch the surface of the VGAM package's capabilities in this chapter, but its great flexibility means that we will not need to introduce many different packages nor many different functions. If you would like to learn about VGAMs in far greater depth, we recommend an excellent book by the author of the VGAM package [125].

                library(checkpoint)
                checkpoint("2018-09-28", R.version = "3.5.1",
                  project = book_directory,
                  checkpointLocation = checkpoint_directory,
                  scanForPackages = FALSE,
                  scan.rnw.with.knitr = TRUE, use.knitr = TRUE)
               
                library(knitr)
                library(data.table)
                library(ggplot2)
                library(ggthemes)
                library(scales)
                library(viridis)
                library(car)
                library(mgcv)
                library(VGAM)
                library(ipw)
                library(JWileymisc)
                library(xtable)
               
                options(
                  width = 70,
                  stringsAsFactors = FALSE,
                  datatable.print.nrows = 20,
                  datatable.print.topn = 3,
                  digits = 2)
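
As a minimal sketch of a GAM (using the mgcv package loaded above and R's built-in mtcars data, not the book's own examples), the model below mixes a nonparametric smooth term for weight with a parametric term for horsepower.

                library(mgcv)

                # GAM with a smooth (nonparametric) term for weight and a
                # parametric (linear) term for horsepower
                m <- gam(mpg ~ s(wt) + hp, data = mtcars)
                summary(m)

                # plot the estimated smooth to inspect its shape
                plot(m, pages = 1)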
              
Matt Wiley, Joshua F. Wiley
Chapter 6. ML: Introduction
Abstract
Machine learning (ML) is, in the authors’ opinions at least, a rather amorphous toolkit of computer-aided statistics. While our eventual targets will be support vector machines, classification and regression trees, and artificial neural networks using some recent R packages, at its heart machine learning is simply pattern recognition of various flavors.
Matt Wiley, Joshua F. Wiley
Chapter 7. ML: Unsupervised
Abstract
This chapter focuses on unsupervised machine learning, which typically deals with unlabelled data. The objective is to somehow sort these data into similar groups based on common feature(s). Often, although not always, unsupervised machine learning also is used as a type of dimension reduction. For example, if you get a dataset with hundreds or thousands of features, but only a few thousand cases, you may wish to first utilize unsupervised learning to distil the large number of features into a smaller number of dimensions that still capture most of the information from the larger set. Unsupervised machine learning also makes a good final step of the exploratory data analysis phase. Part of the sorting or clustering in unsupervised machine learning can be leveraged to understand how many “unique” groups or dimensions your data have. Imagine a dataset that is comprised of various indicators from several distinct geographic regions. One might expect an unsupervised grouping technique to indicate something about the geographic regions. Or, one might discover that physically distant locations have several highly common features.
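As a minimal sketch of these ideas (using base R's prcomp() and kmeans() on the built-in mtcars data, not the book's own examples), dimension reduction followed by clustering might look like this:

                # scale the numeric features so no single variable dominates the distances
                X <- scale(mtcars)

                # principal components analysis as a simple dimension reduction step
                pca <- prcomp(X)
                summary(pca)  # proportion of variance captured by each component

                # k-means clustering on the first two components to sort cases into groups
                set.seed(1234)
                km <- kmeans(pca$x[, 1:2], centers = 3)
                table(km$cluster)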
Matt Wiley, Joshua F. Wiley
Chapter 8. ML: Supervised
Abstract
This chapter covers supervised machine learning (ML) and classification.
Matt Wiley, Joshua F. Wiley
Chapter 9. Missing Data
Abstract
Missing data are common in nearly all real-world analyses. This chapter formally introduces the concept of missing data, including common ways of describing missingness. Then we discuss some of the potential ways missing data can be addressed in analysis. The main package we will use in this chapter is the mice package, which offers robust features for handling missing data and minimizing the impact of missing data on analysis results [95].

                library(checkpoint)
                checkpoint("2018-09-28", R.version = "3.5.1",
                  project = book_directory,
                  checkpointLocation = checkpoint_directory,
                  scanForPackages = FALSE,
                  scan.rnw.with.knitr = TRUE, use.knitr = TRUE)
               
                library(knitr)
                library(ggplot2)
                library(cowplot)
                library(lattice)
                library(viridis)
                library(VIM)
               
                library(mice)
                library(micemd)
                library(parallel)
               
                library(data.table)
                library(xtable)
                library(JWileymisc) # has data
               
                options(width = 70, digits = 2)
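
As a minimal sketch of multiple imputation with mice (using R's built-in airquality data, which already contains missing values, rather than the book's data), the typical workflow is impute, analyze each completed dataset, then pool the results.

                library(mice)

                # airquality has missing values in Ozone and Solar.R
                d <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]

                # create 5 multiply imputed datasets using predictive mean matching
                imp <- mice(d, m = 5, method = "pmm", seed = 1234, printFlag = FALSE)

                # fit the analysis model in each imputed dataset and pool the estimates
                fit <- with(imp, lm(Ozone ~ Wind + Temp))
                summary(pool(fit))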
              
Matt Wiley, Joshua F. Wiley
Chapter 10. GLMMs: Introduction
Abstract
Generalized linear mixed models (GLMMs) extend the generalized linear models (GLMs), introduced in previous chapters, to statistically account for data that are clustered (e.g., children within schools, individuals within a particular hospital clinic, repeated measures on the same person) and render these non-independent observations conditionally independent.
Matt Wiley, Joshua F. Wiley
Chapter 11. GLMMs: Linear
Abstract
This chapter builds on the foundation of working with multilevel data and introduces a class of statistical models—generalized linear mixed models (GLMMs)—that are appropriate for such data.

                library(checkpoint)
                checkpoint("2018-09-28", R.version = "3.5.1",
                  project = book_directory,
                  checkpointLocation = checkpoint_directory,
                  scanForPackages = FALSE,
                  scan.rnw.with.knitr = TRUE, use.knitr = TRUE)
               
                library(knitr)
                library(ggplot2)
                library(cowplot)
                library(viridis)
                library(JWileymisc)
                library(data.table)
                library(lme4)
                library(lmerTest)
                library(chron)
                library(zoo)
                library(pander)
                library(texreg)
                library(xtable)
                library(splines)
                library(parallel)
                library(boot)
               
                options(width = 70, digits = 2)
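
As a minimal sketch of a linear mixed model (using lme4's built-in sleepstudy data rather than the book's own data), the model below allows both the intercept and the effect of Days to vary randomly across subjects.

                library(lme4)
                library(lmerTest)  # adds p-values to lmer() summaries

                # random intercept and random slope for Days, clustered by Subject
                m <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
                summary(m)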
              
Matt Wiley, Joshua F. Wiley
Chapter 12. GLMMs: Advanced
Abstract
This chapter on generalized linear mixed models (GLMMs) builds on the foundation of working with multilevel data from the GLMMs Introduction chapter and the GLMMs Linear chapter that focused strictly on continuous, normally distributed outcomes. This chapter focuses on GLMMs for other types of outcomes, specifically for binary outcomes and count outcomes.
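As a minimal sketch of GLMMs for non-continuous outcomes (using lme4's built-in cbpp data rather than the book's own data), the models below illustrate a mixed-effects logistic regression for a binomial outcome and a mixed-effects Poisson regression for a count outcome.

                library(lme4)

                # mixed-effects logistic regression for a binomial outcome
                # (disease incidence within herds)
                m1 <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
                            data = cbpp, family = binomial)
                summary(m1)

                # mixed-effects Poisson regression treating incidence as a count,
                # with an exposure offset
                m2 <- glmer(incidence ~ period + (1 | herd) + offset(log(size)),
                            data = cbpp, family = poisson)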
Matt Wiley, Joshua F. Wiley
Chapter 13. Modelling IIV
Abstract
Up until this point, we have focused exclusively on statistical models of the location (or mean) of a distribution. This chapter focuses on something new: the scale or variability of a distribution. Specifically, this chapter introduces the concept of intra-individual variability (IIV), the variability within individual units across repeated assessments. Although a relatively niche area of study, IIV provides additional information about an individual unit and allows new types of research or practical questions to be evaluated, such as whether people (schools, factories, etc.) with greater variability have different outcomes. This chapter makes use of the varian package, which was developed by one of the authors specifically for variability analysis.

                library(checkpoint)
                checkpoint("2018-09-28", R.version = "3.5.1",
                  project = book_directory,
                  checkpointLocation = checkpoint_directory,
                  scanForPackages = FALSE,
                  scan.rnw.with.knitr = TRUE, use.knitr = TRUE)
               
                library(knitr)
                library(ggplot2)
                library(cowplot)
                library(viridis)
                library(data.table)
                library(JWileymisc)
                library(varian)
                library(mice)
                library(parallel)
               
                options(width = 70, digits = 2)
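
The book's full IIV models use the varian package; as a purely descriptive, naive illustration of the idea (using R's built-in ChickWeight data, not the book's data), one can compute each unit's mean and within-unit standard deviation across repeated assessments with data.table.

                library(data.table)

                # ChickWeight contains repeated weight measurements on each chick
                d <- as.data.table(ChickWeight)

                # per-chick mean (location) and intra-individual standard deviation (variability)
                iiv <- d[, .(mean_weight = mean(weight), isd_weight = sd(weight)), by = Chick]
                head(iiv)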
              
Matt Wiley, Joshua F. Wiley
Backmatter
Metadata
Title
Advanced R Statistical Programming and Data Models
Authors
Matt Wiley
Joshua F. Wiley
Copyright Year
2019
Publisher
Apress
Electronic ISBN
978-1-4842-2872-2
Print ISBN
978-1-4842-2871-5
DOI
https://doi.org/10.1007/978-1-4842-2872-2
