R by Example

  • 2024
  • Book

About this book

Now in its second edition, R by Example is an example-based introduction to the statistical computing environment that does not assume any previous familiarity with R or other software packages. R functions are presented in the context of interesting applications with real data.

The purpose of this book is to illustrate a range of statistical and probability computations using R for people who are learning, teaching, or using statistics. Specifically, it is written for users who have covered (or are currently studying) at least the equivalent of undergraduate-level, calculus-based courses in statistics. These users are learning or applying exploratory and inferential methods for analyzing data, and this book is intended to be a useful resource for learning how to implement these procedures in R.

The new edition includes expanded coverage of ggplot2 graphics, as well as new chapters on importing data and multivariate data methods.

Table of Contents

  1. Frontmatter

  2. Chapter 1. Introduction

    Jim Albert, Maria Rizzo
    Abstract
    This chapter provides an introduction to R and RStudio, starting with installing the software and packages to extend the language. Basic aspects of the R software system are described including objects for containing data such as vectors, matrices and data frames, and operations on these objects including functions. Methods for organizing R code and an introduction to dynamic report writing in Quarto and R Markdown are covered. Projects are helpful in organizing one’s work including code, data and reports. The R help system is described and some basic graphical methods are introduced using R graphics and the ggplot2 package.
  3. Chapter 2. Quantitative Data

    Jim Albert, Maria Rizzo
    Abstract
    Summary and graphical methods for analyzing different types of quantitative data are covered. The types include integer data, bivariate data, bivariate data with a grouping variable, multivariate data, and time series data. An example with sample means is used to illustrate the Central Limit Theorem. Many types of plots are illustrated using both the base R graphics and ggplot graphics systems.
  4. Chapter 3. Categorical Data

    Jim Albert, Maria Rizzo
    Abstract
    Categorical data can be character or factor type, requiring different methods than one uses to analyze quantitative data. In this chapter we focus on methods for tabulating, summarizing, and graphing categorical data. Contingency tables to compare two categorical data samples and a chi-square test for association are discussed. Plots and a chi-square goodness-of-fit test help to compare a sample to a known probability distribution. Methods for graphing patterns of association such as segmented bar charts, side-by-side barplots, and mosaic plots are included. Several other types of plots for categorical data using base R graphics and ggplot graphics are illustrated in the context of these examples.
  5. Chapter 4. Exploratory Data Analysis

    Jim Albert, Maria Rizzo
    Abstract
    Exploratory data analysis (EDA) seeks to learn general patterns or tendencies in data and find specific occurrences that deviate from the general patterns. John Tukey makes a clear distinction between confirmatory data analysis, where one draws inferential conclusions, and exploratory methods, where one places few assumptions on the distributional shape of the data and looks for interesting patterns. This chapter focuses on some fundamental exploratory methods described by Tukey, including batch comparison, relationships of two variables, patterns in time series, and exploring a batch of fractions. There are four general themes of exploratory data analysis, namely Revelation, Resistance, Residuals, and Reexpression, collectively called the four R’s. The four R’s are illustrated for the basic exploratory methods with suitable graphical methods to gain insight into the data.
  6. Chapter 5. Presentation Graphics

    Jim Albert, Maria Rizzo
    Abstract
    Graphics, including base R graphics and ggplot graphics, are used throughout this book. This chapter provides a survey of different graphical methods with a focus on designing attractive and informative plots. Various types of plots appear in the examples including time series plots, bubble plots, Cleveland dotplots, histograms with a reference density overlay, heat maps, and scatter plots with a fitted smooth curve. Multipanel displays and other methods for displaying multiple plots are described. Methods are given for exporting graphs from R graphics or ggplot graphics to one or more files in various formats. Additional plots and options to customize their appearance are discussed with hints in the Exercises.
  7. Chapter 6. Importing Data

    Jim Albert, Maria Rizzo
    Abstract
    Various methods are illustrated for importing data from local or internet text and .csv files, for situations when the data is unavailable in an R package. The base R functions read.table() and read.csv() as well as tidyverse package readr functions for importing data are illustrated by examples. Some data preprocessing tasks are described such as recoding data and identifying missing data. The chapter describes different methods for reshaping data from wide to long format using stack(), reshape(), melt() (reshape2 package) and pivot_longer() (tidyr package).
  8. Chapter 7. Basic Inference Methods

    Jim Albert, Maria Rizzo
    Abstract
    R has an excellent collection of functions to implement the basic methods of statistical inference that are typically covered in a first course in statistics. In this chapter we review implementations of basic testing methods for one and two samples using R. This material includes one-sample and two-sample tests for proportions and means and a one-sample t-test for paired data. Large sample and small sample methods for estimation of proportions are discussed. Nonparametric methods are discussed including the Wilcoxon Signed Rank test, the two-sample Mann-Whitney-Wilcoxon test and a permutation test for location.
  9. Chapter 8. Regression

    Jim Albert, Maria Rizzo
    Abstract
    Regression refers to a general class of statistical methods that relate a single response variable to one or more input (predictor) variables. This chapter first describes the simple linear regression model where one has a single predictor. R instructions for fitting the model, computing residuals, and graphing the fit and the residuals are included. Multiple regression using R is discussed where one has two or more input variables. Residual plots and plots of the predicted response and other relevant graphics are included. The chapter concludes by discussing examples of fitting a curve to model a non-linear relation between the response and predictor.
  10. Chapter 9. Analysis of Variance I

    Jim Albert, Maria Rizzo
    Abstract
    Analysis of Variance (ANOVA) is a statistical procedure for comparing means of two or more populations. In this chapter statistical methods for one-way ANOVA models are introduced, which help analyze differences in the mean response corresponding to the levels of a single group variable or factor. In several examples, the chapter describes R functions for an ANOVA F test for testing equality of treatment means, post-hoc tests to compare treatment means using Fisher’s Least Significant Difference and Tukey’s Honest Significant Difference methods with relevant plots.
  11. Chapter 10. Analysis of Variance II

    Jim Albert, Maria Rizzo
    Abstract
    This second ANOVA chapter considers randomized block designs and two-way ANOVA models. Randomized block designs model the effects of a single group variable or factor while controlling for another source of variation using blocks. Two-way ANOVA models explain differences in the mean response corresponding to the levels of two group variables (factors) and their possible interaction. Tukey HSD confidence intervals and plots for main effects and interactions, residual plots, interaction plots and other relevant graphics are described.
  12. Chapter 11. Randomization Tests

    Jim Albert, Maria Rizzo
    Abstract
    Randomization tests or permutation tests provide a nonparametric approach that does not require the test statistics to have specified distributions. A sampling distribution is obtained by resampling the data to generate a large number of test statistics under a null hypothesis. This chapter illustrates examples of the nonparametric approach including a randomization test for location, a test for correlation, and a test for independence. Permutation tests are also implemented using the boot() function from the boot package. Plots are used to illustrate the resampling distributions of the test statistics.
  13. Chapter 12. Multivariate Data

    Jim Albert, Maria Rizzo
    Abstract
    Descriptive statistics for multivariate data, methods for transforming the data, and some useful graphics tools such as scatterplot matrices and correlograms are described in this chapter. Examples illustrate common tasks with multivariate data such as methods of transformation, centering or scaling data to equalize variances, and computing eigenvalues and eigenvectors of the sample covariance matrix. Some of the operations are also illustrated using the tidyverse dplyr functions. Principal Components Analysis (PCA) can be implemented using the prcomp() or princomp() functions, and the chapter explains and interprets the output with relevant graphics, screeplots and biplots. The chapter concludes with a look at hierarchical cluster analysis implemented with the hclust() function.
  14. Chapter 13. Simulation Experiments

    Jim Albert, Maria Rizzo
    Abstract
    Simulation is a versatile tool to investigate the probability distribution of an outcome of a random event or experiment. In this chapter we focus on R functions that simplify the design and implementation of common types of simulation studies. Two functions that are used throughout are the sample() function for drawing random samples with or without replacement, and the replicate() function to easily repeat blocks of code including the sampling. The examples include some famous probability problems such as the collector’s problem and the hat-check problem, and the streaky patterns found among hitters in a baseball game. Simulations are summarized with suitable plots in R graphics and ggplot graphics.
  15. Chapter 14. Bayesian Modeling

    Jim Albert, Maria Rizzo
    Abstract
    Frequentist and Bayes are two general approaches in the development of statistical methods. Frequentist methods covered in this book include the familiar t confidence intervals, linear regression models, and ANOVA for testing equality of means. This chapter introduces the Bayesian approach to statistical inference by use of several illustrative examples. In the Bayes approach, one performs inference by the use of subjective probability. A prior density represents one’s initial opinion on the location of the parameter. After data is observed, by Bayes’ rule, one’s updated opinion about the parameter is expressed by the posterior distribution. One performs inference by summarizing the posterior distribution. One checks the validity of the model and predicts future data by the use of the predictive distribution. The Metropolis-Hastings algorithm is introduced as a practical method to draw samples from the posterior distribution.
  16. Chapter 15. Monte Carlo Methods

    Jim Albert, Maria Rizzo
    Abstract
    The Monte Carlo method is a general algorithm for estimating a definite integral based on outcomes of a simulation experiment. One can apply this algorithm to estimate the expectation of a function of a random variable where a random sample is drawn from the associated probability distribution. There is an associated standard error of a Monte Carlo estimate which provides insight into the accuracy of this simulation-based calculation. One can use Monte Carlo to simulate the sampling distribution of a statistical estimate which is helpful in computing the bias and standard error of one estimate, or in comparing the mean absolute error of two estimators. The Monte Carlo method is useful in determining the probability of coverage of an interval procedure. A Markov chain Monte Carlo algorithm is a general method for simulating from an arbitrary probability distribution. Metropolis-Hastings and Gibbs sampling are introduced as general methods for simulating from distributions. A variety of plots in R graphics and ggplot graphics are used to illustrate the concepts.
  17. Backmatter

Title: R by Example
Authors: Jim Albert, Maria Rizzo
Copyright Year: 2024
Electronic ISBN: 978-3-031-76074-7
Print ISBN: 978-3-031-76073-0
DOI: https://doi.org/10.1007/978-3-031-76074-7
