R by Example

  • 2024
  • Book

About this book

Now in its second edition, R by Example is an example-based introduction to the statistical computing environment that assumes no previous familiarity with R or other software packages. R functions are presented in the context of interesting applications with real data. The purpose of this book is to illustrate a range of statistical and probability computations using R for people who are learning, teaching, or using statistics. Specifically, it is written for users who have covered (or are currently studying) at least the equivalent of undergraduate calculus-based courses in statistics. These users are learning or applying exploratory and inferential methods for analyzing data, and this book is intended to be a useful resource for learning how to implement these procedures in R. The new edition includes expanded coverage of ggplot2 graphics, as well as new chapters on importing data and multivariate data methods.

Table of contents

  1. Frontmatter

  2. Chapter 1. Introduction

    Jim Albert, Maria Rizzo
    Abstract
    This chapter provides an introduction to R and RStudio, starting with installing the software and packages to extend the language. Basic aspects of the R software system are described including objects for containing data such as vectors, matrices and data frames, and operations on these objects including functions. Methods for organizing R code and an introduction to dynamic report writing in Quarto and R Markdown are covered. Projects are helpful in organizing one’s work including code, data and reports. The R help system is described and some basic graphical methods are introduced using R graphics and the ggplot2 package.
  3. Chapter 2. Quantitative Data

    Jim Albert, Maria Rizzo
    Abstract
    Summary and graphical methods for analyzing different types of quantitative data are covered. The types include integer data, bivariate data, bivariate data with a grouping variable, multivariate data, and time series data. An example with sample means is used to illustrate the Central Limit Theorem. Many types of plots are illustrated using both the base R graphics and ggplot graphics systems.
  4. Chapter 3. Categorical Data

    Jim Albert, Maria Rizzo
    Abstract
    Categorical data can be character or factor type, requiring different methods than one uses to analyze quantitative data. In this chapter we focus on methods for tabulating, summarizing, and graphing categorical data. Contingency tables to compare two categorical data samples and a chi-square test for association are discussed. Plots and a chi-square goodness-of-fit test help to compare a sample to a known probability distribution. Methods for graphing patterns of association such as segmented bar charts, side-by-side barplots, and mosaic plots are included. Several other types of plots for categorical data using base R graphics and ggplot graphics are illustrated in the context of these examples.
  5. Chapter 4. Exploratory Data Analysis

    Jim Albert, Maria Rizzo
    Abstract
    Exploratory data analysis (EDA) seeks to learn general patterns or tendencies in data and find specific occurrences that deviate from the general patterns. John Tukey makes a clear distinction between confirmatory data analysis, where one draws inferential conclusions, and exploratory methods, where one places few assumptions on the distributional shape of the data and looks for interesting patterns. This chapter focuses on some fundamental exploratory methods described by Tukey including batch comparison, relationships of two variables, patterns in time series and exploring a batch of fractions. There are four general themes of exploratory data analysis, namely Revelation, Resistance, Residuals, and Reexpression, collectively called the four R’s. The four R’s are illustrated for the basic exploratory methods with suitable graphical methods to gain insight into the data.
  6. Chapter 5. Presentation Graphics

    Jim Albert, Maria Rizzo
    Abstract
    Graphics, including base R graphics and ggplot graphics, are used throughout this book. This chapter provides a survey of different graphical methods with a focus on designing attractive and informative plots. Various types of plots appear in the examples including time series plots, bubble plots, Cleveland dotplots, histograms with a reference density overlay, heat maps, and scatter plots with a fitted smooth curve. Multipanel displays and other methods for displaying multiple plots are described. Methods are given for exporting graphs from R graphics or ggplot graphics to one or more files in various formats. Additional plots and options to customize their appearance are discussed with hints in the Exercises.
  7. Chapter 6. Importing Data

    Jim Albert, Maria Rizzo
    Abstract
    Various methods are illustrated for importing data from local or internet text and .csv files, for situations when the data is unavailable in an R package. The base R functions read.table() and read.csv() as well as tidyverse package readr functions for importing data are illustrated by examples. Some data preprocessing tasks are described such as recoding data and identifying missing data. The chapter describes different methods for reshaping data from wide to long format using stack(), reshape(), melt() (reshape2 package) and pivot_longer() (tidyr package).
  8. Chapter 7. Basic Inference Methods

    Jim Albert, Maria Rizzo
    Abstract
    R has an excellent collection of functions to implement the basic methods of statistical inference that are typically covered in a first course in statistics. In this chapter we review implementations of basic testing methods for one and two samples using R. This material includes one-sample and two-sample tests for proportions and means and a one-sample t-test for paired data. Large-sample and small-sample methods for estimation of proportions are discussed. Nonparametric methods are discussed including the Wilcoxon Signed Rank test, the two-sample Mann-Whitney-Wilcoxon test and a permutation test for location.
  9. Chapter 8. Regression

    Jim Albert, Maria Rizzo
    Abstract
    Regression refers to a general class of statistical methods that relate a single response variable to one or more input (predictor) variables. This chapter first describes the simple linear regression model where one has a single predictor. R instructions for fitting the model, computing residuals, and graphing the fit and the residuals are included. Multiple regression using R is discussed where one has two or more input variables. Residual plots and plots of the predicted response and other relevant graphics are included. The chapter concludes by discussing examples of fitting a curve to model a non-linear relation between the response and predictor.
  10. Chapter 9. Analysis of Variance I

    Jim Albert, Maria Rizzo
    Abstract
    Analysis of Variance (ANOVA) is a statistical procedure for comparing means of two or more populations. In this chapter statistical methods for one-way ANOVA models are introduced, which help analyze differences in the mean response corresponding to the levels of a single group variable or factor. In several examples, the chapter describes R functions for an ANOVA F test for testing equality of treatment means, post-hoc tests to compare treatment means using Fisher’s Least Significant Difference and Tukey’s Honest Significant Difference methods with relevant plots.
  11. Chapter 10. Analysis of Variance II

    Jim Albert, Maria Rizzo
    Abstract
    This second ANOVA chapter considers randomized block designs and two-way ANOVA models. Randomized block designs model the effects of a single group variable or factor while controlling for another source of variation using blocks. Two-way ANOVA models explain differences in the mean response corresponding to the levels of two group variables (factors) and their possible interaction. Tukey HSD confidence intervals and plots for main effects and interactions, residual plots, interaction plots and other relevant graphics are described.
  12. Chapter 11. Randomization Tests

    Jim Albert, Maria Rizzo
    Abstract
    Randomization tests or permutation tests provide a nonparametric approach that does not require the test statistics to have specified distributions. A sampling distribution is obtained by resampling the data to generate a large number of test statistics under a null hypothesis. This chapter illustrates examples of the nonparametric approach including a randomization test for location, a test for correlation, and a test for independence. Permutation tests are also implemented using the boot() function from the boot package. Plots are used to illustrate the resampling distributions of the test statistics.
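
The resampling idea in this abstract can be sketched briefly. The following is a minimal illustration written in Python rather than R, and it is not the book's code: the function name `permutation_test`, the two small samples, and the choice of the difference in means as the test statistic are all assumptions made for the example.

```python
import random
from statistics import mean


def permutation_test(x, y, n_perm=5000, seed=1):
    """Two-sided randomization test for a difference in means (illustrative)."""
    rng = random.Random(seed)
    observed = mean(x) - mean(y)
    pooled = list(x) + list(y)
    n_x = len(x)
    extreme = 0
    for _ in range(n_perm):
        # Under the null hypothesis the group labels are exchangeable,
        # so shuffle the pooled data and split it again.
        rng.shuffle(pooled)
        stat = mean(pooled[:n_x]) - mean(pooled[n_x:])
        if abs(stat) >= abs(observed):
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (extreme + 1) / (n_perm + 1)
    return observed, p_value


# Made-up data, used only to exercise the function.
group_a = [12.1, 14.3, 13.8, 15.2, 16.0, 14.9]
group_b = [10.2, 11.5, 9.8, 12.0, 10.9, 11.1]
observed, p_value = permutation_test(group_a, group_b)
print(f"observed difference = {observed:.3f}, p-value = {p_value:.4f}")
```

The p-value is the share of reshuffled statistics at least as extreme as the observed one, which is why no distributional form for the statistic has to be specified.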
  13. Chapter 12. Multivariate Data

    Jim Albert, Maria Rizzo
    Abstract
    Descriptive statistics for multivariate data, methods for transforming the data, and some useful graphics tools such as scatterplot matrices and correlograms are described in this chapter. Examples illustrate common tasks with multivariate data such as methods of transformation, centering or scaling data to equalize variances, and computing eigenvalues and eigenvectors of the sample covariance matrix. Some of the operations are also illustrated using the tidyverse dplyr functions. Principal Components Analysis (PCA) can be implemented using the prcomp() or princomp() functions, and the chapter explains and interprets the output with relevant graphics, screeplots and biplots. The chapter concludes with a look at hierarchical cluster analysis implemented with the hclust() function.
  14. Chapter 13. Simulation Experiments

    Jim Albert, Maria Rizzo
    Abstract
    Simulation is a versatile tool to investigate the probability distribution of an outcome of a random event or experiment. In this chapter we focus on R functions that simplify the design and implementation of common types of simulation studies. Two functions that are used throughout are the sample() function for drawing random samples with or without replacement, and the replicate() function to easily repeat blocks of code including the sampling. The examples include some famous probability problems such as the collector’s problem and the hat-check problem, and the streaky patterns found among hitters in a baseball game. Simulations are summarized with suitable plots in R graphics and ggplot graphics.
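
As a rough illustration of this kind of simulation study, the sketch below estimates the expected number of purchases in one common formulation of the collector's problem: cards are drawn uniformly at random, with replacement, until every distinct card has been seen. It is written in Python rather than R and is not taken from the book; the function names, the 10-card set, and the number of repetitions are assumptions. The list comprehension plays the role that the abstract assigns to replicate(), and the random draw plays the role of sample().

```python
import random
from statistics import mean


def purchases_to_complete(n_cards, rng):
    """Draw cards with replacement until all n_cards distinct cards are seen."""
    seen = set()
    purchases = 0
    while len(seen) < n_cards:
        seen.add(rng.randrange(n_cards))  # one uniformly random card
        purchases += 1
    return purchases


def simulate_collector(n_cards=10, n_reps=10000, seed=1):
    """Repeat the experiment n_reps times and return every outcome."""
    rng = random.Random(seed)
    return [purchases_to_complete(n_cards, rng) for _ in range(n_reps)]


results = simulate_collector()
estimate = mean(results)
# Known closed form for this formulation: n * (1 + 1/2 + ... + 1/n).
exact = 10 * sum(1 / k for k in range(1, 11))
print(f"simulated mean = {estimate:.2f}, exact mean = {exact:.2f}")
```

Comparing the simulated mean with the closed-form value is a quick check that the simulation is set up as intended.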
  15. Chapter 14. Bayesian Modeling

    Jim Albert, Maria Rizzo
    Abstract
    Frequentist and Bayes are two general approaches in the development of statistical methods. Frequentist methods covered in this book include the familiar t confidence intervals, linear regression models, and ANOVA for testing equality of means. This chapter introduces the Bayesian approach to statistical inference by use of several illustrative examples. In the Bayes approach, one performs inference by the use of subjective probability. A prior density represents one’s initial opinion on the location of the parameter. After data is observed, by Bayes’ rule, one’s updated opinion about the parameter is expressed by the posterior distribution. One performs inference by summarizing the posterior distribution. One checks the validity of the model and predicts future data by the use of the predictive distribution. The Metropolis-Hastings algorithm is introduced as a practical method to draw samples from the posterior distribution.
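
The prior-to-posterior update described here can be shown on a small grid. This is a hedged Python sketch, not the book's R code: it assumes a binomial proportion as the parameter, a uniform prior over nine grid values, and made-up data of 7 successes and 3 failures. The function name `posterior_on_grid` is hypothetical.

```python
def posterior_on_grid(prior, successes, failures):
    """Apply Bayes' rule on a discrete grid of values for a proportion p.

    prior maps each candidate value of p to its prior probability.
    """
    # Prior times binomial likelihood (the constant factor cancels out).
    unnormalized = {
        p: weight * p**successes * (1 - p) ** failures
        for p, weight in prior.items()
    }
    total = sum(unnormalized.values())
    return {p: value / total for p, value in unnormalized.items()}


# Assumed setup: uniform prior on p = 0.1, 0.2, ..., 0.9 and made-up data.
grid = [i / 10 for i in range(1, 10)]
prior = {p: 1 / len(grid) for p in grid}
posterior = posterior_on_grid(prior, successes=7, failures=3)

# Summaries of the posterior distribution.
posterior_mean = sum(p * weight for p, weight in posterior.items())
posterior_mode = max(posterior, key=posterior.get)
print(f"posterior mean = {posterior_mean:.3f}, mode = {posterior_mode}")
```

Inference then amounts to summarizing `posterior`, here through its mean and mode. A sampler such as Metropolis-Hastings serves the same purpose when the posterior cannot be enumerated on a grid.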
  16. Chapter 15. Monte Carlo Methods

    Jim Albert, Maria Rizzo
    Abstract
    The Monte Carlo method is a general algorithm for estimating a definite integral based on outcomes of a simulation experiment. One can apply this algorithm to estimate the expectation of a function of a random variable where a random sample is drawn from the associated probability distribution. There is an associated standard error of a Monte Carlo estimate which provides insight into the accuracy of this simulation-based calculation. One can use Monte Carlo to simulate the sampling distribution of a statistical estimate which is helpful in computing the bias and standard error of one estimate, or in comparing the mean absolute error of two estimators. The Monte Carlo method is useful in determining the probability of coverage of an interval procedure. A Markov chain Monte Carlo algorithm is a general method for simulating from an arbitrary probability distribution. Metropolis-Hastings and Gibbs sampling are introduced as general methods for simulating from distributions. A variety of plots in R graphics and ggplot graphics are used to illustrate the concepts.
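
A minimal sketch of the basic Monte Carlo estimate and its standard error follows. It is written in Python rather than R and is not the book's code. The integrand exp(-x^2) on the interval (0, 1), the sample size, and the function name `monte_carlo_integral` are assumptions chosen for illustration.

```python
import math
import random
from statistics import mean, stdev


def monte_carlo_integral(g, n=20000, seed=1):
    """Estimate the integral of g over (0, 1) as the mean of g(U), U ~ Uniform(0, 1).

    Returns the estimate and its Monte Carlo standard error.
    """
    rng = random.Random(seed)
    values = [g(rng.random()) for _ in range(n)]
    estimate = mean(values)
    # Standard error of a sample mean: sample standard deviation / sqrt(n).
    std_error = stdev(values) / math.sqrt(n)
    return estimate, std_error


estimate, std_error = monte_carlo_integral(lambda x: math.exp(-x * x))
# Closed form for this integrand, used only as a check: sqrt(pi)/2 * erf(1).
exact = math.sqrt(math.pi) / 2 * math.erf(1)
print(f"estimate = {estimate:.4f} (s.e. {std_error:.4f}), exact = {exact:.4f}")
```

The standard error shrinks at the rate 1/sqrt(n), so quadrupling the number of draws roughly halves it.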
  17. Backmatter

Title
R by Example
Authors
Jim Albert
Maria Rizzo
Copyright year
2024
Electronic ISBN
978-3-031-76074-7
Print ISBN
978-3-031-76073-0
DOI
https://doi.org/10.1007/978-3-031-76074-7
