2022 | Book

# Statistics for Data Scientists

## An Introduction to Probability, Statistics, and Data Analysis

Authors: Prof. Dr. Maurits Kaptein, Prof. Dr. Edwin van den Heuvel

Book Series : Undergraduate Topics in Computer Science

This book provides an undergraduate introduction to analysing data for data science, computer science, and quantitative social science students. It uniquely combines a hands-on approach to data analysis – supported by numerous real data examples and reusable R code – with a rigorous treatment of probability and statistical principles.

Where contemporary undergraduate textbooks in probability theory or statistics often miss applications and an introductory treatment of modern methods (bootstrapping, Bayes, etc.), and where applied data analysis books often miss a rigorous theoretical treatment, this book provides an accessible but thorough introduction to data analysis using statistical methods, combining the two viewpoints. The book further focuses on methods for dealing with large datasets and streaming data and hence provides a single-course introduction to statistical methods for data science.

##### Chapter 1. A First Look at Data
Abstract
For data scientists, the most important use of statistics will be in making sense of data. Therefore, in this first chapter we immediately start by examining, describing, and visualizing data. We will use a dataset called face-data.csv throughout this chapter; this dataset, as well as all the other datasets we use throughout this book, is described in more detail in the preface. The dataset can be downloaded at http://www.nth-iteration.com/statistics-for-data-scientist. In this first chapter we will discuss techniques that help visualize and describe available data. We will introduce R, a free and publicly available statistical software package that we will use to handle our calculations and graphics. You can download R at https://www.r-project.org.
Maurits Kaptein, Edwin van den Heuvel
##### Chapter 2. Sampling Plans and Estimates
Abstract
In the previous chapter we computed descriptive statistics for the dataset on faces. The results showed that the average rating was 58.37 and that men rated the faces higher than women on average. If we are only interested in the participants in the study and we are willing to believe that the results are fully deterministic, we could claim that the group of men rates higher than the group of women on average. However, if we believe that the ratings are not constant for one person for the same set of faces or if we would like to know whether our statements would also hold for a larger group of people (who did not participate in our experiment), we must understand what other results could have been observed in our study if we had conducted the experiment at another time with the same group of participants or with another group of participants. To be able to extend your conclusions beyond the observed data, which is called more technically statistical inference, you should wonder where the dataset came from, how participants were collected, and how the results were obtained. For example, if the women who participated in the study of rating faces all came from one small village in the Netherlands, while the men came from many different villages and cities in the Netherlands, you would probably agree that the comparison between the average ratings from men and women becomes less meaningful. In this situation the dataset is considered selective towards women in the small village. Selective means here that not all women from the villages and cities included in the study are represented by the women in the study, but only a specific subgroup of women have been included. To overcome these types of issues, we need to know about the concepts of population, sample, sampling procedures, and estimation of population characteristics, and also how these concepts are related to each other to be able to do proper statistical inference.
Maurits Kaptein, Edwin van den Heuvel
##### Chapter 3. Probability Theory
Abstract
Statistics is a science that is concerned with principles, methods, and techniques for collecting, processing, analyzing, presenting, and interpreting (numerical) data. Statistics can be divided roughly into descriptive statistics (Chap. 1) and inferential statistics (Chap. 2), as we have already suggested. Descriptive statistics summarizes and visualizes the observed data. It is usually not very difficult, but it forms an essential part of reporting (scientific) results. Inferential statistics tries to draw conclusions from the data that would hold true for part or the whole of the population from which the data is collected. The theory of probability, which is the topic of the next two theoretical chapters, makes it possible to connect the two disciplines of descriptive and inferential statistics. We have already encountered some ideas from probability theory in the previous chapter. To start with, we discussed the probability of selecting a specific sample $$\pi_k$$ and we briefly defined the notion of probability based on the throwing of a die. In this chapter we work out these ideas more formally and discuss the probabilities of events; we define probabilities and discuss how to calculate with probabilities. In the previous chapter, when discussing bias, we have also encountered the expected population parameter $$\mathbb{E}(T)$$, but we have not yet detailed what expectations are exactly; this is something we cover in Chap. 4.
Maurits Kaptein, Edwin van den Heuvel
##### Chapter 4. Random Variables and Distributions
Abstract
In the first chapter we discussed the calculation of some statistics that could be useful to summarize the observed data. In Chap. 2 we explained sampling approaches for the proper collection of data from populations. We demonstrated, using the appropriate statistics, how we may extend our conclusions beyond our sample to our population. Probability sampling required reasoning with probabilities, and we provided a more detailed description of this topic in Chap. 3. The topic of probability seems distant from the type of data that we looked at in the first chapter, but we did show how probability is related to measures of effect size for binary data. We will continue discussing real-world data in this chapter, but to do so we will need to make one more theoretical step. We will need to go from distinct events to dealing with more abstract random variables. This allows us to extend our theory on probability to other types of data without restricting it to specific events (i.e., binary data). Thus, this chapter will introduce random variables so that we can talk about continuous and discrete data. Random variables are directly related to the data that we collect from the population; a relationship we explore in depth. Subsequently we will discuss the distributions of random variables. Distributions relate probabilities to outcomes of random variables. We will discuss separately distributions for discrete random variables and for continuous random variables. In each case we will introduce several well-known distributions. In both cases we will also discuss properties of the random variables: we will explain their expected value, variance, and moments. These properties provide summaries of the population. They are closely related to the mean, variance, skewness, and kurtosis we discussed in Chaps. 1 and 2. However, we will only finish our circle—from data to theory to data—in the next chapter.
Maurits Kaptein, Edwin van den Heuvel
##### Chapter 5. Estimation
Abstract
The field of inferential statistics tries to use the information from a sample to make statements or decisions about the population of interest. It takes into account the uncertainty that the information is coming from sampling and does not perfectly represent the population, since another sample would give different outcomes. An important aspect of inferential statistics is estimation of the population parameters of interest. We have discussed the step from descriptions of a sample to those of a population already in Chap. 2; however, now that we have the theory of random variables at our disposal we can do much more than we did before. This is what we explore in this chapter. This chapter can be split up into two parts: in Sects. 5.2–5.4 we consider the distribution functions of sample statistics or estimators given assumptions regarding the distribution of the variables of interest in the population. Sample statistics themselves are random variables, and hence we can study their distribution functions, expectations, and higher moments. We first study the distributions of sample statistics in general, assuming that the variable of interest has some distribution in the population but without further specifying the shape of this distribution function. Next, we study the distributions of sample statistics when we assume the variable of interest to be either normally or log normally distributed in the population. We devote more attention to so-called normal populations because of their prominence in statistical theory. The second part of this chapter is Sect. 5.5, where we change our focus to estimation: in the subsections we discuss two different methods to obtain estimates $$\hat{\boldsymbol{\theta}}$$ of the parameters of a population distribution $$F_{\boldsymbol{\theta}}(x)$$ given sample data. The methods we discuss are the method of moments and the maximum likelihood method. In these sections, to provide a concrete example, we study the log normal distribution function, as this is one of the distribution functions for which the estimates originating from the two estimation methods differ.
Maurits Kaptein, Edwin van den Heuvel
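As a rough illustration of why the two estimation methods named in the Chapter 5 abstract can disagree for the log normal distribution, the sketch below writes out both sets of estimators. It assumes the common parametrization in which $$\ln X$$ is normally distributed with mean $$\mu$$ and variance $$\sigma^2$$, and an independent sample $$x_1, \ldots, x_n$$; the symbols $$m_1$$, $$m_2$$ and the subscripts MM and ML are introduced here for illustration, and the book's own notation may differ.

Under this assumption the first two moments are

$$\mathbb{E}(X) = e^{\mu + \sigma^2/2}, \qquad \mathbb{E}(X^2) = e^{2\mu + 2\sigma^2}.$$

Equating them to the sample moments $$m_1 = \frac{1}{n}\sum_{i=1}^{n} x_i$$ and $$m_2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2$$ and solving gives the method of moments estimates

$$\hat{\sigma}^2_{\text{MM}} = \ln m_2 - 2\ln m_1, \qquad \hat{\mu}_{\text{MM}} = 2\ln m_1 - \tfrac{1}{2}\ln m_2,$$

whereas maximizing the likelihood gives

$$\hat{\mu}_{\text{ML}} = \frac{1}{n}\sum_{i=1}^{n} \ln x_i, \qquad \hat{\sigma}^2_{\text{ML}} = \frac{1}{n}\sum_{i=1}^{n} \left(\ln x_i - \hat{\mu}_{\text{ML}}\right)^2.$$

The first pair is built from averages of the raw observations and the second from averages of their logarithms, so on a finite sample the two generally produce different values.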
##### Chapter 6. Multiple Random Variables
Abstract
Up to now we have mainly focussed on the analysis of a single variable. We have discussed probability density functions (PDFs), probability mass functions (PMFs), and distribution functions (CDFs) as descriptions of the population values for such a single variable and connected these functions to a single random variable. These probability functions were functions with a single argument $$x\in \mathbb{R}$$. For instance, the CDF $$F_{\theta}(x)$$ was defined for all $$x\in \mathbb{R}$$. In this chapter we will extend the concept of probability functions to multiple arguments, say $$(x, y)$$, that would represent multiple random variables.
Maurits Kaptein, Edwin van den Heuvel
##### Chapter 7. Making Decisions in Uncertainty
Abstract
Up to now we have covered ways of summarizing data, and we have paid a lot of attention to understanding how summaries computed on sample data (sample statistics) vary as a function of the sampling plan and the population characteristics. In Chap. 5 we also covered the idea that we can use our sample to estimate population parameters; in this case we are basically making a decision (our best guess) regarding the population parameter given the sample data. The distribution function of the estimators (which are themselves sample statistics) that we studied in Chap. 5 gives us some feel for the precision of our inferences. However, what if your estimate of the population parameter is 10, and someone asks whether you are sure that it is not 10.2; what would your answer then be? In this chapter we examine multiple approaches to answering this seemingly simple question.
Maurits Kaptein, Edwin van den Heuvel
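One possible way to approach the "is it 10 or 10.2?" question from the Chapter 7 abstract is a confidence interval for the mean. The sketch below is only an illustration under stated assumptions: the abstract does not say which approaches the chapter uses, the data are invented, the function name `mean_confidence_interval` is hypothetical, and Python is used here although the book's own examples are in R. It relies on a normal approximation, which is rough for a sample this small.

```python
import math
import statistics


def mean_confidence_interval(sample, level=0.95):
    """Return (mean, lower, upper) for the mean, using a normal approximation."""
    n = len(sample)
    mean = statistics.fmean(sample)
    # Standard error of the mean: sample standard deviation divided by sqrt(n)
    se = statistics.stdev(sample) / math.sqrt(n)
    # Two-sided critical value of the standard normal distribution
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    return mean, mean - z * se, mean + z * se


# Hypothetical measurements, invented for illustration only
sample = [9.6, 10.4, 9.9, 10.1, 10.3, 9.7, 10.0, 10.2, 9.8, 10.0]

mean, lower, upper = mean_confidence_interval(sample)
print(f"estimate = {mean:.2f}, 95% interval = [{lower:.2f}, {upper:.2f}]")
print("10.2 inside interval:", lower <= 10.2 <= upper)
```

If the value in question falls outside the interval, the sample gives some grounds to doubt it; if it falls inside, the data cannot distinguish it from the estimate at the chosen level. For small samples, an interval based on the t-distribution would be somewhat wider than this approximation.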
##### Chapter 8. Bayesian Statistics
Abstract
In this book we have introduced both the practice of analyzing data using R, and covered probability theory, including estimation and testing. For most of the text we have, however, considered what some would call Frequentist statistics (the name deriving from the notion of probability as a long-run frequency): in this school of thought regarding probability it is generally assumed that population values are fixed quantities (e.g., $$\theta$$ is, despite being unknown, theoretically knowable and has a fixed value). Any uncertainty (and hence our resort to probability theory) arises from our sampling procedure: because we use (ostensibly) random sampling we have access to only one of the many possible samples that we could have obtained from the population of interest, and when estimating $$\theta$$ (by, e.g., computing $$\hat{\theta}$$) we will need to consider the fact that another sample might have produced a different value. There is, however, another school of thought, called Bayesian statistics. Its name is derived from Bayes' theorem, as this theorem is used almost constantly in this latter stream of thought. In this chapter we introduce this second school of thought and discuss its relationship(s) to the materials we covered in previous chapters.
Maurits Kaptein, Edwin van den Heuvel
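For reference, the theorem that the Chapter 8 abstract names can be written for a parameter $$\theta$$ and observed data $$x$$ as below. The notation $$p(\cdot)$$ for densities is an assumption made here and may not match the book's own; the continuous form is shown, with the integral replaced by a sum when $$\theta$$ is discrete.

$$p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)}, \qquad p(x) = \int p(x \mid \theta)\, p(\theta)\, d\theta$$

Here $$p(\theta)$$ is the prior, $$p(x \mid \theta)$$ the likelihood, and $$p(\theta \mid x)$$ the posterior. In contrast to the frequentist view described in the abstract, $$\theta$$ is treated as uncertain and given a distribution of its own.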
##### 9. Correction to: Statistics for Data Scientists
In the original version of the book, the following belated corrections have been made
Maurits Kaptein, Edwin van den Heuvel