2022 | Book

# Introduction to Statistics and Data Analysis

## With Exercises, Solutions and Applications in R

Authors: Christian Heumann, Michael Schomaker, Shalabh

Publisher: Springer International Publishing

Now in its second edition, this introductory statistics textbook conveys the essential concepts and tools needed to develop and nurture statistical thinking. It presents descriptive, inductive and explorative statistical methods and guides the reader through the process of quantitative data analysis. This revised and extended edition features new chapters on logistic regression, simple random sampling, including bootstrapping, and causal inference.

The text is primarily intended for undergraduate students in disciplines such as business administration, the social sciences, medicine, politics, and macroeconomics. It features a wealth of examples, exercises and solutions with computer code in the statistical programming language R, as well as supplementary material that will enable the reader to quickly adapt the methods to their own applications.

Abstract

Statistics is a collection of methods which help us to describe, summarize, interpret, and analyse data. Drawing conclusions from data is vital in research, administration, and business. Researchers are interested in understanding whether a medical intervention helps in reducing the burden of a disease, how personality relates to decision making, whether a new fertilizer increases the yield of crops, how a political system affects trade policy, who is going to vote for a political party in the next election, how the population of a fish species changes in the long term, and many more questions.

Abstract

In Chapter 1, we highlighted that different variables contain different levels of information. When summarizing or visualizing one or more variable(s), it is this information which determines the appropriate statistical methods to use.

Abstract

A data set may contain many variables and observations. However, we are not always interested in each of the measured values but rather in a few summary measures of the data. Statistical functions fulfill the purpose of summarizing the data in a meaningful yet concise way.

Abstract

In Chaps. 2 and 3 we discussed how to analyse a single variable using graphs and summary statistics. However, in many situations we may be interested in the interdependence of two or more variables. For example, suppose we want to know whether male and female students in a college have any preference between the subjects mathematics and biology, i.e. if there is any evidence that male students prefer mathematics over biology and female students prefer biology over mathematics or vice versa. Suppose we choose an equal number of male and female students and ask them about their preferred subject. If there is no association between the two variables “gender of student” (male or female) and “subject” (mathematics or biology), then we expect the proportion of students choosing mathematics, and likewise the proportion choosing biology, to be the same among male and female students. Any difference in the proportions may indicate a preference of males or females for a particular subject. Similarly, in another example, we may want to find out whether female employees of an organization are paid less than male employees or vice versa.
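
The comparison of proportions described in this abstract can be sketched in a few lines. This is only an illustrative sketch: the counts are hypothetical and invented for this example, the function name `preference_shares` is an assumption, and Python is used here although the book's own examples are written in R.

```python
# Hypothetical counts, invented purely for illustration (not taken from the book):
# 50 male and 50 female students, each naming a preferred subject.
counts = {
    "male": {"mathematics": 30, "biology": 20},
    "female": {"mathematics": 20, "biology": 30},
}


def preference_shares(table):
    """Return, for each group, the share of students preferring each subject."""
    shares = {}
    for group, row in table.items():
        total = sum(row.values())
        shares[group] = {subject: n / total for subject, n in row.items()}
    return shares


shares = preference_shares(counts)
for group, row in shares.items():
    print(group, row)

# Difference between the groups in the share preferring mathematics.
# Under no association, this difference would be expected to be close to 0.
diff_math = shares["male"]["mathematics"] - shares["female"]["mathematics"]
print("difference in share preferring mathematics:", diff_math)
```

A difference far from zero only hints at an association; judging whether it could have arisen by chance requires the inferential methods covered in the later chapters.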

Abstract

Combinatorics is a special branch of mathematics. It has applications not only in enumerative combinatorics (the classical application) but also in other fields, for example graph theory and optimization.

Abstract

Let us first consider some simple examples to understand the need for probability theory. Often one needs to make a decision whether to carry an umbrella or not when leaving the house; a company might wonder whether to introduce a new advertisement to possibly increase sales or to continue with their current advertisement; or someone may want to choose a restaurant based on where they can get their favourite dish. In all these situations, randomness is involved. For example, the decision of whether to carry an umbrella or not is based on the possibility or chance of rain. The sales of the company may increase, decrease or remain unchanged with a new advertisement. The investment in a new advertising campaign may therefore only be useful if the probability of its success is higher than that of the current advertisement. Similarly, one may choose the restaurant where one is most confident of getting the food of one’s choice. In all such cases, an event may or may not happen, and actions are taken depending on its likelihood. The purpose of this chapter is to learn how to calculate such likelihoods of events happening and not happening.

Abstract

In the first part of the book we highlighted how to describe data. Now we discuss the concepts required to draw statistical conclusions from a sample of data about a population of interest. For example, suppose we know the starting salary of a sample of 100 students graduating in law. We can use this knowledge to draw conclusions about the expected salary for the population of all students graduating in law. Similarly, if a newly developed drug is given to a sample of selected tuberculosis patients, then some patients may show improvement and some patients may not; but we are interested in the consequences for the entire population of patients.

Abstract

We introduced the concept of probability density and probability mass functions of random variables in the previous chapter. In this chapter, we introduce some common standard discrete and continuous probability distributions which are widely used either for practical applications or for constructing statistical methods described later in this book. Suppose we are interested in determining the probability of a certain event. The determination of probabilities depends upon the nature of the study and various prevailing conditions which affect it. For example, the determination of the probability of a head when tossing a coin is different from the determination of the probability of rain in the afternoon. One can speculate that some mathematical functions can be defined which depict the behaviour of probabilities under different situations. Such functions have special properties and describe how probabilities are distributed under different conditions.

Abstract

The first four chapters of this book illustrated how one can summarize a data set both numerically and graphically. Interpretations made from such a descriptive analysis are valid only for the data set under consideration and cannot necessarily be generalized to other data. However, it is desirable to make conclusions about the entire population of interest and not only about the sample data. In this chapter, we describe the framework of statistical inference, which allows us to draw conclusions about the population of interest from the sample data, given a pre-specified uncertainty level and knowledge about the random process generating the data.

Abstract

We introduced point and interval estimation of parameters in the previous chapter. Sometimes, the research question is less ambitious in the sense that we are not interested in precise estimates of a parameter but only want to examine whether a statement about a parameter of interest, or the research hypothesis, is true or not (although we will see later in this chapter that there is a connection between confidence intervals and statistical tests, called duality).

Abstract

We learnt about various measures of association in Chap. 4. Such measures are used to understand the degree of relationship or association between two variables.

Abstract

In Chap. 11, we introduced the linear regression model to describe the association between a metric response variable y and a metric covariate X, assuming a linear relationship between X and y at first. Note that a random variable is typically denoted in uppercase, say Y, and its realization is denoted in lowercase, say y.

Abstract

The statistical inferences drawn about a population of interest are dependent on the actual sample of data. An underlying assumption in many statistical analyses is that this sample is a good representation of the population. By representative we mean that the sample contains all the salient features and characteristics present in the population.

Abstract

In the previous chapters, we have learnt about methods to describe the observed data (Part I) and how to make probabilistic statements about the characteristics of the underlying populations generating the data (Parts II and III). We introduced concepts of association, correlation, dependence and independence, conditional and marginal probabilities, likelihood, odds ratios and other notions to understand the joint behaviour of variables.