Open Access | 2023 | Original Paper | Book Chapter

Statistical Inference

Author: Craig Starbuck

Published in: The Fundamentals of People Analytics

Publisher: Springer International Publishing


Abstract

This chapter covers the fundamentals of statistical inference. Topics include discrete and continuous probability distributions, conditional probability, Central Limit Theorem (CLT), confidence intervals, hypothesis testing, multiple testing, and statistical power.
The objective of inferential statistics is to make inferences –with some degree of confidence– about a population based on available sample data. In people analytics, a population often refers to all employees—past, present, and future; therefore, inferential statistics are appropriate even when data are accessible for every current employee. Several related concepts are fundamental to this goal and will be covered here.

Introduction to Probability

Randomness and uncertainty exist all around us. In probability theory, random phenomena refer to events or experiments whose outcomes cannot be predicted with certainty (Pishro-Nik, 2014). If you have taken a course in probability, there is a good chance you have considered the case of a fair coin flip—one of the most intuitive applications of probability. In the absence of information on how the coin is flipped, we cannot be certain of the outcome. What we can be certain of is that with a large number of coin flips, the proportion of heads will become increasingly close to 50%, or \(\frac {1}{2}\).
The Law of Large Numbers (LLN) is an important theorem for building an intuitive understanding of how probability relates to the statistical inference concepts we will cover. In the case of a fair coin flip, it is possible to observe many consecutive heads by chance, because small samples are prone to anomalies. However, as the number of flips increases, the proportion of heads will settle toward its expected value; we expect a roughly equal number of heads and tails with a large enough number of flips.
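As a minimal sketch of the LLN in R (the number of flips and the seed are arbitrary choices), we can track the running proportion of heads across many simulated flips of a fair coin:

# Simulate many fair coin flips and track the running proportion of heads
set.seed(1234)
n_flips <- 10000
flips <- sample(c("heads", "tails"), size = n_flips, replace = TRUE)

# Cumulative proportion of heads after each flip
running_prop_heads <- cumsum(flips == "heads") / seq_len(n_flips)

# The proportion drifts toward 0.5 as the number of flips grows
running_prop_heads[c(10, 100, 1000, 10000)]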

Probability Distributions

Probability distributions are statistical functions that yield the probability of obtaining possible values for a random variable. Probabilities range from 0 to 1, where the probability of a definite event is 1 and the probability of an impossible event is 0. The empirical probability (or experimental probability) of an event is the fraction of times it occurred relative to the total number of repetitions. Since a probability distribution defines the likelihood of observing all possible outcomes of an event or experiment, the sum of all probabilities for all possible values must equal 1.
For example, let us look at how org tenure is distributed across employees. We can understand the general shape of the distribution using descriptive statistics:
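A call like the one below produces the output that follows; the employees data frame and org_tenure column are placeholder names rather than the book's actual objects:

# Placeholder object and column names for the employee data set (tenure in years)
summary(employees$org_tenure)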

                  
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   0.000   3.000   5.000   7.032   9.000  72.000
Comparing the mean of 7.03 to the median of 5 indicates that larger values are skewing the mean upward, which is further evidenced by the large gap between the third quartile (9) and the maximum (72).
Beyond descriptives, visuals are often helpful in understanding a variable’s distribution. As shown in Fig. 1, org tenure is positively skewed, and understanding the shape (or spread) of this distribution enables us to identify which values are most likely and to estimate the likelihood of different results.
You can likely imagine the shape of probability distributions for many common events. If we consider the probability of employees exiting an organization, the outcome is binary. That is, employees either leave or stay; there are no options between these extremes. However, the distribution of performance scores will likely look quite different. Most organizations have expected –or even forced– distributions in which an average rating is awarded most frequently and low and high performance ratings less frequently. This would start to look more like a bell curve as the number of performance levels increases.
Just as we grouped variables into discrete and continuous categories in chapter “Measurement and Sampling”, this is also how probability distributions are categorized. If you read chapter “Measurement and Sampling”, you likely already have some a priori expectations about the characteristics of discrete and continuous distributions.
The shape of a probability distribution is defined by parameters, which represent its essential properties (e.g., measures of central tendency and spread). These probability distributions underpin the many types of statistical tests covered in this book.

Discrete Probability Distributions

Discrete probability distributions, also known as Probability Mass Functions (PMF), can be leveraged to model different types of nominal and ordinal variables. Some common discrete distributions include:
  • Bernoulli: probability of success or failure for a single observation with two outcomes
  • Binomial: number of successes and failures in a sequence of independent observations with two outcomes (collection of Bernoulli trials)
  • Multinomial: generalization of the binomial distribution for observations with more than two outcomes
  • Negative Binomial (Pascal): number of failures observed before a fixed number of successes occurs (this distribution is positively skewed despite what the name might suggest)
  • Poisson: probability of a given number of events occurring over a specified period
  • Geometric: special case of the negative binomial distribution in which observations are repeated until the first success is observed (rather than a fixed number of successes)
Several functions are available in R to simulate PMFs. The precise shape of a distribution depends on the parameters, but we will simulate and visualize these common PMFs to illustrate differences in the general shape of each. First, let us simulate the distributions by drawing 1000 random values from each with a specified set of parameters:
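A sketch along these lines uses base R's random generation functions; the probabilities, sizes, and rates below are illustrative assumptions rather than the book's exact parameters:

set.seed(2023)
n <- 1000

# Bernoulli: a single trial with success probability 0.5 (binomial with size = 1)
bernoulli <- rbinom(n, size = 1, prob = 0.5)

# Binomial: number of successes in 10 independent trials
binomial <- rbinom(n, size = 10, prob = 0.5)

# Multinomial: counts across 3 outcome categories for 10 trials per draw (3 x n matrix)
multinomial <- rmultinom(n, size = 10, prob = c(0.2, 0.3, 0.5))

# Negative binomial: failures observed before 5 successes occur
neg_binomial <- rnbinom(n, size = 5, prob = 0.5)

# Poisson: count of events per period with an average rate (lambda) of 4
poisson <- rpois(n, lambda = 4)

# Geometric: failures observed before the first success
geometric <- rgeom(n, prob = 0.5)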
Next, we will visualize each distribution (Fig. 2).

Continuous Probability Distributions

Continuous probability distributions, also known as Probability Density Functions (PDF), can be leveraged to model different types of interval and ratio variables. Some common continuous distributions include:
  • Normal (Gaussian): distribution characterized by a mean and standard deviation for which the mean, median, and mode are equal
  • Uniform: values of a random variable with equal probabilities of occurring
  • Log-Normal: normal distribution of log-transformed values
  • Student’s t: similar to the normal distribution but with thicker tails (approaches normal as n increases)
  • Chi-Square: similar to the t distribution in that the shape approaches normal as n increases
  • F: developed to examine variances from random samples taken from two independent normal populations
The normal distribution is colloquially known as a bell curve. It is important to note that a normal distribution and standard normal distribution are not one-and-the-same. The standard normal distribution is a special case of the normal distribution which has no free parameters; its parameters are always μ = 0 and σ = 1. The parameters of a normal distribution are unspecified, and μ and σ can take on values other than 0 and 1, respectively.
A number of functions are available in R to simulate PDFs. While the precise shape of a distribution is parameter dependent, we will simulate and visualize these common PDFs to illustrate differences in the general shape of each. Let us first draw 1,000 random values from each distribution with a specified set of parameters:
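A sketch with illustrative parameter choices (again, assumptions rather than the book's exact values) is:

set.seed(2023)
n <- 1000

# Normal: mean 0, standard deviation 1
normal <- rnorm(n, mean = 0, sd = 1)

# Uniform: values between 0 and 100 are equally likely
uniform <- runif(n, min = 0, max = 100)

# Log-normal: exponentiated normally distributed values
log_normal <- rlnorm(n, meanlog = 0, sdlog = 1)

# Student's t: heavier tails than the normal; 5 degrees of freedom
student_t <- rt(n, df = 5)

# Chi-square: 5 degrees of freedom
chi_square <- rchisq(n, df = 5)

# F: ratio of variances with 5 and 50 degrees of freedom
f_dist <- rf(n, df1 = 5, df2 = 50)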
Next, we will visualize each distribution. Since these continuous distributions are probability density functions, we will superimpose density plots over each (Fig. 3).
The distribution of data is critically important in statistics. The accuracy of many statistical tests is based on assumptions rooted in underlying data distributions, and violating these assumptions can result in serious errors due to misaligned probability distributions. Though there are many more discrete and continuous probability distributions, we will leverage several of these common types to assess the likelihood of differences, effects, and associations in later chapters of this book.

Conditional Probability

Conditional probability is the probability of an event given that another event or outcome has occurred. For example, we may find that the proportion of heads is greater or less than \(\frac {1}{2}\) across a large number of fair coin flips if the coin is consistently positioned heads up when flipped. The outcome is, therefore, conditioned on the fixed –rather than random– positioning of the coin when flipped.
Formally, Bayes’ Theorem (alternatively, Bayes’ Rule) states that for any two events A and B wherein the probability of A is not 0 (P(A) ≠ 0):
$$\displaystyle \begin{aligned} P(A \vert B) = \frac{P(B \vert A) P(A)}{P(B)}, \end{aligned}$$
where:
  • A  =  an event
  • B  =  another event
  • P(A|B)  =  conditional probability that event A occurs, given event B occurs (posterior)
  • P(B|A)  =  conditional probability that event B occurs, given event A occurs (likelihood)
  • P(B)  =  normalizing constant, constraining the probability distribution to sum to 1 (evidence)
  • P(A)  =  probability event A occurs before knowing if event B occurs (prior)
Bayes’ Rule allows us to predict the outcome more accurately by conditioning the probability on known factors rather than assuming all events operate under the same conditions. Bayes’ Rule is pervasive in people analytics, as the probability of outcomes can vary widely when conditioned on a person’s age, tenure, education, job, perceptions, relationships, and many other factors. For example, if we consider a company with 100 terminations over a 12-month period and average headcount of 1,000, the probability of attrition not conditioned on any other factor is 10%, or \(\frac {1}{10}\). Aside from trending this probability over time to identify if overall attrition is becoming more or less of a concern, this is not too helpful at the company level. However, if we condition the probability of attrition on an event –such as a recent manager exit– and find that the probability of attrition among those whose manager has left in the last six months is 70%, or \(\frac {7}{10}\), this is far more actionable (and concerning).
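To make the arithmetic concrete, the sketch below applies Bayes’ Rule with hypothetical inputs chosen to be consistent with the figures above; the 35% and 5% values are illustrative assumptions, not data from the book:

# Hypothetical inputs (illustrative only)
p_exit <- 0.10                  # P(A): baseline probability an employee exits
p_mgr_exit_given_exit <- 0.35   # P(B|A): share of leavers whose manager left in the prior 6 months
p_mgr_exit <- 0.05              # P(B): share of all employees whose manager left in the prior 6 months

# Bayes' Rule: P(A|B) = P(B|A) * P(A) / P(B)
p_exit_given_mgr_exit <- p_mgr_exit_given_exit * p_exit / p_mgr_exit
p_exit_given_mgr_exit           # 0.7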
The Monty Hall Problem is an excellent example of how our intuition is often at odds with the laws of conditional probability.
In the classic game show, Let’s Make a Deal, Monty Hall asks contestants to choose one of three closed doors. Behind one door is a prize while the other two doors contain nothing. After the contestant selects a door, Monty opens one of the other two doors which does not contain a prize. At this point, there are two closed doors: the door the contestant selected and another for which the content remains unknown. All that is known at this point is that the prize is behind one of the two closed doors.
It is at this juncture that Monty introduces a twist by asking if the contestant would like to switch doors. Most assume that the two closed doors have an equal (50/50) chance of containing the prize, because we generally think of probabilities as independent, random events. However, this is incorrect. Contestants who switch from their original selection have a 66% chance (rather than 50%) of winning. This may be counterintuitive, because the brain wants to reduce the problem to a simple coin flip. There is a major difference between the Monty Hall problem and a coin flip; for two outcomes to have the same probability, randomness and independence are required. In the case of the Monty Hall problem, neither assumption is satisfied.
When all three doors are closed, each has the same probability of being selected. The probability of choosing the door with a prize is 0.33. Monty’s knowledge of the door containing the prize does not impact the probability of selecting the winning door. This is because the choice is completely random given we have no information that would increase the probability of a door containing the prize. The process is no longer random when Monty uses his insider knowledge about the prize’s location and opens a door he knows does not contain the prize. The probabilities change. Since Monty will never show the door containing the prize, he is careful to always open a door that has nothing behind it. If he were not constrained by the requirement to not reveal the prize’s location and instead chose to open one of the remaining doors at random, the probabilities would be equal (and he may end up opening the door that contains the prize).
Seeing is believing, so let us prove this with a simulation in R:
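A simulation sketch like the following (the function and object names are our own) plays the game many times under each strategy:

set.seed(2023)
n_games <- 10000
doors <- 1:3

play_game <- function(switch_door) {
  prize  <- sample(doors, 1)   # door hiding the prize
  choice <- sample(doors, 1)   # contestant's initial pick

  # Monty opens a door that is neither the contestant's pick nor the prize
  can_open <- setdiff(doors, c(prize, choice))
  opened   <- if (length(can_open) == 1) can_open else sample(can_open, 1)

  # If switching, take the one remaining closed door
  if (switch_door) {
    choice <- setdiff(doors, c(choice, opened))
  }

  choice == prize   # TRUE if the contestant wins
}

# Estimated win rates across many simulated games
mean(replicate(n_games, play_game(switch_door = TRUE)))    # ~2/3 when switching
mean(replicate(n_games, play_game(switch_door = FALSE)))   # ~1/3 when staying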
As we can see, contestants who switch doors win roughly two-thirds of the time, about twice as often as those who stay. This exercise hopefully demonstrates the importance of conditional probability and statistical assumptions like randomness. Also, if ever you find yourself playing Let’s Make a Deal, switch doors.

Central Limit Theorem

The Central Limit Theorem (CLT) is a mainstay of statistics and probability and fundamental to understanding the mechanics of statistical inference.
The CLT traces back to the work of the French-born mathematician Abraham de Moivre in the 1700s. It states that, given a sufficiently large sample size, the average of independent random variables tends to follow a normal (or Gaussian) distribution. The distribution of sample means approaches a normal distribution regardless of the shape of the population distribution from which the samples are drawn. This is important because the normal distribution has properties that can be used to test the likelihood that an observed value, difference, or relationship in a sample is also present in the population (Fig. 4).
Let us begin with an intuitive example of CLT. Imagine that we have a reliable way to measure how fun a population is on a 100-point scale, where 100 indicates maximum fun (life of the party) and 1 indicates maximum boringness. Consider that a small statistics conference is in progress at a nearby convention center, and there are 40 statisticians in attendance. In a separate room at the same convention center, there is also a group of 40 random people (non-statisticians) who are gathered to discuss some less interesting topic. Our job is to walk into one of the rooms and determine –based on the fun factor alone– whether we have entered the statistics conference or the other, less interesting gathering of non-statisticians.
Instinctively, we already know the statisticians will be more fun than the other group. However, let us assume we need the mean fun score and standard deviation of these two groups for this example. The group of statisticians has, on average, a fun score of 85 with a standard deviation of 2, while the group of non-statisticians is a bit less fun with a mean score of 65 and a standard deviation of 4. With a known population mean and standard deviation, the standard error (SE) –the standard deviation of sample means– provides the ability to calculate the probability that the sample (the room of 40 people) belongs to the population of interest (fellow statisticians).
The SE is defined by:
$$\displaystyle \begin{aligned} SE = \frac{\sigma}{\sqrt{n}} \end{aligned}$$
Herein lies the beauty of the CLT: roughly 68% of sample means will lie within one standard error of the population mean, roughly 95% within two standard errors, and roughly 99% within three standard errors. Therefore, any room whose members have an average fun score that is not within two standard errors of the population mean (between 84.37 and 85.63 for our statisticians) is statistically unlikely to be the group of statisticians for which we are searching. This is because in fewer than 5 in 100 cases would we randomly draw a reasonably sized sample of statisticians with an average fun score so different from the population average.
Because small samples lend to anomalies, we could –by chance– select a single person who happens to fall in the tails (extremely boring or extremely fun); however, as the sample size increases, it becomes more and more likely that the observed average reflects the average of the larger population. It would be virtually impossible (in less than 1 in 100 cases) to draw a random sample of statisticians from the population with average funness that is not within three standard errors of the population mean (between 84.05 and 85.95). Therefore, if we find that the room of people have an average fun score of 75, we will likely have far more fun in the other room!
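The intervals quoted above can be reproduced with a few lines of arithmetic in R:

mu    <- 85    # population mean fun score for statisticians
sigma <- 2     # population standard deviation
n     <- 40    # number of people in the room

se <- sigma / sqrt(n)   # standard error of the mean

# Roughly 95% of sample means fall within 2 SEs of the population mean
c(mu - 2 * se, mu + 2 * se)   # 84.37 to 85.63

# Roughly 99% fall within 3 SEs
c(mu - 3 * se, mu + 3 * se)   # 84.05 to 85.95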
Let us now see the CLT in action by simulating a random uniform population distribution from which we can draw random samples. Remember, the shape of the population distribution does not matter; we could simulate an Exponential, Gamma, Poisson, Binomial, or other distribution and observe the same behavior (Fig. 5).
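A sketch of this simulation is below; the population size and range are assumptions, so the exact values will differ from those reported in the book:

set.seed(2023)

# Simulate a uniform population (size and range are illustrative)
population <- runif(100000, min = 0, max = 100)

# A histogram confirms the roughly flat (uniform) shape
hist(population, breaks = 50, main = "Simulated Uniform Population", xlab = "Value")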
As expected, these randomly generated data are uniformly distributed. Next, we will draw 100 random samples of various sizes and plot the average of each.
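Continuing with the population vector simulated above, one way to draw 100 samples at several sizes and plot the distribution of their means is:

set.seed(2023)
sample_sizes <- c(5, 30, 100, 500)

# For each sample size, draw 100 random samples and record each sample's mean
sample_means <- lapply(sample_sizes, function(n) {
  replicate(100, mean(sample(population, size = n)))
})
names(sample_means) <- paste0("n = ", sample_sizes)

# Histograms of the sample means look increasingly normal as n grows
par(mfrow = c(2, 2))
for (i in seq_along(sample_means)) {
  hist(sample_means[[i]], breaks = 20, main = names(sample_means)[i], xlab = "Sample mean")
}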
Per the CLT, we can see that as n increases, the sample means become more normally distributed (Fig. 6).

Confidence Intervals

A Confidence Interval (CI) is a range of values that likely contains the value of an unknown population parameter. These unknown population parameters are often μ or σ, though we will also leverage CIs in later chapters for regression coefficients, proportions, rates, and differences.
If we draw random samples from a population, we can compute a CI for each sample. Building on the CLT, for a given confidence level (usually 95%, though 99% or 90% are sometimes used), the specified percent of sample intervals is expected to include the estimated population parameter. For example, for a 95% CI we would expect 19 in every 20 (or 95 in every 100) intervals across the samples to include the true population parameter. This is illustrated in Fig. 7.
It is important to note that CIs should not be applied to the distribution of sample values; CIs relate to population parameters. A common misinterpretation of a CI is that it represents an interval within which a certain percent of sample values exists. Because this misinterpretation is so prevalent, there is a good chance you will be tested on your understanding of CIs when applying to positions involving statistical analyses!
The standard error is fundamental to estimating CIs. While the standard deviation is a measure of variability for a random variable, the variability captured by the SE reflects how well a sample represents the population. Since sample statistics will approach the actual population parameters as the size of the sample increases, the SE and sample size are inversely related; that is, the SE decreases as the sample size increases.
Since the CLT is fundamental to inferential statistics, let us validate that our simulated distribution of sample means adheres to the properties of normally distributed data per the Empirical Rule:
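Continuing with the simulated population, the sketch below checks what share of sample means fall within two and three standard errors of μ:

set.seed(2023)

# Means of 1,000 random samples of n = 100
sample_means <- replicate(1000, mean(sample(population, size = 100)))

mu <- mean(population)
se <- sd(population) / sqrt(100)   # standard error for samples of n = 100

# Share of sample means within 2 and 3 SEs of the population mean
mean(abs(sample_means - mu) <= 2 * se)   # ~0.95
mean(abs(sample_means - mu) <= 3 * se)   # ~0.997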
95% of sample means are within 2 SEs, which is what we expect per the characteristics of the normal distribution.
Nearly all of the sample means are within 3 SEs, indicating that it would be highly unlikely –nearly impossible even– to observe a sample mean from the same population that falls outside this interval.
Now, let us illustrate the relationship between CIs and standard errors using sample data from our uniform population distribution. In our example, both μ and σ are known and our sample size n is at least 30; therefore, we can use a Z-test to calculate the 95% CI. A z score of 1.96 corresponds to the 95% CI for a two-tailed test; that is, we are looking for significantly different values in either the larger or smaller direction. The 95% CI represents the range of values we would expect to include μ in roughly 95 of 100 random samples taken from the population.
The CI in this case is defined by:
$$\displaystyle \begin{aligned} CI = \bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \end{aligned}$$
Let us randomly take n = 100 from the population and compute sample statistics to estimate the 95% CI:
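Continuing with the simulated population, a sketch of this calculation follows; because the sample is random, the resulting interval will not exactly match the values quoted below:

set.seed(2023)
n <- 100
samp <- sample(population, size = n)

x_bar <- mean(samp)          # sample mean
sigma <- sd(population)      # known population standard deviation
se    <- sigma / sqrt(n)

z <- 1.96                    # z score for a two-tailed 95% CI

c(lower = x_bar - z * se, upper = x_bar + z * se)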
Our known μ is 51.2, which is covered by our 95% CI (47.9–59.0). Per the CLT, in less than 5% of cases would we expect to draw a random sample from the population that results in a 95% CI which does not include μ. Note that the CI narrows with larger samples because the standard error shrinks, making our estimate of μ more precise.
Next, let us look at a 99% CI. We will enter 2.576 for z:
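Only the z score changes relative to the previous sketch:

z <- 2.576                   # z score for a two-tailed 99% CI

c(lower = x_bar - z * se, upper = x_bar + z * se)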
Like the 95% CI, this slightly wider 99% CI (46.2–60.7) also includes our μ of 51.2.
If σ is not known, and/or we have a small sample (n < 30), we need to use a t-test to calculate the CIs. In a people analytics setting, the reality is that population parameters are often unknown. For example, if we knew how engagement scores vary in the employee population, there would be no need to survey a sample of employees and make inferences about said population.
As we will see, the t-test underpins many statistical tests and models germane to the people analytics discipline since we are often working with small data sets, so it is important to understand the mechanics. As shown in Fig. 8, the t distribution is increasingly wider and shorter relative to the normal distribution as the sample size decreases; this is also characteristic of the sampling distribution of means for smaller samples we observed in our CLT example. Specifically, degrees of freedom (df) is used to determine the shape of the probability distribution. Degrees of freedom represents the number of observations in the data that are free to vary when estimating statistical parameters, which is a function of the sample size (n − 1). For example, if we could choose 1 of 5 projects to work on each day between Monday and Friday, we would only have a genuine choice on 4 of the 5 days; by Friday, only 1 project would remain, so our degrees of freedom (the number of days on which we have a choice between projects) would be 4.
When estimating the CI for smaller samples, we need to leverage the wider, heavier-tailed t distribution to achieve greater accuracy. Therefore, the CI for a two-tailed test in this case is defined by:
$$\displaystyle \begin{aligned} CI = \bar{x} \pm t_{\alpha/2} \frac{s}{\sqrt{n}}, \end{aligned}$$
where s is the sample standard deviation.
Let us compare CIs calculated using a t-test to those calculated using the Z-test. While a fixed z score can be used for each CI level when n > 30, the t statistic varies based on both the CI level and df. Though R will determine the correct t statistic for us, let us reference the table shown in Fig. 9 to manually look up the t statistic.
For illustrative purposes, let us draw a smaller sample of n = 25 from our uniform population distribution and calculate the 95% CI using the t statistic from the table (df = 24). The t statistic for this CI and df is 2.064:
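Continuing with the simulated population, a sketch using the sample standard deviation and the tabled t statistic is below (the resulting interval will vary with the random draw):

set.seed(2023)
n <- 25
samp <- sample(population, size = n)

x_bar <- mean(samp)
s     <- sd(samp)            # sample standard deviation (σ treated as unknown)
se    <- s / sqrt(n)

t_95 <- 2.064                # t statistic for a 95% CI with df = 24 (qt(0.975, df = 24))

c(lower = x_bar - t_95 * se, upper = x_bar + t_95 * se)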
As expected, the 95% CI using the t statistic is much wider (35.2–59.6), acknowledging the increased uncertainty in estimating population parameters given the limited information in this smaller sample. To increase our confidence to the 99% level, the interval widens even further (30.9–63.9):
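Only the t statistic changes relative to the previous sketch:

t_99 <- qt(0.995, df = 24)   # ~2.797 for a 99% CI with df = 24

c(lower = x_bar - t_99 * se, upper = x_bar + t_99 * se)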

Hypothesis Testing

Hypothesis testing is how we leverage CIs to test whether a significant difference or relationship exists in the data. Sir Ronald Fisher invented what is known as the null hypothesis, which states that there is no relationship/difference; disprove me if you can! The null hypothesis is defined by:
$$\displaystyle \begin{aligned} H_0: \mu_A = \mu_B \end{aligned}$$
The objective of hypothesis testing is to determine if there is sufficient evidence to reject the null hypothesis in favor of an alternative hypothesis. The null hypothesis always states that there is nothing of significance. For example, if we want to test whether an intervention has an effect on an outcome in a population, the null hypothesis states that there is no effect. If we want to test whether there is a difference in average scores between two groups in a population, the null hypothesis states that there is no difference.
An alternative hypothesis may simply state that there is a difference or relationship in the population, or it may specify the expected direction (e.g., Population A has a significantly larger or smaller average value than Population B; Variable A is positively or negatively related to Variable B). Therefore, alternative hypotheses are defined by:
$$\displaystyle \begin{aligned} H_A: \mu_A \neq \mu_B \end{aligned}$$
$$\displaystyle \begin{aligned} H_A: \mu_A < \mu_B \end{aligned}$$
$$\displaystyle \begin{aligned} H_A: \mu_A > \mu_B \end{aligned}$$

Alpha

The alpha level of a hypothesis test, denoted by α, is the threshold we set for how unlikely an observed result must be under the null hypothesis before we reject that hypothesis. In other words, α is the probability of rejecting the null hypothesis (and therefore claiming that there is a significant difference or relationship) when the null hypothesis is in fact true.
α is often set at 0.05 but is sometimes set at a more rigorous 0.01, depending upon the context and tolerance for error. An α of 0.05 corresponds to a 95% CI (1 − 0.05 = 0.95), and 0.01 to a 99% CI (1 − 0.01 = 0.99). With non-directional alternative hypotheses, we must divide α by 2 (i.e., we could observe a significant result in either tail of the distribution), while one-tailed tests position the rejection region entirely within one tail based on what is being hypothesized.
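As a quick check on these cutoffs, the corresponding critical z values can be retrieved with base R's qnorm():

# Two-tailed critical z values for alpha = 0.05 and alpha = 0.01
qnorm(1 - 0.05 / 2)   # 1.96  (95% confidence)
qnorm(1 - 0.01 / 2)   # 2.576 (99% confidence)

# One-tailed test at alpha = 0.05: the entire rejection region sits in one tail
qnorm(1 - 0.05)       # 1.645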
At the 0.05 level, we would conclude that a finding is statistically significant if the chance of observing a value at least as extreme as the one observed is less than 1 in 20 if the null hypothesis is true. Recall that we observed this behavior with our simulated distribution of sample means. While we could observe more extreme values by chance with repeated attempts, in less than 1 in every 20 times would we expect a 95% CI that does not capture μ. Moreover, in less than 1 in every 100 times should we expect a sample with a 99% CI that does not capture μ.

Type I & II Errors

A Type I error is a false positive, wherein we conclude that there is a significant difference or relationship when there is not. A Type II error is a false negative, wherein we fail to capture a significant finding. α represents our chance of making a Type I error, while β represents our chance of making a Type II error. I once had a professor explain that committing a Type I error is a shame, while committing a Type II error is a pity, and I have found this to be a helpful way to remember what each type of error represents (Fig. 10).

p-Values

In statistical tests, the p-value is referenced to determine whether the null hypothesis can be rejected. We generally rely on the availability of a theoretical null distribution to obtain a p-value associated with a particular test statistic. The p-value represents the probability of obtaining a result at least as extreme as the one observed if the null hypothesis is true. As a general rule, if p < 0.05, we can confidently reject the null hypothesis and conclude that the observed difference or relationship was unlikely a chance observation.
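As an illustrative sketch (the test statistics below are arbitrary values, not results from the book), a two-tailed p-value can be computed directly from a test statistic and its null distribution:

# Two-tailed p-value for an observed z statistic under a standard normal null
z <- 2.1
2 * pnorm(-abs(z))              # ~0.036

# Equivalent idea for a t statistic with 24 degrees of freedom
t_stat <- 2.1
2 * pt(-abs(t_stat), df = 24)   # ~0.047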
While statistical significance helps us understand the probability of observing results by chance when there is no difference or effect in the population, it does not tell us anything about the size of the difference or effect. Analyses should never be reduced to inspecting p-values; in fact, p-values have been the subject of much controversy among researchers and practitioners in recent years. Later chapters will cover how to interpret results of statistical tests to surface the story and determine if there is anything “practically” significant among statistically significant findings.

Bonferroni Correction

One caveat when leveraging a p-value to determine statistical significance is that when multiple testing is performed –that is, multiple tests using the same sample data– the probability of a Type I error increases roughly in proportion to the number of tests performed. Though there is not agreement among statisticians about how (or even whether) the p-value threshold for statistical significance needs to be adjusted to account for this increased risk, we will cover a conservative approach known as the Bonferroni Correction to mitigate this risk.
Thus far, we have only discussed statistical significance in the context of a per analysis error rate—that is, the probability of committing a Type I error for a single statistical test. However, when two or more tests are being conducted on the same sample, the familywise error rate is an important factor in determining statistical significance. The familywise error rate reflects the fact that as we conduct more and more analyses on the same sample, the probability of a Type I error across the set (or family) of analyses increases. The familywise error rate can be calculated by:
$$\displaystyle \begin{aligned} \alpha_{FW} = 1 - (1 - \alpha_{PC})^C \end{aligned}$$
where C is the number of comparisons (or statistical tests) performed, and αPC is the specified per analysis error rate (usually 0.05). For example, if α = 0.05 per analysis, the probability of a Type I error with three tests on the same data increases from 5% to 14.3%: 1 − (1 − 0.05)^3 = 0.143.
The most common method of adjusting the familywise error rate down to the specified per analysis error rate is the Bonferroni Correction. To implement this correction, we can simply divide α by the number of analyses performed on the data set—such as α∕3 = 0.017 in the case of three analyses with α = 0.05. This means that for each statistical test, we must achieve p < 0.017 to report a statistically significant result. An equivalent approach is to multiply each unadjusted p-value by the number of tests and compare the result to the original α. For example, if we run three statistical tests and obtain p = 0.014, p = 0.047, and p = 0.125, we would report one significant result under the first method (only 0.014 < 0.017) and likewise under the alternative, since only the first adjusted p-value remains below α: 0.014 ∗ 3 = 0.042 < 0.05.
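Both calculations are straightforward in R; p.adjust() from base R implements the second approach:

alpha <- 0.05
c_tests <- 3

# Familywise error rate for three tests at alpha = 0.05
1 - (1 - alpha)^c_tests        # 0.142625 (~14.3%)

# Bonferroni-adjusted significance threshold
alpha / c_tests                # ~0.017

# Equivalent approach: adjust the p-values themselves
p_values <- c(0.014, 0.047, 0.125)
p.adjust(p_values, method = "bonferroni")   # 0.042 0.141 0.375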
Perneger (1998) is one of many who oppose the use of the Bonferroni Correction, suggesting that these “adjustments are, at best, unnecessary and, at worst, deleterious to sound statistical inference.” The Bonferroni Correction is controversial among researchers because while applying the correction reduces the chance of a Type I error, it also increases the chance of a Type II error. Because this correction makes it more difficult to detect significant results, it is rare to find such a correction reported in published research—though research often involves multiple testing on the same sample. Perneger suggests that simply describing the statistical tests that were performed, and why, is sufficient for dealing with potential problems introduced by multiple testing.

Statistical Power

Whereas α is the probability of a Type I error, beta (β) is its counterpart: the probability of failing to reject H0 when it is false (a Type II error).
β is related to the power of the analysis, which is calculated as 1 − β and reflects our ability to detect a difference or relationship if one exists. If a study has 80% power, for example, it has an 80% chance of detecting an effect if one actually exists in the population. Power analysis helps with defining the optimal n-count for detecting a population effect in sample data (i.e., correctly rejecting a false H0). Increasing the power of a statistical test decreases the probability that we will fail to detect a significant effect present in the population.
At this point, it should be intuitive that larger samples increase our chances of detecting significant results when they exist. As we observed in the t-test example, CIs for small samples (n < 30) are quite wide relative to those for large samples; therefore, the power of the analysis to detect significance is limited given how different the sample means must be to observe non-overlapping CIs.
Before diving into the mechanics of power analysis, it is important to understand the three important –and interrelated– considerations in hypothesis testing that influence whether observed effects reflect real population effects or random sampling error:
  • Effect size: Larger differences and stronger relationships are less likely random sampling error
  • Sample size: Larger samples can detect smaller differences and weaker relationships (though they may be too small or weak to be meaningful)
  • Variability: Greater variability in the data makes it more likely that observed differences are attributable to random sampling error
Power analysis may be thought of as an optimization problem. The goal is to achieve a large enough sample size to detect meaningful effects –but not wastefully large as data collection can be expensive– whilst protecting against an underpowered analysis with a low probability of detecting an important effect (Type II error).
To estimate the sample size needed to achieve a given power level, one must use domain expertise to specify parameters. The effect size parameter varies based on the statistical test but when in doubt, we can use Cohen’s (1988) conventional effect sizes which are defined in Fig. 11. The effect size for a particular test can also be retrieved using the cohen.ES() function. For example, the following command returns 0.25 as the medium effect size threshold for ANOVA: cohen.ES(test = "anov", size = "medium").
Let us illustrate by calculating the sample size required for a one-way ANOVA that involves four groups. We will set α = 0.05 and specify an 80% chance (power =  0.8) of detecting a moderate population effect. We can leverage the pwr library in R to perform power analysis. Executing ?pwr will provide package documentation that clarifies what function to execute to calculate the sample size requirement for various statistical tests.
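A call along the following lines (using the pwr package) produces the output below:

# install.packages("pwr")   # if the package is not already installed
library(pwr)

# Sample size per group for a balanced one-way ANOVA with k = 4 groups,
# a medium effect size (f = 0.25), alpha = 0.05, and 80% power
pwr.anova.test(k = 4, f = 0.25, sig.level = 0.05, power = 0.8)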

                  
##
##      Balanced one-way analysis of variance power calculation
##
##               k = 4
##               n = 44.59927
##               f = 0.25
##       sig.level = 0.05
##           power = 0.8
##
## NOTE: n is number in each group
                  
                
The power analysis for this one-way ANOVA shows that we need a minimum of n = 45 within each of the four groups to achieve an 80% chance of detecting a medium population effect across the four groups when setting α = 0.05.

Review Questions

1.
What are some examples of a null hypothesis?
 
2.
What is the difference between Type I and Type II errors?
 
3.
What is the primary purpose of inferential statistics, and how does it differ from descriptive statistics?
 
4.
What is the Central Limit Theorem (CLT), and why is it important?
 
5.
Is randomness a requirement for probabilistic methods? Why or why not?
 
6.
What does the Bonferroni Correction seek to achieve?
 
7.
What is a confidence interval (CI)?
 
8.
What are some examples of how the context influences what level of confidence is appropriate for statistical significance testing? Should we always use a 95% CI?
 
9.
When population parameters are unknown, which test would be appropriate for testing the following null hypothesis: μA = μB?
 
10.
According to the Empirical Rule, 95% of normally distributed data lie within how many standard deviations of the mean?
 
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
References
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.
Perneger, T. V. (1998). What’s wrong with Bonferroni adjustments. British Medical Journal, 316(7139), 1236–1238.
Pishro-Nik, H. (2014). Introduction to probability: Statistics and random processes. Blue Bell, PA: Kappa Research, LLC.
DOI: https://doi.org/10.1007/978-3-031-28674-2_8