Open Access | 2023 | Original Paper | Book Chapter

Analysis of Differences

Author: Craig Starbuck

Published in: The Fundamentals of People Analytics

Publisher: Springer International Publishing


Abstract

This chapter examines parametric tests and nonparametric alternatives for testing whether statistical differences are observed in data measured on discrete and continuous scales. Methods of quantifying the magnitude of observed differences are also reviewed.
There are many statistical tests that can be used to test for differences within or between two or more groups. This chapter will cover common contexts for differences in people analytics and the tests applicable to each.

Parametric vs. Nonparametric Tests

In the context of data measured on a continuous scale (quantitative), we will cover parametric tests along with their nonparametric counterparts. When the hypothesis relates to average (mean) differences and n is large, parametric tests are preferred as they generally have more statistical power. Nonparametric tests are distribution-free tests that do not require the population’s distribution to be characterized by certain parameters, such as a normal distribution defined by a mean and standard deviation. Nonparametric tests are great for qualitative data since the distribution of non-numeric data cannot be characterized by parameters.
Beyond ensuring the data were generated from a random and representative process as discussed in chapter “Measurement and Sampling”, as well as following the data screening procedures outlined in chapter “Data Preparation” (e.g., addressing concerning outliers), parametric tests of differences generally feature three key assumptions:
1. Independence: Observations within each group are independent of each other.

2. Homogeneity of Variance: Variances of populations from which samples were drawn are equal.

3. Normality: Residuals must be normally distributed (with mean of 0) within each group.
While homogeneity of variance assumes the variances across multiple groups are equal, parametric tests are generally robust to violations of this assumption when sample sizes are large. Also, you may recall that μ and σ are sufficient to characterize a population distribution when data are situated symmetrically around the mean. However, the mean is sensitive to outliers, so if outliers are present in the data, the median may be a better way of representing the data's center (i.e., nonparametric tests); just remember that the use of nonparametric tests requires hypotheses to be modified to reference median (rather than mean) centers.
You may be wondering whether the magical elixir that is the CLT, which we covered in chapter "Statistical Inference", influences our ability to utilize parametric tests with respect to the assumption of normality. It is important to remember that the normal distribution properties under the CLT relate to the sampling distribution of means—not to the distribution of the population or to the data for an individual sample. The CLT is important for estimating population parameters, but it does not transform a population distribution from non-normal to normal. If we know the population distribution is non-normal (e.g., ordinal, nominal, or skewed data), nonparametric tests should be considered. This is why we used Spearman's correlation coefficient (a nonparametric test) in chapter "Descriptive Statistics" to evaluate the relationship between job level and education; these ordinal data are not normally distributed in the population.

Differences in Discrete Data

Nonparametric tests are generally best when working with data measured on a discrete scale since these data do not come from normally distributed populations. The two most commonly used tests to analyze variables measured on a discrete scale are the nonparametric Chi-square test and Fisher’s exact test (Fig. 1).
Both tests organize data within 2 × 2 contingency tables, which enable us to understand interrelations between variables.

Chi-Square Test

The Chi-Square Test of Independence evaluates patterns of observations to determine if categories occur more frequently than we would expect by chance. The Chi-square statistic is defined by:
$$\displaystyle \begin{aligned} {\chi}^2 = \sum\frac{(O_i - E_i)^2}{E_i} \end{aligned}$$
where \(O_i\) is the observed value and \(E_i\) is the expected value.
H0 states that the two variables are independent of one another (i.e., there is no relationship). In addition to the χ2 test statistic, df for the contingency table, defined by df = (rows − 1) × (columns − 1), is required to determine whether we reject or fail to reject H0.
While there is no consensus on the minimum sample size for this test, it is important to note that the χ2 statistic follows a chi-square distribution asymptotically. This means we can only calculate accurate p-values for larger samples, and a general rule of thumb is that the expected value for each cell needs to be at least 5. The challenge with small n-counts is illustrated in Fig. 2; the chi-square distribution approaches a vertical line as df drops below 5.
We will demonstrate how to perform a chi-square test of independence by evaluating whether exit rates are independent of whether an employee works overtime. Let us first construct a 2 × 2 contingency table using the table() function:
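The call below is a minimal sketch; the active and overtime column names are assumptions about the employees data frame inferred from the narrative, not shown in this extract.

# Cross-tabulate active status (rows) against overtime (columns);
# both column names are assumptions about the employees data frame
cont_tbl <- table(employees$active, employees$overtime)
cont_tbl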

                
##
##        No Yes
##   No  110 127
##   Yes 944 289
A mosaic plot is a great way to visualize the delta between expected and observed frequencies for each cell. This can be produced using the mosaicplot() function from the graphics library:
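A sketch of the plotting call; shade = TRUE applies residual-based shading, which is an assumption about how the chapter's figure was produced.

# Mosaic plot with cells shaded by standardized residuals
mosaicplot(cont_tbl, shade = TRUE, main = "Exit Rates by Overtime")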
In Fig. 3, blue indicates that the observed value is higher than the expected value, while red indicates that the observed value is lower than the expected value. Based on this plot, there appear to be meaningful patterns and departures from expected values in both the high and low directions. There are more inactive employees than expected in the overtime group, and more active employees than expected in the no overtime group. These large standardized residuals are indicative of a meaningful relationship between the two categorical variables.
Let us run the chi-square test of independence to determine whether these residuals are statistically significant. This test can be performed in R by passing the contingency table into the chisq.test() function:
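The call is a single line; note that for 2 × 2 tables, chisq.test() applies Yates' continuity correction by default, which matches the output below.

# Yates' continuity correction is applied by default for 2 x 2 tables
chisq.test(cont_tbl)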

                
##
##  Pearson's Chi-squared test with Yates' continuity correction
##
## data:  cont_tbl
## X-squared = 87.564, df = 1, p-value < 2.2e-16
Based on the results, exit rates are not independent of overtime (χ2(1) = 87.56, p < 0.05). Therefore, there is a statistically significant relationship between an employee working overtime and the rate at which they change from active to inactive statuses—confirming what was evident in the mosaic plot.

Fisher’s Exact Test

When the sample size is small, Fisher’s Exact Test can be used to calculate the exact p-value rather than the approximation characteristic of many statistical tests such as the chi-square test.
H0 for Fisher’s exact test is the same as H0 for the chi-square test of independence: There is no relationship between the two categorical variables (i.e., they are independent). We can perform Fisher’s exact test using the fisher.test() function in R:

                
##
##  Fisher's Exact Test for Count Data
##
## data:  cont_tbl
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.1969101 0.3572582
## sample estimates:
## odds ratio
##  0.2654384
Note the odds ratio shown in this output. The odds ratio is the ratio of the odds of an event in one group to the odds of that event in the other group; in this example, the odds of overtime for active versus inactive workers. The odds ratio is defined by:
$$\displaystyle \begin{aligned} OR = \frac{a*d}{b*c} \end{aligned}$$
An odds ratio of 1 indicates no difference in overtime frequency between active and inactive workers. Figure 4 illustrates the cells for the odds ratio calculation for the 2 × 2 contingency table of overtime for active and inactive workers.
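As a quick check, we can compute the ratio by hand; the cell assignments below (a = 110, b = 127, c = 944, d = 289, read row-wise from the contingency table) are an assumption, since Fig. 4 is not reproduced here.

# Sample odds ratio from the 2 x 2 cell counts: (a*d)/(b*c)
(110 * 289) / (127 * 944)   # ~0.265; fisher.test() reports a conditional
                            # maximum likelihood estimate of 0.2654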
Since the 95% CI for the odds ratio does not include 1, we reject the null hypothesis and conclude that exit rates are related to working overtime; this is consistent with results from the chi-square test of independence. Since overtime was indicated far more often for inactive workers than for active workers, it is no surprise that the denominator of our ratio is larger than the numerator (i.e., OR < 1).
As discussed in chapter “Descriptive Statistics”, we can produce a ϕ coefficient to understand the strength of the association by passing the contingency table into the phi function from the psych library:
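A sketch of the call:

library(psych)
# phi() expects a 2 x 2 contingency table; digits controls rounding
phi(cont_tbl, digits = 2)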
The relationship between active status and overtime is negative, and the strength of the relationship is weak (ϕ = −0.25).
Another common method of measuring the strength of the association between two categorical variables is Cramer’s V, which ranges from 0 (no association) to 1 (strong association). In the interest of not muddying the waters with an exhaustive set of alternative methods, the implementation will not be covered.

Differences in Continuous Data

A variety of parametric and nonparametric tests are available for evaluating differences between variables measured on a continuous scale. Figure 5 provides a side-by-side of these parametric and corresponding nonparametric tests of differences.

Independent Samples t-Test

When evaluating differences between two independent samples, social psychology researchers generally select from two tests: Student’s t-test and Welch’s t-test. There are other alternatives, such as Yuen’s t-test and a bootstrapped t-test, but these are less commonly reported in scholarly social science journals and will not be covered in this book.
The Student’s t-test, which was introduced in chapter “Statistical Inference”, is a parametric test whose assumptions of equal variances seldom hold in people analytics. Welch’s t-test is generally preferred to the Student’s t-test because it has been shown to provide better control of Type 1 error rates when homogeneity of variance is not met, whilst losing little robustness (e.g. Delacre et al., 2017). When n is equal between groups, the Student’s t-test is known to be robust to violations of the equal variance assumption, as long as n is sufficiently large to accurately estimate parameters and the underlying distribution is not characterized by high skewness or kurtosis.
Let us explore the mechanics of independent samples t-tests. Figure 6 illustrates mean differences (MD) for nine Welch’s t-tests based on random sample data generated from independent normal populations. Remember that statistical power increases with a large n, as t distributions approximate a normal distribution with larger df. In the context of analysis of differences, this translates to an increase in the likelihood of detecting statistical differences in the means of two distributions. Note that for the two cases where both MD and n are relatively small, mean differences are not statistically significant (p >= 0.05). You may also notice that as the t-statistic approaches 0, statistical differences become less likely since a smaller absolute t-statistic indicates a smaller difference between mean values.
Next, we will walk through the steps involved in performing Welch’s t-test. Let us first visualize the distribution of data for each group using box plots (Fig. 7).
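One way to produce such box plots is a base R sketch like the following (the chapter's actual figure code may differ):

# Box plots of annual compensation for the two job titles
boxplot(annual_comp ~ job_title,
        data = subset(employees, job_title %in% c("Manager", "Research Scientist")),
        ylab = "Annual Compensation")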
While these box plots show the median (rather than the mean, which is what Welch's t-test evaluates), this is a great way to visually inspect whether there are meaningful differences in the distribution of data between groups (in addition to identifying outliers). We could of course use density plots or histograms as an alternative. As we can see, annual compensation for employees with a Manager job title tends to be slightly higher than for those with a Research Scientist job title, and the variance in annual compensation appears to be fairly consistent between the groups.
There are several alternatives to a visual inspection of normality, such as the χ2 Goodness-of-Fit test (Snedecor and Cochran, 1980), Kolmogorov-Smirnov (K-S) test (Chakravarti et al., 1967), and Shapiro-Wilk test (Shapiro and Wilk, 1965). The general idea is consistent across these tests: compare observed data to what would be expected if the data were sampled from a normally distributed population. The χ2 Goodness-of-Fit test compares the counts of data points across intervals of the value range to the counts expected in each interval for a same-sized sample drawn from a normal distribution. For example, if data are sampled from a normally distributed population, it follows that roughly half the values should fall below the mean and half above it. The K-S test compares the observed cumulative distribution to a normal cumulative distribution. The Shapiro-Wilk test is based on the correlation between observed data and the values expected under normality.
We will test for normality using the Shapiro-Wilk test. The null hypothesis for the Shapiro-Wilk test is that the data are normally distributed, so a high p-value indicates that the assumption of normality is satisfied (i.e., failure to reject the null hypothesis of normally distributed data). We can use the with() function together with the shapiro.test() function to run this test in R:
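Based on the data names shown in the output, the calls look like this:

# Test normality of annual compensation within each job title
with(employees, shapiro.test(annual_comp[job_title == "Manager"]))
with(employees, shapiro.test(annual_comp[job_title == "Research Scientist"]))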

                
##
##  Shapiro-Wilk normality test
##
## data:  annual_comp[job_title == "Manager"]
## W = 0.93546, p-value = 0.000103

##
##  Shapiro-Wilk normality test
##
## data:  annual_comp[job_title == "Research Scientist"]
## W = 0.96002, p-value = 3.427e-07
Based on these tests, distributions of annual compensation for Managers and Research Scientists are non-normal (p < 0.05).
While we should not proceed with Welch's t-test given the violation of the normality assumption, let us do so merely to illustrate how the test is implemented in R. To perform Welch's t-test in R, we can simply pass a numeric vector for each of the two groups into the t.test() function.
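A sketch of those steps; the vector names match the output's data line, and var.equal = FALSE (the default) is what makes t.test() perform Welch's test rather than Student's.

# Numeric vectors of annual compensation for each group
comp_mgr  <- employees$annual_comp[employees$job_title == "Manager"]
comp_rsci <- employees$annual_comp[employees$job_title == "Research Scientist"]

# var.equal defaults to FALSE, which yields Welch's t-test
t.test(comp_mgr, comp_rsci)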

                
##
##  Welch Two Sample t-test
##
## data:  comp_mgr and comp_rsci
## t = 0.22623, df = 159.55, p-value = 0.8213
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8860.685 11153.244
## sample estimates:
## mean of x mean of y
##  139900.8  138754.5
If the data adhered to the assumptions of Welch’s t-test, we would conclude that the mean difference between annual compensation for Managers (\(\bar {x} = 139{,}901\)) and Research Scientists (\(\bar {x} = 138{,}755\)) is not significant (t(159.55) = 0.23, p = 0.82).
Note that we can access specific metrics from this output by storing results to an object and then referencing specific elements by name or index:
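For example, storing the result and extracting elements by name:

welch_t <- t.test(comp_mgr, comp_rsci)
welch_t$statistic   # the t-statistic
welch_t$parameter   # the Welch-Satterthwaite df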

                
##         t
## 0.2262262

##       df
## 159.5544
When object elements are referenced by index, the element name is displayed in the output to clarify what the metric represents:
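A sketch of single-bracket indexing; the first three positions are standard for htest objects, while the position of method can vary by R version (an assumption worth verifying with names(welch_t)):

welch_t[1]   # $statistic
welch_t[2]   # $parameter
welch_t[3]   # $p.value
welch_t[9]   # $method; this position is an assumption, check names(welch_t)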

                
## $statistic
##         t
## 0.2262262

## $parameter
##       df
## 159.5544

## $p.value
## [1] 0.8213151

## $method
## [1] "Welch Two Sample t-test"
Given df = 159.55, you may be wondering how df is calculated for Welch's t-test, since thus far we have only discussed the basic df calculation outlined in chapter "Statistical Inference"; namely, df = n − 1. Welch's t-test uses the Welch-Satterthwaite equation for df (Satterthwaite, 1946; Welch, 1947). This equation approximates df for a linear combination of independent sample variances, which means that if samples are not independent, this approximation may not be valid. The Welch-Satterthwaite equation is defined by:
$$\displaystyle \begin{aligned} df = \frac {\left(\frac{s^2_1}{n_1} + \frac{s^2_2}{n_2}\right)^2} {\frac{1}{n_1 - 1} \left(\frac{s^2_1}{n_1}\right)^2 + \frac{1}{n_2 - 1} \left(\frac{s^2_2}{n_2}\right)^2} \end{aligned}$$
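As a sketch, the equation can be verified manually with the two group vectors:

# Manual Welch-Satterthwaite df; should be ~159.55 for these data
s1 <- var(comp_mgr);  n1 <- length(comp_mgr)
s2 <- var(comp_rsci); n2 <- length(comp_rsci)
(s1 / n1 + s2 / n2)^2 /
  ((s1 / n1)^2 / (n1 - 1) + (s2 / n2)^2 / (n2 - 1))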
Cohen’s d is a standardized measure of the difference between two means that helps us understand the size (or practical significance) of observed mean differences. Cohen’s d is defined by:
$$\displaystyle \begin{aligned} d = \frac{\bar{x}_1 - \bar{x}_2} {s_p} \end{aligned}$$
where sp represents the pooled standard deviation defined by:
$$\displaystyle \begin{aligned} s_p = \sqrt{\frac{s^2_1 + s^2_2}{2}} \end{aligned}$$
Cohen’s d can be produced using the cohen.d() function from the effsize package in R. The following thresholds can be referenced as a general rule of thumb for interpreting effect size:
  • Small = 0.2
  • Medium = 0.5
  • Large = 0.8
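A sketch of the call using the same group vectors as before:

library(effsize)
cohen.d(comp_mgr, comp_rsci)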

                
##
## Cohen's d
##
## d estimate: 0.0273669 (negligible)
## 95 percent confidence interval:
##      lower      upper
## -0.2004390  0.2551728
Not only is the difference statistically non-significant, but Cohen's d = 0.03 also indicates a negligible effect size. Therefore, there is nothing of interest based on these tests of statistical and practical significance.

Mann-Whitney U Test

A popular nonparametric (distribution-free) alternative to Welch’s t-test is the Mann-Whitney U Test, also referred to as the Wilcoxon Rank-Sum Test. Rather than comparing the mean between two groups, like the Student’s t-test or Welch’s t-test, the Mann-Whitney U test considers the entire distribution by evaluating the extent to which the ranks are consistent between groups (i.e., similarity in the proportion of records with each value). When distributions are similar, the medians of the two groups are compared.
The wilcox.test() function is used to run this test in R. Let us illustrate by examining whether engagement (an ordinal variable in our data set) is significantly different between those who have been promoted in the past year and those who have not:
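A sketch of the setup and call; the engagement and promotion field names are assumptions about the employees data frame, though the vector names match the output's data line.

# Engagement scores split by promotion status (field names assumed)
no_promo <- employees$engagement[employees$promotion == 0]
promo    <- employees$engagement[employees$promotion == 1]

wilcox.test(no_promo, promo)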

                
##
##  Wilcoxon rank sum test with continuity correction
##
## data:  no_promo and promo
## W = 196056, p-value = 0.6707
## alternative hypothesis: true location shift is not equal to 0
Based on these results, we fail to reject the null hypothesis, which states that there is no difference in engagement between those with and without promotions (W = 196,056, p = 0.67). Note the reference to continuity correction in the output. Continuity correction is applied when a continuous distribution is used to approximate a discrete distribution; the Mann-Whitney U test applied it here because our ordinal (discrete) engagement data were tested against a continuous approximation.
Just as Cohen’s d is used to measure the magnitude of difference between a pair of means, Cliff’s delta can be leveraged to evaluate the size of differences between ordinal variables. Cliff’s delta measures how often a value in one distribution is higher than values in another, and this is appropriate in situations in which a nonparametric test of differences is used. This statistic can be produced using the cliff.delta() function from the effsize package in R.

                
##
## Cliff's Delta
##
## delta estimate: -0.0131625 (negligible)
## 95 percent confidence interval:
##       lower       upper
## -0.07329491  0.04706528
Some (e.g., Vargha and Delaney, 2000) have endeavored to categorize the Cliff’s delta statistic, which ranges from −1 to 1, into effect size buckets. However, such categorizations are far more controversial than thresholds attributed to Cohen’s d. Nevertheless, the near-zero delta estimate of −0.01 indicates a negligible difference.

Paired Samples t-Test

A Paired Samples t-Test is used to compare means between pairs of measurements. This test is known by many other names, such as a dependent samples t-test, paired-difference t-test, matched pairs t-test, and repeated-samples t-test.
The assumption of normality in the context of a paired samples t-test relates to normally distributed paired differences. This is important, as the p-value for the test statistic will not be valid if this assumption is violated.
To illustrate, let us design an experiment. Let us assume morale has declined for employees who travel frequently, and several actions have been proposed by a task force to help address this. The task force has decided to pilot a new flexible work benefit over a six-month period to determine if it has a meaningful effect on morale. This new benefit is piloted to a random sample of frequent travelers, and our task is to test whether the outcomes warrant a broader rollout to frequent travelers.
Our DV (happiness) will be measured using a composite index derived from individual engagement, environment satisfaction, job satisfaction, and relationship satisfaction scores. Our objective is to determine if there is a significant improvement in this happiness index for the treatment group (those who are part of the flexible work pilot) relative to the pre/post difference for the control group (those not selected for the flexible work pilot).
While we could simply look at the pre/post differences for the treatment group, we understand from chapter “Research Design” that this would be a weak design that may lead to inaccurate conclusions. There could be alternative explanations for any observed increases in happiness that are unrelated to the intervention itself. For example, between time 1 and time 2, travel frequency may have decreased for everyone, which may contribute to overall happiness. By comparing pre/post differences between the treatment and control groups, we gain more confidence in isolating the effect of the flexible work benefit on happiness since alternative explanations should be reflected in any pre/post changes observed for the control group.
Let us prepare the data for this experiment. Since employees is a cross-sectional data set (single point-in-time), we will generate simulated data for repeated measures (i.e., post-intervention scores).
It is important to remember that a paired samples t-test requires that each pair of measurements be obtained from the same subject. Therefore, if an employee terminates between time 1 and time 2, or does not provide the survey responses needed to calculate the happiness index at both time 1 and time 2, the employee should be removed from the data since paired measurements will not be available.
The variance is not assumed to be equal for a paired test; therefore, the homogeneity of variance assumption is not applicable in this context.
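The chapter's data generation code is not shown in this extract; the following is a minimal sketch of how such paired measures might be simulated. The object and field names (treat_metrics, ctrl_metrics, pre_ind, post_ind, diff) match those referenced below, the group sizes are implied by the df in the t-test output, and the distribution parameters are illustrative assumptions.

set.seed(1234)
n_treat <- 138  # implied by df = 137 in the paired t-test output
n_ctrl  <- 139  # implied by df = 138

# Simulated pre/post happiness indices; a small positive shift is applied
# to the treatment group (all parameters are assumptions)
treat_metrics <- data.frame(pre_ind = rnorm(n_treat, mean = 0.70, sd = 0.10))
treat_metrics$post_ind <- treat_metrics$pre_ind + rnorm(n_treat, 0.015, 0.005)
treat_metrics$diff     <- treat_metrics$post_ind - treat_metrics$pre_ind

ctrl_metrics <- data.frame(pre_ind = rnorm(n_ctrl, mean = 0.70, sd = 0.10))
ctrl_metrics$post_ind <- ctrl_metrics$pre_ind + rnorm(n_ctrl, 0, 0.001)
ctrl_metrics$diff     <- ctrl_metrics$post_ind - ctrl_metrics$pre_ind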
Next, we will evaluate whether paired differences are normally distributed using the Shapiro-Wilk test. While individual survey items are measured on an ordinal scale, our derived happiness index is the average of multiple ordinal items and can be considered an approximately continuous variable. There are \(2^p - p - 1\) combinations of scores, where p is the number of variables. For our happiness index, there are \(2^4 - 4 - 1 = 11\) combinations.
Based on a visual inspection (Fig. 8), the distributions of differences appear to be roughly normal. This should not be surprising given random values were sampled from normal distributions to derive artificial post-intervention happiness indices.
Let us test for normality by performing the Shapiro-Wilk test on vectors of differences:
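The calls mirror the earlier Shapiro-Wilk tests:

shapiro.test(treat_metrics$diff)
shapiro.test(ctrl_metrics$diff)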

                
##
##  Shapiro-Wilk normality test
##
## data:  treat_metrics$diff
## W = 0.98936, p-value = 0.3738

##
##  Shapiro-Wilk normality test
##
## data:  ctrl_metrics$diff
## W = 0.99096, p-value = 0.5134
Since p >= 0.05 for both tests, the assumption of normally distributed differences is met. Given the data generative process implemented for this example, differences would become increasingly normal as the sample size increases. We now have the green light to perform the paired samples t-test.
We can run a paired samples t-test in R by passing paired = TRUE as an argument to the same t.test() function used for the independent samples t-test. Since we are investigating whether the average post-intervention happiness index is significantly greater than the average pre-intervention happiness index, we also need the alternative = "greater" argument since the default two-tailed test only evaluates whether the average indices are significantly different (regardless of whether the pre- or post-intervention index is larger).
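A sketch of the call for the treatment group:

t.test(treat_metrics$post_ind, treat_metrics$pre_ind,
       paired = TRUE, alternative = "greater")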

                
##
##  Paired t-test
##
## data:  treat_metrics$post_ind and treat_metrics$pre_ind
## t = 35.906, df = 137, p-value < 2.2e-16
## alternative hypothesis: true mean difference is greater than 0
## 95 percent confidence interval:
##  0.01490482        Inf
## sample estimates:
## mean difference
##      0.01562551
These results indicate that the post-intervention happiness index is significantly larger than the pre-intervention happiness index. This is encouraging with respect to the potential efficacy of the flexible work pilot, but the question about whether the control group experienced a commensurate improvement over the observation period remains unanswered.
Let us run the same paired samples t-test using the control group indices:
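The call is identical apart from the input vectors:

t.test(ctrl_metrics$post_ind, ctrl_metrics$pre_ind,
       paired = TRUE, alternative = "greater")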

##
##  Paired t-test
##
## data:  ctrl_metrics$post_ind and ctrl_metrics$pre_ind
## t = 0.59995, df = 138, p-value = 0.2748
## alternative hypothesis: true mean difference is greater than 0
## 95 percent confidence interval:
##  -8.719262e-05           Inf
## sample estimates:
## mean difference
##    4.953658e-05
            
Since p >= 0.05, we can conclude that there was not a significant increase in happiness indices for the control group, which provides additional (but not conclusive) support for the effectiveness of the flexible work benefit.
Chapter "Linear Regression" will introduce linear regression, which is a powerful modeling tool for people analytics that helps control for multiple alternative explanations of associations with the DV in order to isolate the unique effects of each IV. Difference-in-differences (DiD) estimation is an alternative quasi-experimental approach that originated in econometrics for evaluating the effects of interventions like these, but it is beyond the scope of this book. Angrist and Pischke (2009) is an excellent resource for learning about these methods.
We can evaluate the magnitude of mean differences for these paired samples by passing the paired = TRUE argument to the same cohen.d() function used for independent samples:
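A sketch of the call:

cohen.d(treat_metrics$post_ind, treat_metrics$pre_ind, paired = TRUE)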

                  
##
## Cohen's d
##
## d estimate: 0.03132117 (negligible)
## 95 percent confidence interval:
##      lower      upper
## 0.02960345 0.03303890
Though the pre/post difference is statistically significant for the treatment group, the size of the difference is negligible (d = 0.03).

Wilcoxon Signed-Rank Test

The Wilcoxon Signed-Rank Test is the nonparametric alternative to the paired samples t-test. This distribution-free test does not require normally distributed differences.
The matched Wilcoxon Signed-Rank test is performed in R using the same wilcox.test() function used to perform the unmatched Wilcoxon Rank-Sum test. Though we can use a paired samples t-test to test differences for our flexible work benefit study since the assumption of normally distributed differences is met, let us run a Wilcoxon Signed-Rank test for demonstrative purposes:
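A sketch of the calls for the treatment and control groups:

wilcox.test(treat_metrics$post_ind, treat_metrics$pre_ind, paired = TRUE)
wilcox.test(ctrl_metrics$post_ind, ctrl_metrics$pre_ind, paired = TRUE)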

                
##
##  Wilcoxon signed rank test with continuity correction
##
## data:  treat_metrics$post_ind and treat_metrics$pre_ind
## V = 9591, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

##
##  Wilcoxon signed rank test with continuity correction
##
## data:  ctrl_metrics$post_ind and ctrl_metrics$pre_ind
## V = 5166, p-value = 0.5275
## alternative hypothesis: true location shift is not equal to 0
Consistent with results from the paired samples t-tests, significantly higher post-intervention happiness indices were observed for the treatment group but not for the control group.
We can evaluate the magnitude of differences for these paired samples by passing the paired = TRUE argument to the same cliff.delta() function used for independent samples:

                
##
## Cliff's Delta
##
## delta estimate: 0.1544844 (small)
## 95 percent confidence interval:
##      lower      upper
## 0.01652557 0.28667094
With Cliff’s delta, we observe a small difference between pre/post indices for the treatment group (delta estimate =  0.15).

Analysis of Variance (ANOVA)

Analysis of Variance (ANOVA) is used to determine whether the means of scale-level DVs are equal across nominal-level variables with three or more independent categories.
It is important to understand that H0 in ANOVA not only requires all group means to be equal but their complex contrasts as well. For example, if we have four groups named A, B, C, and D, H0 requires that \(\mu_A = \mu_B = \mu_C = \mu_D\) is true, as well as the various complex contrasts such as \(\mu_{A,B} = \mu_{C,D}\), \(\mu_A = \mu_{B,C,D}\), and \(\mu_D = \mu_{B,C}\). Therefore, a difference between one or more of these contrasts results in a decision to reject H0 in ANOVA. As a result, we may find a significant F-statistic but no significant differences between pairwise means.
It is also possible to find a significant pairwise mean difference but a non-significant result from ANOVA. As you may recall from chapter “Statistical Inference”, multiple comparisons reduce the power of statistical tests. Since multiple tests of mean differences are performed with ANOVA, the familywise error rate is used to adjust for the increased probability of a Type I error across the set of analyses. Since the power of a single pairwise test is greater relative to the power of familywise comparisons, we may find a significant result for the former but not the latter.
ANOVA requires IVs to be categorical (nominal or ordinal) and the DV to be continuous (interval or ratio). A one-way ANOVA is used to determine how one categorical IV influences a continuous DV. A two-way ANOVA is used to determine how two categorical IVs influence a continuous DV. A three-way ANOVA is used to evaluate how three categorical IVs influence a continuous DV. An ANOVA that uses two or more categorical IVs is often referred to as a factorial ANOVA. As discussed in chapter “Getting Started”, it is important to remain grounded in specific hypotheses, as a significant ANOVA may not actually test what is being hypothesized.
ANOVA is not a test, per se, but an F-test underpins it. The mathematical procedure behind the F-test is relatively straightforward:

1. Compute the within-group variance, also known as residual variance. Simply put, this tells us how different each member of a group is from the group's average.

2. Compute the between-group variance. This represents how different the group means are from one another.

3. Produce the F-statistic, which is the ratio of between-group variance to within-group variance.
More formally, the F-statistic is defined by:
$$\displaystyle \begin{aligned} F = \frac{MS_{\mbox{between}}}{MS_{\mbox{within}}} \end{aligned}$$
where:
$$\displaystyle \begin{aligned} MS_{\mbox{between}} = \frac{SS_{\mbox{between}}}{df_{\mbox{between}}}, \end{aligned}$$
$$\displaystyle \begin{aligned} MS_{\mbox{within}} = \frac{SS_{\mbox{within}}}{df_{\mbox{within}}}, \end{aligned}$$
$$\displaystyle \begin{aligned} SS_{\mbox{between}} = \displaystyle\sum_{j=1}^{p} n_j(\bar{x}_j-\bar{x})^2, \end{aligned}$$
$$\displaystyle \begin{aligned} SS_{\mbox{within}} = \displaystyle\sum_{j=1}^{p} \displaystyle\sum_{i=1}^{n_j} (x_{ij}-\bar{x_j})^2 \end{aligned}$$
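To make these formulas concrete, here is a self-contained sketch computing the F-statistic by hand on a small hypothetical data set (not from the chapter):

# Hypothetical grouped data
value <- c(10, 12, 9, 20, 22, 19, 30, 31, 29)
group <- rep(c("A", "B", "C"), each = 3)

grand_mean  <- mean(value)
group_means <- tapply(value, group, mean)
n_j         <- tapply(value, group, length)

ss_between <- sum(n_j * (group_means - grand_mean)^2)   # SS between
ss_within  <- sum((value - group_means[group])^2)       # SS within

df_between <- length(n_j) - 1              # p - 1
df_within  <- length(value) - length(n_j)  # N - p

(ss_between / df_between) / (ss_within / df_within)  # F-statistic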

One-Way ANOVA

To illustrate how to perform a one-way ANOVA, we will test the hypothesis that mean annual compensation is equal across job satisfaction levels.
Each observation in employees represents a unique employee, and a given employee can only have one job satisfaction score and one annual compensation value. The assumption of independence is met since each record exists independently of the others and each job satisfaction group is composed of different employees.
Levene’s test (Levene, 1960) can be used to test the homogeneity of variance assumption—even with non-normal distributions. This can be performed in R using the leveneTest() function from the car package:

                
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    3  0.3293 0.8042
##       1466
The test statistic associated with Levene’s test relates to the null hypothesis that there are no significant differences in variances across the job satisfaction levels. Since p >= 0.05, we fail to reject this null hypothesis and can assume equal variances.
Next, let us test the assumption of normality. It is important to note that the assumption of normality does not apply to the distribution of the DV but to the distribution of residuals for each group of the IV. Residuals in the context of ANOVA represent the difference between the actual values of the continuous DV relative to its mean value for each level of the categorical IV (e.g., \(y - \bar {y}_A\), \(y - \bar {y}_B\), \(y - \bar {y}_C\)). In ANOVA, we expect the residuals to be normally distributed around a mean of 0 (the balance point) when the data are normally distributed within each IV category; the more skewed the data, the larger the average distance of each DV value from the mean.
As we can see in Fig. 9, annual compensation data are not normally distributed within job satisfaction groups. Therefore, we would not expect the distribution of residuals to be normally distributed within these groups either.
To test whether the assumption of normality is met, we will first produce and review a quantile-quantile (Q-Q) plot. A Q-Q plot compares two probability distributions by plotting their quantiles (data partitioned into equal-sized groups) against each other. After partitioning annual compensation into groups differentiated by job satisfaction level, we can use the ggqqplot() function from the ggpubr library to build a Q-Q plot and evaluate the distribution of residuals.
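A sketch of those steps; extracting residuals from a one-way aov() fit is one way to obtain the group-wise deviations, though the chapter may compute them differently.

library(ggpubr)
# Residuals: each annual compensation value minus its group (job_sat) mean
residuals <- residuals(aov(annual_comp ~ as.factor(job_sat), data = employees))
ggqqplot(residuals)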
To satisfy the assumption of normality, residuals must lie along the diagonal reference line. Based on the Q-Q plot in Fig. 10, there is a clear departure from normality at both ends of the theoretical range.
Let us test for normality using the Shapiro-Wilk test:
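Using the residuals computed above:

shapiro.test(residuals)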

                
##
##  Shapiro-Wilk normality test
##
## data:  residuals
## W = 0.95874, p-value < 2.2e-16
Since p < 0.05, we reject the null hypothesis of normally distributed data, which indicates that the assumption of normality is violated. This should not be surprising based on the deviation from normality we observed in Fig. 10.
Because the assumption of normality is violated, we have two options. First, we can attempt to transform the data so that the residuals based on the transformed values are normally distributed. If the data are resistant to transformation, we can leverage a nonparametric alternative to ANOVA.
Let us first try several common data transformations and then examine the resulting Q-Q plots:
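A sketch of log, square-root, and inverse transformations (assuming annual compensation is strictly positive):

library(ggpubr)
# Residuals after common transformations of the DV
log_resid  <- residuals(aov(log(annual_comp)  ~ as.factor(job_sat), data = employees))
sqrt_resid <- residuals(aov(sqrt(annual_comp) ~ as.factor(job_sat), data = employees))
inv_resid  <- residuals(aov(1 / annual_comp   ~ as.factor(job_sat), data = employees))

# Q-Q plot of residuals under each transformation
ggqqplot(log_resid)
ggqqplot(sqrt_resid)
ggqqplot(inv_resid)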
Even with these transformations, there is still a clear S-shaped curve about the residuals (Fig. 11). Though we cannot proceed with ANOVA due to violated assumptions, let us demonstrate the implementation steps for ANOVA. Performing ANOVA involves pairing the aov() function with the summary() function to display model output:
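A sketch of the call:

summary(aov(annual_comp ~ as.factor(job_sat), data = employees))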

                
##                      Df    Sum Sq   Mean Sq F value Pr(>F)
## as.factor(job_sat)    3 1.494e+10 4.980e+09   2.795  0.039 *
## Residuals          1466 2.612e+12 1.782e+09
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The Kruskal-Wallis H test is the nonparametric alternative to a one-way ANOVA (Daniel, 1990) and an appropriate alternative for investigating median differences in annual compensation by job satisfaction in our data. This test can be performed using the kruskal.test() function in R:
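A sketch of the call using the formula interface implied by the output's data line:

kruskal.test(annual_comp ~ job_sat, data = employees)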

                
##
##  Kruskal-Wallis rank sum test
##
## data:  annual_comp by job_sat
## Kruskal-Wallis chi-squared = 8.3242, df = 3, p-value = 0.03977
Since p < 0.05, we can conclude that there are significant differences in median compensation across the groups. However, this test does not indicate which groups are different. We can utilize the pairwise.wilcox.test() function to compute pairwise Wilcoxon rank-sum tests to identify where differences exist:
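A sketch of the call; the Benjamini-Hochberg (BH) adjustment shown in the output is specified explicitly here:

pairwise.wilcox.test(employees$annual_comp, employees$job_sat,
                     p.adjust.method = "BH")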

                
##
##  Pairwise comparisons using Wilcoxon rank sum test with continuity correction
##
## data:  employees$annual_comp and employees$job_sat
##
##   1     2     3
## 2 0.298 -     -
## 3 0.041 0.298 -
## 4 0.041 0.298 0.879
##
## P value adjustment method: BH
Based on the results, there are significant pairwise differences in median annual compensation for job satisfaction levels 3 and 4 relative to level 1.

Factorial ANOVA

Factorial ANOVA is any ANOVA that uses two or more categorical IVs, such as a two-way or three-way ANOVA. The following output reflects the cross-tabulation of average annual compensation for each combination of two factors—job satisfaction and stock option level.
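One way to produce this cross-tabulation is with aggregate(), a sketch consistent with the output's layout:

# Mean annual compensation for each job_sat x stock_opt_lvl combination
aggregate(annual_comp ~ job_sat + stock_opt_lvl, data = employees, FUN = mean)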

                
##    job_sat stock_opt_lvl annual_comp
## 1        1             0    141254.0
## 2        2             0    138753.3
## 3        3             0    132159.2
## 4        4             0    132227.0
## 5        1             1    141763.9
## 6        2             1    135494.5
## 7        3             1    135235.7
## 8        4             1    135569.0
## 9        1             2    146240.0
## 10       2             2    146432.0
## 11       3             2    145080.0
## 12       4             2    143019.3
## 13       1             3    154844.4
## 14       2             3    144254.1
## 15       3             3    135672.7
## 16       4             3    127102.9
As we have already discussed, a difference between one or more of these contrasts may result in a decision to reject H0 in ANOVA. We may also find a significant pairwise difference but a non-significant result from ANOVA, since the familywise error rate adjustment applied in the context of multiple comparisons reduces statistical power.
Factorial ANOVA can be performed by chaining together variables with a + operator within the same aov() function used for one-way ANOVA:
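A sketch of the call:

summary(aov(annual_comp ~ as.factor(job_sat) + as.factor(stock_opt_lvl),
            data = employees))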

                
##                            Df    Sum Sq   Mean Sq F value Pr(>F)
## as.factor(job_sat)          3 1.494e+10 4.980e+09   2.803 0.0386 *
## as.factor(stock_opt_lvl)    3 1.260e+10 4.201e+09   2.365 0.0694 .
## Residuals                1463 2.599e+12 1.777e+09
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
While mean annual compensation is significantly different across job satisfaction levels, this output alone is not too helpful in understanding the nature of the differences. These statistical significance markers indicate that there are meaningful differences that warrant a deeper understanding. Relationships of job satisfaction and stock option level with annual compensation are illustrated more effectively in Fig. 12.
As we can see, there is a strong negative relationship between job satisfaction and average annual compensation among employees with the highest stock option level (3). The relationship between job satisfaction and average annual compensation appears to be negative for employees with other stock option levels as well, albeit much weaker.
These relationships may initially seem counterintuitive, as one might expect higher levels of job satisfaction to contribute to higher performance and, consequently, higher compensation. There may be other variables that happen to be correlated with job satisfaction and/or stock option level that are the actual determinants of annual compensation. For example, a set of jobs may all feature a stock option level of 3 yet show markedly different average job satisfaction scores and annual compensation among the workers in those jobs. Without accounting for additional variables that may explain why employees vary in the amount of annual compensation they earn, the limited set of relationships shown in Fig. 12 may lead to a flawed understanding and inaccurate conclusions.
Three-way factorials (and beyond) become difficult to visualize and understand in the way one-way ANOVA and two-way factorials have been explained in this chapter. In chapter “Linear Regression”, we will discuss how to create linear combinations of many IVs and parse the output to understand how they independently and jointly help explain variation in the DV.

Review Questions

1. What are the main differences between a Chi-square test and Fisher's exact test?

2. Why is it problematic to test for significant differences using the χ2 statistic with extremely small samples (e.g., n < 5)?

3. What are the general assumptions of parametric tests?

4. What is a benefit of Welch's t-test over the Student's t-test?

5. How does a paired samples t-test differ from an independent samples t-test?

6. In what ways does the Wilcoxon signed-rank test differ from the paired samples t-test?

7. How can the magnitude of differences (i.e., practical significance) be quantified when working with data measured on a continuous scale?

8. How can the magnitude of differences (i.e., practical significance) be quantified when working with data measured on an ordinal scale?

9. What null hypothesis does ANOVA test?

10. What are some ways to better understand the nature of statistical differences indicated in the output of ANOVA?
References

Angrist, J. D., & Pischke, J.-S. (2009). Mostly harmless econometrics: An empiricist's companion. Princeton, NJ: Princeton University Press.

Chakravarti, I. M., Laha, R. G., & Roy, J. (1967). Handbook of methods of applied statistics. New York: Wiley.

Daniel, W. W. (1990). Kruskal–Wallis one-way analysis of variance by ranks. Applied nonparametric statistics (2nd ed., pp. 226–234). Boston: PWS-Kent.

Levene, H. (1960). Robust tests for equality of variances. In I. Olkin (Ed.), Contributions to probability and statistics (pp. 278–292). Palo Alto: Stanford University Press.

Snedecor, G. W., & Cochran, W. G. (1980). Statistical methods (7th ed.). Ames, Iowa: Iowa State University Press.

Vargha, A., & Delaney, H. D. (2000). A critique and improvement of the CL common language effect size statistics of McGraw and Wong. Journal of Educational and Behavioral Statistics, 25(2), 101–132.