1 Introduction
In various fields of science, it is frequently required that units (persons, individuals, objects) are rated on a scale by human observers. Examples are teachers who rate assignments completed by pupils to assess their proficiency, neurologists who rate the severity of patients' symptoms to determine the stage of Alzheimer's disease, psychologists who classify patients' mental health problems, and biologists who examine features of animals in order to find similarities between them, which enables the classification of newly discovered species.
To study whether ratings are reliable, a standard procedure is to ask two raters to independently judge the same group of units. The agreement between the ratings can then be used as an indication of the reliability of the classifications by the raters (McHugh 2012; Shiloach et al. 2010; Wing et al. 2002; Blackman and Koval 2000). Requirements for obtaining reliable ratings include clear definitions of the categories and the use of clear scoring criteria. A sufficient level of agreement ensures interchangeability of the ratings and consensus in decisions (Warrens 2015).
Assessing reliability is of concern for both categorical and interval rating instruments. For categorical ratings, kappa coefficients are commonly used. For example, Cohen's kappa coefficient (Cohen 1960) is commonly used to quantify the extent to which two raters agree on a nominal (unordered) scale (De Raadt et al. 2019; Viera and Garrett 2005; Muñoz and Bangdiwala 1997; Graham and Jackson 1993; Maclure and Willett 1987; Schouten 1986), while the weighted kappa coefficient (Cohen 1968) is widely used for quantifying agreement between ratings on an ordinal scale (Moradzadeh et al. 2017; Vanbelle 2016; Warrens 2012a, 2013, 2014; Vanbelle and Albert 2009; Crewson 2005; Cohen 1968). Both Cohen's kappa and weighted kappa are standard tools for assessing agreement in the behavioral, social, and medical sciences (De Vet et al. 2013; Sim and Wright 2005; Banerjee 1999).
The Pearson correlation and intraclass correlation coefficients are widely used for assessing reliability when ratings are on an interval scale (McGraw and Wong 1996; Shrout and Fleiss 1979). Shrout and Fleiss (1979) discuss six intraclass correlation coefficients. Different intraclass correlations are appropriate in different situations (Warrens 2017; McGraw and Wong 1996). Both kappa coefficients and correlation coefficients can be used to assess the reliability of ordinal rating scales.
The primary aim of this study is to provide a thorough understanding of seven reliability coefficients that can be used with ordinal rating scales, such that the applied researcher can make a sensible choice among these seven coefficients. A second aim of this study is to find out whether the choice of coefficient matters. We compare the following reliability coefficients: Cohen's unweighted kappa, weighted kappa with linear and quadratic weights, intraclass correlation ICC(3,1) (Shrout and Fleiss 1979), Pearson's and Spearman's correlations, and Kendall's tau-b. We have the following three research questions: (1) Under what conditions do quadratic kappa and the Pearson and intraclass correlations produce similar values? (2) To what extent do we reach the same conclusions about inter-rater reliability with different coefficients? (3) To what extent do the coefficients measure agreement in similar ways?
To answer the research questions, we will compare the coefficients analytically and by using simulated and empirical data. These different approaches complement each other. The analytical methods are used to make clear how some of the coefficients are related. The simulated and empirical data are used to explore a wide variety of inter-rater reliability situations. For the empirical comparison, we will use two different real-world datasets. The marginal distributions of the real-world datasets are in many cases skewed. In contrast, the marginal distributions of the simulated datasets are symmetric.
The paper is organized as follows. The second and third sections are used to define, respectively, the kappa coefficients and correlation coefficients, and to discuss connections between the coefficients. In the fourth section, we briefly discuss the comparison of reliability coefficients in Parker et al. (2013) and we present hypotheses with regard to the research questions. In the fifth section, three coefficients that can be expressed in terms of the rater means, variances, and covariance (quadratic kappa, intraclass correlation ICC(3,1), and the Pearson correlation) are compared analytically. In the sixth section, we compare all seven coefficients in a simulation study. This is followed by a comparison of all seven coefficients using two real-world datasets in the seventh section. The final section contains a discussion and recommendations.
2 Kappa Coefficients
Suppose that two raters independently classified n units (individuals, objects, products) into one of k ≥ 3 ordered categories that were defined in advance. Let pij denote the proportion of units that were assigned to category i by the first rater and to category j by the second rater. Table 1 is an example of an agreement table with elements pij for k = 4. The table presents the pairwise classifications of a sample of units into four categories. The diagonal cells p11, p22, p33, and p44 are the proportions of units on which the raters agree. The off-diagonal cells consist of units on which the raters have not reached agreement. The marginal totals or base rates pi+ and p+j reflect how often a category is used by a rater.
Table 1
Pairwise classifications of units into four categories
Rater 1 \ Rater 2 | Category 1 | Category 2 | Category 3 | Category 4 | Total |
Category 1 | p11 | p12 | p13 | p14 | p1+ |
Category 2 | p21 | p22 | p23 | p24 | p2+ |
Category 3 | p31 | p32 | p33 | p34 | p3+ |
Category 4 | p41 | p42 | p43 | p44 | p4+ |
Total | p+1 | p+2 | p+3 | p+4 | 1 |
Table 2 is an example of an agreement table with real-world numbers. Table 2 contains the pairwise classifications of two observers who each rated the same teacher on 35 items of the International Comparative Analysis of Learning and Teaching (ICALT) observation instrument (Van de Grift 2007). The agreement table is part of the data used in Van der Scheer et al. (2017). The Van der Scheer data are further discussed in the fifth section.
Table 2
Pairwise classifications of two observers who rated teacher 7 on 35 ICALT items (Van der Scheer et al. 2017)
Observer 1 \ Observer 2 | Cat. 1 | Cat. 2 | Cat. 3 | Cat. 4 | Total |
1 = Predominantly weak | 0.03 | 0 | 0 | 0 | 0.03 |
2 = More weaknesses than strengths | 0 | 0.14 | 0 | 0 | 0.14 |
3 = More strengths than weaknesses | 0 | 0.03 | 0.49 | 0 | 0.52 |
4 = Predominantly strong | 0 | 0 | 0.20 | 0.11 | 0.31 |
Total | 0.03 | 0.17 | 0.69 | 0.11 | 1.00 |
The weighted kappa coefficient can be defined as a similarity coefficient or as a dissimilarity coefficient. In the dissimilarity coefficient definition, it is usual to assign a weight of zero to full agreements and to allocate to disagreements a positive weight whose magnitude increases proportionally to their seriousness (Gwet 2012). Each of the k² cells of the agreement table has its own disagreement weight, denoted by wij, where wij ≥ 0 for all i and j. Cohen's weighted kappa (Cohen 1968) is then defined as
$$ \kappa_{w}=1-\frac{\sum\limits^{k}_{i=1}\sum\limits^{k}_{j=1}w_{ij}p_{ij}}{\sum\limits^{k}_{i=1}\sum\limits^{k}_{j=1} w_{ij}p_{i+}p_{+j}}. $$
(1)
Weighted kappa in Eq. 1 consists of two quantities: the proportion of observed weighted disagreement in the numerator of the fraction, and the proportion of expected weighted disagreement in the denominator. The value of weighted kappa is not affected when all weights are multiplied by a positive number.
Using wij = 1 if i ≠ j and wii = 0 in Eq. 1, we obtain Cohen's kappa or unweighted kappa
$$ \kappa=\frac{P_{o}-P_{e}}{1-P_{e}}=\frac{\sum\limits^{k}_{i=1}(p_{ii}-p_{i+}p_{+i})}{1-\sum\limits^{k}_{i=1}p_{i+}p_{+i}}, $$
(2)
where \(P_{o}={\sum }^{k}_{i=1}p_{ii}\) is the proportion observed agreement, i.e., the proportion of units on which the raters agree, and \(P_{e}={\sum }^{k}_{i=1}p_{i+}p_{+i}\) is the proportion expected agreement. Unweighted kappa differentiates only between agreements and disagreements. Furthermore, unweighted kappa is commonly used when ratings are on a nominal (unordered) scale, but it can be applied to scales with ordered categories as well.
For ordinal scales, frequently used disagreement weights are the linear weights and the quadratic weights (Vanbelle 2016; Warrens 2012a; Vanbelle and Albert 2009; Schuster 2004). The linear weights are given by wij = |i − j|. The linearly weighted kappa, or linear kappa for short, is given by
$$ \kappa_{l}=1-\frac{\sum\limits^{k}_{i=1}\sum\limits^{k}_{j=1}|i-j|p_{ij}}{\sum\limits^{k}_{i=1}\sum\limits^{k}_{j=1}|i-j|p_{i+}p_{+j}}. $$
(3)
With linear weights, the categories are assumed to be equally spaced (Brenner and Kliebsch 1996). For many real-world data, linear kappa gives a higher value than unweighted kappa (Warrens 2013). For example, for the data in Table 2, we have κ = 0.61 and κl = 0.68. Furthermore, the quadratic weights are given by wij = (i − j)², and the quadratically weighted kappa, or quadratic kappa for short, is given by
$$ \kappa_{q}=1-\frac{\sum\limits^{k}_{i=1}\sum\limits^{k}_{j=1}(i-j)^{2}p_{ij}}{\sum\limits^{k}_{i=1}\sum\limits^{k}_{j=1}(i-j)^{2}p_{i+}p_{+j}}. $$
(4)
For many real-world data, quadratic kappa produces higher values than linear kappa (Warrens 2013). For example, for the data in Table 2 we have κl = 0.68 and κq = 0.77.
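As an illustration, the three kappa coefficients for Table 2 can be reproduced directly from Eq. 1 by plugging in the three weighting schemes. The Python sketch below (the helper name `weighted_kappa` is ours) recovers κ = 0.61, κl = 0.68, and κq = 0.77.

```python
# Agreement table of proportions from Table 2 (rows: observer 1, columns: observer 2).
P = [
    [0.03, 0.00, 0.00, 0.00],
    [0.00, 0.14, 0.00, 0.00],
    [0.00, 0.03, 0.49, 0.00],
    [0.00, 0.00, 0.20, 0.11],
]

def weighted_kappa(p, weight):
    """Eq. 1: 1 - (observed weighted disagreement / expected weighted disagreement)."""
    k = len(p)
    row = [sum(p[i]) for i in range(k)]                       # marginals p_i+
    col = [sum(p[i][j] for i in range(k)) for j in range(k)]  # marginals p_+j
    observed = sum(weight(i, j) * p[i][j] for i in range(k) for j in range(k))
    expected = sum(weight(i, j) * row[i] * col[j] for i in range(k) for j in range(k))
    return 1.0 - observed / expected

kappa   = weighted_kappa(P, lambda i, j: 1.0 if i != j else 0.0)  # unweighted, Eq. 2
kappa_l = weighted_kappa(P, lambda i, j: abs(i - j))              # linear, Eq. 3
kappa_q = weighted_kappa(P, lambda i, j: (i - j) ** 2)            # quadratic, Eq. 4
```

Passing a different weight function immediately yields other members of the weighted kappa family, which is the sense in which Eqs. 2, 3, and 4 are special cases of Eq. 1.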
In contrast to unweighted kappa, linear kappa in Eq. 3 and quadratic kappa in Eq. 4 allow that some disagreements are considered of greater gravity than others (Cohen 1968). For example, disagreements on categories that are adjacent in an ordinal scale are considered less serious than disagreements on categories that are further apart: the seriousness of disagreements is modeled with the weights. It should be noted that, with k = 2 categories, all special cases of weighted kappa in Eq. 1 with symmetric weighting schemes, e.g., linear and quadratic kappa, coincide with unweighted kappa (Warrens 2013).
The flexibility provided by weights to deal with the different degrees of disagreement could be considered a strength of linear kappa and quadratic kappa. However, the arbitrariness of the choice of weights is generally considered a weakness of the coefficient (Vanbelle 2016; Warrens 2012a, 2013, 2014; Vanbelle and Albert 2009; Crewson 2005; Maclure and Willett 1987). The assignment of weights can be very subjective, and studies in which different weighting schemes were used are generally not comparable (Kundel and Polansky 2003). Because of such perceived limitations of linear kappa and quadratic kappa, Tinsley and Weiss (2000) have recommended against the use of these coefficients. Soeken and Prescott (1986, p. 736) also recommend against the use of these coefficients: “because nonarbitrary assignment of weighting schemes is often very difficult to achieve, some psychometricians advocate avoiding such systems in absence of well-established theoretical criteria, due to the serious distortions they can create.”
3 Correlation Coefficients
Correlation coefficients are popular statistics for measuring agreement, or more generally association, on an interval scale. Various correlation coefficients can be defined using the rater means and variances, denoted by m1 and \({s^{2}_{1}}\) for the first rater and m2 and \({s^{2}_{2}}\) for the second rater, respectively, and the covariance between the raters, denoted by s12. To calculate these statistics, one could use the unit by rater table of size n × 2 associated with an agreement table (e.g., Tables 1 and 2), where an entry of the n × 2 table indicates to which of the k categories a unit (row) was assigned by the first and second raters (first and second columns, respectively). We will use consecutive integer values for coding the categories, i.e., the first category is coded as 1, the second category is coded as 2, and so on.
The Pearson correlation is given by
$$ r=\frac{s_{12}}{s_{1}s_{2}}. $$
(5)
The correlation in Eq. 5 is commonly used in statistics and data analysis, and is the most popular coefficient for quantifying linear association between two variables (Rodgers and Nicewander 1988). Furthermore, in factor analysis, the Pearson correlation is commonly used to quantify association between ordinal scales, in many cases 4-point or 5-point Likert-type scales.
The Spearman correlation is a nonparametric version of the Pearson correlation that measures the strength and direction of a monotonic relationship between the ratings. We will denote the Spearman correlation by ρ. The value of the Spearman correlation can be obtained by replacing the observed scores by rank scores and then using Eq. 5. The values of the Pearson and Spearman correlations are often quite close (De Winter et al. 2016; Mukaka 2012; Hauke and Kossowski 2011).
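The rank-replacement recipe can be sketched in plain Python as follows (the helper names `midranks`, `pearson`, and `spearman` are ours; tied scores receive their average rank, as is conventional):

```python
from statistics import mean

def midranks(x):
    """Replace scores by rank scores; tied values get their average (mid) rank."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    ranks = [0.0] * len(x)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
            j += 1  # extend the block of tied values
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1..j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Eq. 5: covariance divided by the product of the standard deviations."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank scores."""
    return pearson(midranks(x), midranks(y))
```

For example, two rating vectors that are perfectly monotonically related, even with ties, yield ρ = 1.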
A third correlation coefficient is intraclass correlation ICC(3,1) from Shrout and Fleiss (1979). This particular intraclass correlation is given by
$$ R=\text{ICC(3,1)}=\frac{2s_{12}}{{s_{1}^{2}}+{s_{2}^{2}}}. $$
(6)
Intraclass correlations are commonly used in agreement studies with interval ratings. The correlations in Eqs. 5 and 6 are identical if the raters have the same variance (i.e., \({s^{2}_{1}}={s^{2}_{2}}\)). If the rater variances differ, the Pearson correlation produces a higher value than the intraclass correlation (i.e., r > R). For example, for the data in Table 2, we have R = 0.81 and r = 0.83.
Quadratic kappa in Eq. 4 can also be expressed in terms of the rater means, variances, and the covariance between the raters. If the ratings (scores) are labeled as 1, 2, 3, and so on, quadratic kappa is given by (Schuster 2004; Schuster and Smith 2005)
$$ \kappa_{q}=\frac{2s_{12}}{{s_{1}^{2}}+{s_{2}^{2}}+\frac{n}{n-1}(m_{1}-m_{2})^{2}}. $$
(7)
Quadratic kappa in Eq. 7 may be interpreted as a proportion of variance (Schuster and Smith 2005; Schuster 2004; Fleiss and Cohen 1973). The coefficients in Eqs. 6 and 7 are identical if the rater means are equal (i.e., m1 = m2). If the rater means differ, the intraclass correlation produces a higher value than quadratic kappa (i.e., R > κq). For example, for the data in Table 2, we have κq = 0.77 and R = 0.81. Furthermore, if both the rater means and the rater variances are equal (i.e., m1 = m2 and \({s^{2}_{1}}={s^{2}_{2}}\)), the coefficients in Eqs. 5, 6, and 7 coincide.
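These identities are easy to check numerically. The sketch below rebuilds unit-level scores consistent with the rounded proportions in Table 2 (the cell counts out of n = 35 are our reconstruction) and evaluates Eqs. 5, 6, and 7 from the rater means, variances, and covariance, recovering r = 0.83, R = 0.81, and κq = 0.77.

```python
from statistics import mean

# Cells (rater1 score, rater2 score, count) consistent with Table 2 and n = 35.
cells = [(1, 1, 1), (2, 2, 5), (3, 2, 1), (3, 3, 17), (4, 3, 7), (4, 4, 4)]
x = [a for a, b, f in cells for _ in range(f)]  # rater 1 scores
y = [b for a, b, f in cells for _ in range(f)]  # rater 2 scores

n = len(x)
m1, m2 = mean(x), mean(y)
s1sq = sum((a - m1) ** 2 for a in x) / (n - 1)                   # sample variance, rater 1
s2sq = sum((b - m2) ** 2 for b in y) / (n - 1)                   # sample variance, rater 2
s12 = sum((a - m1) * (b - m2) for a, b in zip(x, y)) / (n - 1)   # sample covariance

r = s12 / (s1sq * s2sq) ** 0.5                                   # Pearson correlation, Eq. 5
R = 2 * s12 / (s1sq + s2sq)                                      # ICC(3,1), Eq. 6
kappa_q = 2 * s12 / (s1sq + s2sq + n / (n - 1) * (m1 - m2) ** 2) # quadratic kappa, Eq. 7
```

Note that the double inequality κq ≤ R ≤ r discussed in the fifth section holds for these data.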
Warrens (2014) showed that intraclass correlation ICC(3,1), the Pearson correlation, and the Spearman correlation (coefficients R, r, and ρ) are in fact special cases of the weighted kappa coefficient in Eq. 1, since the coefficients produce equal values if particular weighting schemes are used. The details of these particular weighting schemes can be found in Warrens (2014).
Linear and quadratic kappa (through their weighting schemes) and the Pearson, intraclass, and Spearman correlations (through the means, variances, and covariance of the raters) use a numerical system to quantify agreement between two raters. They use more information than just the order of the categories. In contrast, the Kendall rank correlation (Kendall 1955, 1962; Parker et al. 2013) is a non-parametric coefficient of ordinal association between two raters that only uses the order of the categories.
Let (xi, yi) and (xj, yj) be two rows of the unit by rater table of size n × 2. A pair of rows (xi, yi) and (xj, yj) is said to be concordant if either both xi > xj and yi > yj hold or both xi < xj and yi < yj hold; otherwise, the pair is said to be discordant. A pair of rows (xi, yi) and (xj, yj) is said to be tied if xi = xj or yi = yj. Furthermore, let nc denote the number of concordant pairs and nd the number of discordant pairs. Moreover, let n0 = n(n − 1)/2 be the total number of unit pairs, and define
$$ n_{1}={\sum}^{k}_{s=1}t_{s}(t_{s}-1)/2\qquad\text{and}\qquad n_{2}={\sum}^{k}_{s=1}u_{s}(u_{s}-1)/2, $$
(8)
where ts and us are the number of tied values associated with category s of raters 1 and 2, respectively. Kendall's tau-b is given by
$$ \tau_{b}=\frac{n_{c}-n_{d}}{\sqrt{(n_{0}-n_{1})(n_{0}-n_{2})}}. $$
(9)
The particular version of the Kendall rank correlation in Eq. 9 makes an adjustment for ties and is most suitable when both raters use the same number of possible values (Berry et al. 2009). Both conditions apply to the present study.
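A direct transcription of Eqs. 8 and 9 looks as follows (a sketch; the function name is ours, and the O(n²) pair loop is unproblematic for rating data of this size):

```python
from collections import Counter

def kendall_tau_b(x, y):
    """Kendall's tau-b, Eq. 9, with the tie corrections n1 and n2 of Eq. 8."""
    n = len(x)
    nc = nd = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            if prod > 0:
                nc += 1  # concordant pair
            elif prod < 0:
                nd += 1  # discordant pair (tied pairs contribute to neither count)
    n0 = n * (n - 1) // 2                                      # total number of unit pairs
    n1 = sum(t * (t - 1) // 2 for t in Counter(x).values())    # tied pairs, rater 1
    n2 = sum(u * (u - 1) // 2 for u in Counter(y).values())    # tied pairs, rater 2
    return (nc - nd) / ((n0 - n1) * (n0 - n2)) ** 0.5
```

For instance, for the ratings x = (1, 2, 3, 3) and y = (1, 2, 2, 3) there are four concordant pairs, no discordant pairs, and one tied pair per rater, giving τb = 4/5 = 0.8.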
The values of the Spearman and Kendall correlations can be different (Siegel and Castellan 1988; Xu et al. 2013). Although both coefficients range from −1.0 to +1.0, for most of this range, the absolute value of the Spearman correlation is empirically about 1.5 times that of the Kendall correlation (Kendall 1962).
4 Hypotheses
Before we present our hypotheses with regard to the research questions, we summarize several relevant results from Parker et al. (2013). These authors compared various reliability coefficients for ordinal rating scales, including linear kappa, quadratic kappa, and the Pearson and Kendall correlations, using simulated data. They investigated whether a fixed value, e.g., 0.60, has the same meaning across reliability coefficients and across rating scales with different numbers of categories. Among other things, Parker et al. (2013) reported the following results. Differences between the values of quadratic kappa and the Pearson and Kendall correlations were usually less than 0.15. Furthermore, the values of quadratic kappa and the Pearson and Kendall correlations, on the one hand, and linear kappa, on the other hand, were usually quite different. Moreover, differences between the coefficients depend on the number of categories considered. Differences tend to be smaller with two and three categories than with five or more categories. With two categories, the three kappa coefficients are identical (Warrens 2013).
With respect to the first research question (under what conditions do quadratic kappa and the Pearson and intraclass correlations produce similar values?), we have only general expectations, since these relationships have not been comprehensively studied. We expect that intraclass correlation ICC(3,1) will produce values similar to the Pearson correlation if the rater variances are similar, and values similar to quadratic kappa if the rater means are similar (Schuster 2004).
With regard to the second research question (to what extent do we reach the same conclusions about inter-rater reliability with different coefficients?) and the third research question (to what extent do the coefficients measure agreement in similar ways?), we hypothesize that the values of the Pearson and Spearman correlations are very similar (De Winter et al. 2016; Mukaka 2012; Hauke and Kossowski 2011). Furthermore, we hypothesize the values of the Spearman and Kendall correlations to be somewhat different (Kendall 1962; Siegel and Castellan 1988; Xu et al. 2013; Parker et al. 2013). In addition, we hypothesize that the values of the three kappa coefficients can be quite different (Warrens 2013). Combining some of the above expectations, we also expect the values of both unweighted kappa and linear kappa to be quite different from the values of the four correlation coefficients.
5 Analytical Comparison of Quadratic Kappa and the Pearson and Intraclass Correlations
The Pearson and Spearman correlations have been compared analytically by various authors (De Winter et al. 2016; Mukaka 2012; Hauke and Kossowski 2011). Furthermore, the three kappa coefficients have been compared analytically and empirically (Warrens 2011, 2013). For many real-world data, we can expect to observe the double inequality κ < κl < κq, i.e., quadratic kappa tends to produce a higher value than linear kappa, which in turn tends to produce a higher value than the unweighted kappa coefficient (Warrens 2011). Moreover, the values of the three kappa coefficients tend to be quite different (Warrens 2013).
To approach the first research question (under what conditions do quadratic kappa and the Pearson and intraclass correlations produce similar values?), we study in this section the differences between the three agreement coefficients. The relationships between these three coefficients have not been comprehensively studied. What is known is that, in general, we have the double inequality κq ≤ R ≤ r, i.e., quadratic kappa will never produce a higher value than the intraclass correlation, which in turn will never produce a higher value than the Pearson correlation (Schuster 2004). This inequality between the coefficients can be used to study the positive differences r − R, R − κq, and r − κq.
We first consider the difference between the Pearson and intraclass correlations. The positive difference between the two coefficients can be written as
$$ r-R=\frac{r(s_{1}-s_{2})^{2}}{{s^{2}_{1}}+{s^{2}_{2}}}. $$
(10)
The right-hand side of Eq. 10 consists of three quantities. We lose one parameter if we consider the ratio between the standard deviations
$$ c=\frac{\max(s_{1},s_{2})}{\min(s_{1},s_{2})}, $$
(11)
instead of the standard deviations separately. Using Eq. 11, we may write difference (10) as
$$ r-R=\frac{r(1-c)^{2}}{1+c^{2}}. $$
(12)
The first derivative of f(c) = (1 − c)²/(1 + c²) with respect to c is presented in Appendix 1. Since this derivative is strictly positive for c > 1, formula (12) shows that difference r − R is strictly increasing in both r and c. In other words, the difference between the Pearson and intraclass correlations increases (1) if agreement in terms of r increases, and (2) if the ratio between the standard deviations increases.
Table 3 gives the values of difference r − R for different values of r and ratio (11). The table shows that the difference between the Pearson and intraclass correlations is very small (≤ 0.05) if c ≤ 1.40, and is small (≤ 0.10) if c ≤ 1.60 or if r ≤ 0.50.
Table 3
Values of difference r − R for different values of r and ratio (11)
c \ r | 0.10 | 0.20 | 0.30 | 0.40 | 0.50 | 0.60 | 0.70 | 0.80 | 0.90 | 1.00 |
1.20 | 0.00 | 0.00 | 0.00 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.02 |
1.40 | 0.01 | 0.01 | 0.02 | 0.02 | 0.03 | 0.03 | 0.04 | 0.04 | 0.05 | 0.05 |
1.60 | 0.01 | 0.02 | 0.03 | 0.04 | 0.05 | 0.06 | 0.07 | 0.08 | 0.09 | 0.10 |
1.80 | 0.02 | 0.03 | 0.05 | 0.06 | 0.08 | 0.09 | 0.11 | 0.12 | 0.14 | 0.15 |
2.00 | 0.02 | 0.04 | 0.06 | 0.08 | 0.10 | 0.12 | 0.14 | 0.16 | 0.18 | 0.20 |
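The entries of Table 3 follow directly from Eq. 12, as the sketch below illustrates (the function name is ours):

```python
def diff_r_R(r, c):
    """Difference r - R as a function of r and the ratio c of the standard
    deviations, Eq. 12: r * (1 - c)^2 / (1 + c^2)."""
    return r * (1 - c) ** 2 / (1 + c ** 2)
```

For example, the bottom-right entry of Table 3 is diff_r_R(1.0, 2.0) = 0.20, and the difference is increasing in both arguments, as stated above.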
Next, we consider the difference between the intraclass correlation and quadratic kappa. The positive difference between the two coefficients can be written as
$$ R-\kappa_{q}=\frac{R}{g(\cdot)+1}, $$
(13)
where the function g(⋅) is given by
$$ g(n,m_{1},m_{2},s_{1},s_{2})=\frac{n-1}{n}\cdot\frac{{s^{2}_{1}}+{s^{2}_{2}}}{(m_{1}-m_{2})^{2}}. $$
(14)
A derivation of Eqs. 13 and 14 is presented in Appendix 2. The right-hand side of Eq. 13 shows that difference (13) is increasing in R and decreasing in the function g(⋅). Hence, the difference between the intraclass correlation and quadratic kappa increases if agreement in terms of R increases. Since the ratio (n − 1)/n is close to unity for moderate to large sample sizes, quantity (14) is approximately equal to the ratio of the sum of the two variances (i.e., \({s^{2}_{1}}+{s^{2}_{2}}\)) to the squared difference between the rater means (i.e., (m1 − m2)²). Quantity (14) increases if one of the rater variances becomes larger, and decreases if the difference between the rater means increases.
Tables 4 and 5 give the values of difference R − κq for different values of intraclass correlation R and mean difference |m1 − m2|, for fixed values of \({s^{2}_{1}}+{s^{2}_{2}}\) and n = 100. Table 4 contains the values of R − κq when the sum of the rater variances is equal to unity (i.e., \({s^{2}_{1}}+{s^{2}_{2}}=1\)). Table 5 presents the values of the difference when \({s^{2}_{1}}+{s^{2}_{2}}=2\).
Table 4
Values of difference R − κq for different values of R and |m1 − m2|, and \({s^{2}_{1}}+{s^{2}_{2}}=1\)
|m1 − m2| \ R | 0.10 | 0.20 | 0.30 | 0.40 | 0.50 | 0.60 | 0.70 | 0.80 | 0.90 | 1.00 |
0.10 | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
0.20 | 0.00 | 0.01 | 0.01 | 0.02 | 0.02 | 0.02 | 0.03 | 0.03 | 0.03 | 0.04 |
0.30 | 0.01 | 0.02 | 0.03 | 0.03 | 0.04 | 0.05 | 0.06 | 0.07 | 0.08 | 0.08 |
0.40 | 0.01 | 0.03 | 0.04 | 0.06 | 0.07 | 0.08 | 0.10 | 0.11 | 0.13 | 0.14 |
0.50 | 0.02 | 0.04 | 0.06 | 0.08 | 0.10 | 0.12 | 0.14 | 0.16 | 0.18 | 0.20 |
Table 5
Values of difference R − κq for different values of R and |m1 − m2|, and \({s^{2}_{1}}+{s^{2}_{2}}=2\)
|m1 − m2| \ R | 0.10 | 0.20 | 0.30 | 0.40 | 0.50 | 0.60 | 0.70 | 0.80 | 0.90 | 1.00 |
0.10 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 |
0.20 | 0.00 | 0.00 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.02 | 0.02 | 0.02 |
0.30 | 0.00 | 0.01 | 0.01 | 0.02 | 0.02 | 0.03 | 0.03 | 0.03 | 0.04 | 0.04 |
0.40 | 0.01 | 0.01 | 0.02 | 0.03 | 0.04 | 0.04 | 0.05 | 0.06 | 0.07 | 0.07 |
0.50 | 0.01 | 0.02 | 0.03 | 0.04 | 0.06 | 0.07 | 0.08 | 0.09 | 0.10 | 0.11 |
Tables 4 and 5 show that the difference between the intraclass correlation and quadratic kappa is very small (≤ 0.04) if \({s^{2}_{1}}+{s^{2}_{2}}=1\) and |m1 − m2| ≤ 0.20 or R ≤ 0.20, or if \({s^{2}_{1}}+{s^{2}_{2}}=2\) and |m1 − m2| ≤ 0.30 or R ≤ 0.40. Furthermore, the difference between the coefficients is small (≤ 0.10) if \({s^{2}_{1}}+{s^{2}_{2}}=1\) and |m1 − m2| ≤ 0.30 or R ≤ 0.50, or if \({s^{2}_{1}}+{s^{2}_{2}}=2\) and |m1 − m2| ≤ 0.40 or R ≤ 0.90.
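The entries of Tables 4 and 5 can be recomputed from Eqs. 13 and 14 in the same spirit (a sketch; the function and argument names are ours):

```python
def diff_R_kq(R, mean_diff, var_sum, n=100):
    """Difference R - kappa_q via Eqs. 13 and 14, for sample size n,
    mean difference |m1 - m2|, and variance sum s1^2 + s2^2."""
    g = (n - 1) / n * var_sum / mean_diff ** 2  # quantity g, Eq. 14
    return R / (g + 1)                          # difference, Eq. 13
```

For example, the bottom-right entries of Tables 4 and 5 are diff_R_kq(1.0, 0.5, 1.0) = 0.20 and diff_R_kq(1.0, 0.5, 2.0) = 0.11, respectively.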
Finally, we consider the difference between the Pearson correlation and quadratic kappa. The positive difference between the two coefficients can be written as
$$ r-\kappa_{q}=r\cdot h(\cdot), $$
(15)
where the function h(⋅) is given by
$$ h(n,m_{1},m_{2},s_{1},s_{2})=\frac{(s_{1}-s_{2})^{2}+\frac{n}{n-1}(m_{1}-m_{2})^{2}}{{s^{2}_{1}}+{s^{2}_{2}}+\frac{n}{n-1}(m_{1}-m_{2})^{2}}. $$
(16)
The right-hand side of Eq. 15 shows that difference (15) is increasing in r and in the function h(⋅). Hence, the difference between the Pearson correlation and quadratic kappa increases if agreement in terms of r increases. Quantity (16) is a rather complex function that involves the rater means as well as the rater variances. Since the inequality \((s_{1}-s_{2})^{2}\leq {s^{2}_{1}}+{s^{2}_{2}}\) holds, quantity (16) and difference (15) increase if the difference between the rater means increases.
To understand the difference r − κq in more detail, it is insightful to consider two special cases. If the rater means are equal (i.e., m1 = m2), the intraclass correlation coincides with quadratic kappa (i.e., R = κq) and difference r − κq is equal to difference r − R. Thus, in the special case that the rater means are equal, all conditions discussed above for difference r − R also apply to difference r − κq. Furthermore, if the rater variances are equal (i.e., \({s^{2}_{1}}={s^{2}_{2}}\)), the Pearson and intraclass correlations coincide (i.e., r = R) and difference r − κq is equal to difference R − κq. If we set s = s1 = s2 and use 2s2 instead of \({s^{2}_{1}}+{s^{2}_{2}}\), then all conditions discussed above for difference R − κq also apply to difference r − κq.
Difference (15) is equal to the sum of differences (10) and (13), i.e.,
$$ r-\kappa_{q}=r-R+R-\kappa_{q}=\frac{r(1-c)^{2}}{1+c^{2}}+\frac{R}{g(\cdot)+1}, $$
(17)
where quantity c is given in Eq. 11 and function g(⋅) in Eq. 14. Identity (17) shows that, to understand difference (15), it suffices to understand the differences r − R and R − κq. Apart from the overall level of agreement, difference r − R depends on the rater variances, whereas difference R − κq depends primarily on the rater means.
Identity (17) also shows that we may combine the various conditions that hold for differences (10) and (13) to obtain new conditions for difference (15). For example, combining the numbers in Tables 3, 4, and 5, we find that difference (15) is small (≤ 0.09) if c ≤ 1.40 and, in addition, \({s^{2}_{1}}+{s^{2}_{2}}=1\) and |m1 − m2| ≤ 0.20 or R ≤ 0.20, or \({s^{2}_{1}}+{s^{2}_{2}}=2\) and |m1 − m2| ≤ 0.30 or R ≤ 0.40.
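Identity (17) can also be verified numerically on the Table 2 statistics used throughout (a sketch; the unit-level cell counts are our reconstruction from the rounded proportions): the two components computed from c (Eq. 11) and g(⋅) (Eq. 14) equal r − R and R − κq, and they sum to r − κq.

```python
from statistics import mean

# Unit-by-rater scores consistent with Table 2 (n = 35): (rater1, rater2, count).
cells = [(1, 1, 1), (2, 2, 5), (3, 2, 1), (3, 3, 17), (4, 3, 7), (4, 4, 4)]
x = [a for a, _, f in cells for _ in range(f)]
y = [b for _, b, f in cells for _ in range(f)]

n = len(x)
m1, m2 = mean(x), mean(y)
s1 = (sum((a - m1) ** 2 for a in x) / (n - 1)) ** 0.5  # sd of rater 1
s2 = (sum((b - m2) ** 2 for b in y) / (n - 1)) ** 0.5  # sd of rater 2
s12 = sum((a - m1) * (b - m2) for a, b in zip(x, y)) / (n - 1)

r = s12 / (s1 * s2)                                               # Eq. 5
R = 2 * s12 / (s1 ** 2 + s2 ** 2)                                 # Eq. 6
kq = 2 * s12 / (s1 ** 2 + s2 ** 2 + n / (n - 1) * (m1 - m2) ** 2) # Eq. 7

c = max(s1, s2) / min(s1, s2)                            # ratio c, Eq. 11
g = (n - 1) / n * (s1 ** 2 + s2 ** 2) / (m1 - m2) ** 2   # quantity g, Eq. 14
term1 = r * (1 - c) ** 2 / (1 + c ** 2)                  # r - R, Eq. 12
term2 = R / (g + 1)                                      # R - kq, Eq. 13
```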
With regard to the first research question, the analyses in this section can be summarized as follows. In general, differences between quadratic kappa and the Pearson and intraclass correlations increase if agreement becomes larger. Differences between the three coefficients are generally small if differences between rater means and variances are relatively small. However, if differences between rater means and variances are substantial, differences between the values of the three coefficients are small only if agreement between raters is small.
6 A Simulation Study
6.1 Data Generation
In this section, we compare all seven reliability coefficients using simulated ordinal rating data. We carried out a number of simulations under different conditions, according to the following procedure. In each scenario, we sampled scores for 200 units from a bivariate normal distribution, using the mvrnorm function in R (R Core Team 2019). The two variables correspond to the two raters. To obtain categorical agreement data, we discretized the variables into five categories: values smaller than −1.0 were coded as 1, values equal to or greater than −1.0 and smaller than −0.4 were coded as 2, values equal to or greater than −0.4 and smaller than 0.4 were coded as 3, values equal to or greater than 0.4 and smaller than 1.0 were coded as 4, and values equal to or greater than 1.0 were coded as 5. For a standardized variable, this coding scheme corresponds to a unimodal and symmetric distribution with probabilities 0.16, 0.18, 0.32, 0.18, and 0.16 for categories 1, 2, 3, 4, and 5, respectively. Thus, the middle category is a bit more popular in the case of a standardized variable. Finally, the values of the seven reliability coefficients were calculated using the discretized data. The above steps were repeated 10,000 times (denoted by 10K for short) in each condition.
For the simulations, we differentiated between various conditions. The mvrnorm function in R allows the user to specify the means and the covariance matrix of the bivariate normal distribution. We generated data with either a high (0.80) or medium (0.40) value of the Pearson correlation (i.e., high or medium agreement). Furthermore, we varied the rater means and the rater variances. Either both rater means were set to 0 (i.e., equal rater means), or we set one mean to 0 and the other to 0.5 (i.e., unequal rater means). Moreover, we either set both rater variances to 1 (i.e., equal rater variances), or we set the variances to 0.69 and 1.44 (i.e., unequal rater variances). Fully crossed, the simulation design consists of 8 (= 2 × 2 × 2) conditions. These eight conditions were chosen to illustrate some of the findings from the previous section. Notice that with both variances equal to 1, ratio (11) is also equal to 1. If the variances are equal to 0.69 and 1.44, ratio (11) is equal to 1.44.
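One replication of this procedure can be sketched in plain Python as follows (a stand-in for the R workflow described above; the names `discretize` and `simulate` are ours, and the correlated pair is built from two independent standard normal draws rather than mvrnorm):

```python
import random
from bisect import bisect_right

CUTS = [-1.0, -0.4, 0.4, 1.0]  # cutpoints given in the text

def discretize(v):
    """Values below -1.0 -> 1, [-1.0, -0.4) -> 2, [-0.4, 0.4) -> 3,
    [0.4, 1.0) -> 4, and 1.0 or above -> 5."""
    return bisect_right(CUTS, v) + 1

def simulate(n=200, rho=0.40, means=(0.0, 0.5), sds=(1.0, 1.0), seed=1):
    """One replication: correlated bivariate normal scores, then discretized
    ratings. sds are standard deviations (e.g., variances 0.69 and 1.44
    correspond to sds of about 0.83 and 1.20)."""
    rng = random.Random(seed)
    rater1, rater2 = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = means[0] + sds[0] * z1
        y = means[1] + sds[1] * (rho * z1 + (1 - rho ** 2) ** 0.5 * z2)
        rater1.append(discretize(x))
        rater2.append(discretize(y))
    return rater1, rater2
```

The seven coefficients would then be computed on the pair of rating vectors returned by each replication.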
6.2 Comparison Criteria
To answer the second research question (to what extent do we reach the same conclusions about inter-rater reliability with different coefficients?), we compare the values of the coefficients in an absolute sense. If the differences between the values (within one replication of the simulation study) are small (≤ 0.10), we conclude that the coefficients lead to the same decision in practice. Of course, the value 0.10 is somewhat arbitrary, but we think it is a useful criterion for many real-world applications. To quantify how often we reach the same conclusion, we use the ratio of the number of simulations in which the values lead to the same conclusion (maximum difference between the values less than or equal to 0.10) to the total number of simulations (= 10K). To answer the third research question (to what extent do the coefficients measure agreement in similar ways?), Pearson correlations between the coefficient values are used to assess how similarly the coefficients measure agreement in this simulation study.
6.3 Results of the Simulation Study
Tables 6 and 7 give two statistics that we will use to assess the similarity between the coefficients for the simulated data. Both tables consist of four subtables, each associated with one of the simulated conditions. Table 6 contains the four subtables associated with the high agreement condition, whereas Table 7 contains the four subtables associated with the medium agreement condition. The upper panel of each subtable gives the Pearson correlations between the coefficient values over all 10,000 simulations. The lower panel of each subtable contains, for each pair of coefficients, the proportion of simulations in which the values lead to the same conclusion about inter-rater reliability (difference between the values at most 0.10) out of the total number of simulations (= 10K).
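Assuming the coefficient values are stored as a replication-by-coefficient matrix, the two panels can be computed as sketched below (invented names and toy data, not the authors' code):

```python
import numpy as np

def similarity_panels(values, tol=0.10):
    """Compute the two panels of Tables 6 and 7 from a matrix of
    coefficient values with shape (n_replications, n_coefficients)."""
    values = np.asarray(values)
    n_rep, n_coef = values.shape
    # Upper panel: Pearson correlations between the coefficient values.
    corr = np.corrcoef(values, rowvar=False)
    # Lower panel: proportion of replications in which each pair of
    # coefficients differs by at most tol (same practical decision).
    same = np.empty((n_coef, n_coef))
    for i in range(n_coef):
        for j in range(n_coef):
            same[i, j] = np.mean(np.abs(values[:, i] - values[:, j]) <= tol)
    return corr, same

# Toy example: three replications of two coefficients.
vals = np.array([[0.70, 0.72],
                 [0.75, 0.90],
                 [0.80, 0.83]])
corr, same = similarity_panels(vals)
print(same[0, 1])  # 2 of 3 replications within 0.10
```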
Table 6
Correlations and number of times the same decision will be reached for the values of the agreement coefficients for the simulated data, for the high agreement condition
1. Equal rater means and variances |
| κ | κl | κq | R | r | ρ | τb |
κ | | 0.89 | 0.68 | 0.68 | 0.68 | 0.65 | 0.72 |
κl | 0/10K | | 0.94 | 0.94 | 0.94 | 0.91 | 0.95 |
κq | 0/10K | 0/10K | | 1.00 | 1.00 | 0.98 | 0.99 |
R | 0/10K | 0/10K | 10K/10K | | 1.00 | 0.98 | 0.99 |
r | 0/10K | 0/10K | 10K/10K | 10K/10K | | 0.98 | 0.99 |
ρ | 0/10K | 0/10K | 10K/10K | 10K/10K | 10K/10K | | 0.99 |
τb | 0/10K | 9043/10K | 7636/10K | 7237/10K | 6956/10K | 8941/10K | |
2. Equal rater means, unequal rater variances |
| κ | κl | κq | R | r | ρ | τb |
κ | | 0.88 | 0.66 | 0.66 | 0.64 | 0.59 | 0.65 |
κl | 0/10K | | 0.94 | 0.94 | 0.92 | 0.88 | 0.91 |
κq | 0/10K | 0/10K | | 1.00 | 0.99 | 0.96 | 0.96 |
R | 0/10K | 0/10K | 10K/10K | | 0.99 | 0.96 | 0.99 |
r | 0/10K | 0/10K | 10K/10K | 10K/10K | | 0.98 | 0.99 |
ρ | 0/10K | 0/10K | 10K/10K | 10K/10K | 10K/10K | | 0.99 |
τb | 0/10K | 3133/10K | 9965/10K | 9949/10K | 9101/10K | 9515/10K | |
3. Unequal rater means, equal rater variances |
| κ | κl | κq | R | r | ρ | τb |
κ | | 0.85 | 0.61 | 0.49 | 0.49 | 0.42 | 0.45 |
κl | 0/10K | | 0.93 | 0.81 | 0.81 | 0.76 | 0.77 |
κq | 0/10K | 0/10K | | 0.91 | 0.91 | 0.87 | 0.86 |
R | 0/10K | 0/10K | 9352/10K | | 1.00 | 0.97 | 0.98 |
r | 0/10K | 0/10K | 9200/10K | 10K/10K | | 0.97 | 0.98 |
ρ | 0/10K | 0/10K | 8657/10K | 10K/10K | 10K/10K | | 0.99 |
τb | 0/10K | 11/10K | 10K/10K | 9419/10K | 9256/10K | 9498/10K | |
4. Unequal rater means and variances |
| κ | κl | κq | R | r | ρ | τb |
κ | | 0.85 | 0.63 | 0.53 | 0.52 | 0.43 | 0.46 |
κl | 0/10K | | 0.94 | 0.84 | 0.83 | 0.77 | 0.78 |
κq | 0/10K | 0/10K | | 0.92 | 0.92 | 0.88 | 0.87 |
R | 0/10K | 0/10K | 9880/10K | | 0.99 | 0.95 | 0.95 |
r | 0/10K | 0/10K | 9616/10K | 10K/10K | | 0.96 | 0.97 |
ρ | 0/10K | 0/10K | 9158/10K | 10K/10K | 10K/10K | | 0.99 |
τb | 0/10K | 7/10K | 10K/10K | 9901/10K | 9389/10K | 9818/10K | |
Table 7
Correlations and number of times the same decision will be reached for the values of the agreement coefficients for the simulated data, for the medium agreement condition
5. Equal rater means and variances |
| κ | κl | κq | R | r | ρ | τb |
κ | | 0.79 | 0.54 | 0.54 | 0.54 | 0.53 | 0.56 |
κl | 1256/10K | | 0.93 | 0.93 | 0.93 | 0.92 | 0.94 |
κq | 26/10K | 1447/10K | | 1.00 | 1.00 | 0.99 | 0.99 |
R | 24/10K | 1370/10K | 10K/10K | | 1.00 | 0.99 | 0.99 |
r | 24/10K | 1347/10K | 10K/10K | 10K/10K | | 0.99 | 0.99 |
ρ | 32/10K | 1804/10K | 10K/10K | 10K/10K | 10K/10K | | 1.00 |
τb | 218/10K | 9876/10K | 9993/10K | 9987/10K | 9987/10K | 9995/10K | |
6. Equal rater means, unequal rater variances |
| κ | κl | κq | R | r | ρ | τb |
κ | | 0.78 | 0.53 | 0.53 | 0.53 | 0.51 | 0.53 |
κl | 1363/10K | | 0.93 | 0.93 | 0.93 | 0.92 | 0.93 |
κq | 19/10K | 1427/10K | | 1.00 | 1.00 | 0.99 | 0.99 |
R | 19/10K | 1348/10K | 10K/10K | | 1.00 | 0.99 | 0.99 |
r | 15/10K | 905/10K | 10K/10K | 10K/10K | | 0.99 | 0.99 |
ρ | 23/10K | 1306/10K | 10K/10K | 10K/10K | 10K/10K | | 1.00 |
τb | 153/10K | 9534/10K | 10K/10K | 10K/10K | 9993/10K | 9999/10K | |
7. Unequal rater means, equal rater variances |
| κ | κl | κq | R | r | ρ | τb |
κ | | 0.76 | 0.48 | 0.47 | 0.47 | 0.44 | 0.46 |
κl | 2533/10K | | 0.92 | 0.90 | 0.90 | 0.88 | 0.89 |
κq | 70/10K | 3109/10K | | 0.98 | 0.98 | 0.96 | 0.96 |
R | 18/10K | 517/10K | 9998/10K | | 1.00 | 0.98 | 0.98 |
r | 17/10K | 502/10K | 9998/10K | 10K/10K | | 0.98 | 0.98 |
ρ | 30/10K | 756/10K | 9995/10K | 10K/10K | 10K/10K | | 1.00 |
τb | 194/10K | 7304/10K | 10K/10K | 9977/10K | 9972/10K | 9999/10K | |
8. Unequal rater means and variances |
| κ | κl | κq | R | r | ρ | τb |
κ | | 0.77 | 0.49 | 0.48 | 0.47 | 0.44 | 0.46 |
κl | 2205/10K | | 0.92 | 0.90 | 0.90 | 0.88 | 0.89 |
κq | 62/10K | 2589/10K | | 0.98 | 0.98 | 0.96 | 0.96 |
R | 20/10K | 591/10K | 10K/10K | | 1.00 | 0.98 | 0.98 |
r | 19/10K | 446/10K | 10K/10K | 10K/10K | | 0.98 | 0.98 |
ρ | 28/10K | 733/10K | 9997/10K | 10K/10K | 10K/10K | | 1.00 |
τb | 161/10K | 6886/10K | 10K/10K | 9981/10K | 9959/10K | 10K/10K | |
Consider the lower panels of the subtables of Tables 6 and 7 first. In all cases, we will come to the same conclusion with the intraclass, Pearson, and Spearman correlations (10K/10K). Hence, for these simulated data, it does not really matter which of these correlation coefficients is used. Furthermore, with medium agreement (Table 7), we will almost always reach the same conclusion with the intraclass, Pearson, and Spearman correlations, on the one hand, and the Kendall correlation, on the other hand. When agreement is high (Table 6), we will still reach the same conclusion in a substantial number of cases.
If rater means are equal (the two top subtables of Tables 6 and 7), the quadratic kappa, the intraclass correlation, and the Pearson correlation coincide (see previous section), and we will come to the same conclusion with quadratic kappa and the three correlation coefficients (10K/10K). If rater means are unequal (the two bottom subtables of Tables 6 and 7), the quadratic kappa is not identical to the intraclass and Pearson correlations, but we will still reach the same conclusion with quadratic kappa and the four correlation coefficients in many cases.
The values of unweighted kappa and linear kappa differ strikingly from those of quadratic kappa and the four correlation coefficients. If there is high agreement (Table 6), we never come to the same conclusion with unweighted kappa and linear kappa (0/10K in all four subtables). Furthermore, with high agreement, we will generally not reach the same conclusion about inter-rater reliability with unweighted kappa and linear kappa, on the one hand, and the other five coefficients, on the other hand. If there is medium agreement (Table 7), the values of the seven coefficients tend to lie somewhat closer to one another, but we will still come to the same conclusion in relatively few replications.
Next, consider the upper panels of the subtables of Tables 6 and 7. The correlations between the intraclass, Pearson, Spearman, and Kendall correlations are very high (≥ 0.95) in general, and almost perfect (≥ 0.98) if agreement is medium. These four correlation coefficients may produce different values but tend to measure agreement in a similar way. The correlations between quadratic kappa and the correlation coefficients are very high (≥ 0.96) in the case of medium agreement, or if high agreement is combined with equal rater means. In the case of high agreement and unequal rater means, the values drop a bit (0.86–0.92). All in all, it seems that quadratic kappa measures agreement in a very similar way to the correlation coefficients for these simulated data. All other correlations are substantially lower.
With regard to the second research question, we will reach the same conclusion about inter-rater reliability in most simulated replications with any of the correlation coefficients (intraclass, Pearson, Spearman, or Kendall). Furthermore, with quadratic kappa we reach the same conclusion as with any of the correlation coefficients in a great number of replications. Unweighted kappa and linear kappa generally produce different (much lower) values than the other five coefficients. If there is medium agreement, the values of the seven coefficients tend to lie somewhat closer to one another than if agreement is high.
With regard to the third research question, the four correlation coefficients tend to measure agreement in a similar way: their values are very highly correlated in this simulation study. Furthermore, quadratic kappa is highly correlated with all four correlation coefficients as well for these simulated data.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.