Published in: Quality & Quantity 3/2014

01-05-2014

Intercoder reliability indices: disuse, misuse, and abuse

Author: Guangchao Charles Feng

Abstract

Although intercoder reliability has been considered crucial to the validity of a content study, the choice among the many available indices has been controversial. This study analyzed all content studies published in two major communication journals that reported intercoder reliability, aiming to determine how scholars conduct intercoder reliability tests. The results revealed that, over the past 30 years, some intercoder reliability indices were persistently misused with respect to the level of measurement, the number of coders, and the means of reporting reliability. Implications of misuse, disuse, and abuse are discussed, and suggestions regarding the proper choice of indices in various situations are offered.


Footnotes
1
Coders may also be called annotators, judges, raters, observers, or classifiers, depending on the research field. Intercoder and interrater are used interchangeably throughout the paper.
 
2
When the reliability value is far lower than the value of percent agreement (e.g., percent agreement is higher than 0.8 while the reliability value is close to or below 0), this may indicate that the marginal distribution is too skewed.
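This paradox can be reproduced with a short numeric sketch (an illustration, not from the paper): two hypothetical coders classify 100 items into two categories with heavily skewed marginals, yielding high percent agreement but a slightly negative Cohen's \(\kappa \).

```python
# Illustrative sketch: Cohen's kappa vs. percent agreement under a
# heavily skewed marginal distribution. Hypothetical confusion counts
# for two coders rating 100 items; rows = coder 1, columns = coder 2.
table = {("A", "A"): 90, ("A", "B"): 5,
         ("B", "A"): 5,  ("B", "B"): 0}

n = sum(table.values())
cats = ["A", "B"]

# Observed (percent) agreement: proportion of items on the diagonal.
p_o = sum(table[(c, c)] for c in cats) / n

# Expected chance agreement from the coders' marginal proportions.
marg1 = {c: sum(table[(c, d)] for d in cats) / n for c in cats}
marg2 = {c: sum(table[(d, c)] for d in cats) / n for c in cats}
p_e = sum(marg1[c] * marg2[c] for c in cats)

kappa = (p_o - p_e) / (1 - p_e)
print(f"percent agreement = {p_o:.2f}")    # 0.90
print(f"Cohen's kappa     = {kappa:.3f}")  # -0.053
```

Because both coders use category "A" 95 % of the time, chance agreement (0.905) exceeds observed agreement (0.90), so the chance-corrected index turns negative despite near-perfect raw agreement.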
 
3
It is identical to Bennett et al.'s (1954) \(S\) coefficient.
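For reference, Bennett et al.'s (1954) \(S\) corrects percent agreement for chance under the assumption that all \(k\) categories are equally likely. A minimal sketch (the formula is standard; the input values are hypothetical):

```python
# Bennett et al.'s (1954) S coefficient: chance-corrected agreement
# assuming a uniform distribution over the k categories, i.e.
#   S = (p_o - 1/k) / (1 - 1/k) = (k * p_o - 1) / (k - 1).
def bennett_s(p_o: float, k: int) -> float:
    return (k * p_o - 1) / (k - 1)

# Hypothetical values: 90% observed agreement over 2 categories.
print(bennett_s(0.90, 2))  # 0.8
```

Unlike \(\pi \) or \(\kappa \), \(S\) does not depend on the observed marginal distributions, which is why it is unaffected by the skewed-marginal paradox noted in footnote 2.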
 
4
As Lombard et al. (2002) argued, the proportion of studies using percent agreement was probably underestimated because most "NAs" would actually have adopted percent agreement.
 
5
They have corresponding multiple-coder versions proposed by other scholars. For instance, Fleiss (1971) extended \(\pi \), while Conger (1980) and Light (1971) proposed multiple-coder versions of \(\kappa \).
 
6
Cohen (1968) later proposed weighted \(\kappa \) for ordinal ratings. Krippendorff's (2004a) \(\alpha \) can be applied to all levels of measurement. Some indices, such as ICCs, are only applicable to interval ratings, while others, such as \(I_{r}\), Brennan and Prediger's (1981) \(\kappa \), and \(\pi \), do not have higher-level counterparts.
 
7
Although there is a consensus that percent agreement, including Holsti's variant, generally overestimates reliability because it makes no allowance for chance agreement, it is not considered misuse when used for nominal-scale codings. The rationale is explained below.
 
8
Whether standard errors should be reported for the obtained reliability value is still debated in the literature. Therefore, not reporting standard errors is not treated as a problem here.
 
9
There are many modeling approaches, such as log-linear, IRT (item response theory), latent class, and mixture modeling. In a separate study by the author, the log-linear modeling approach was found to be no better than most indices.
 
10
Although variables with binary outcomes belong to the nominal level of measurement, for most indices binary variables share more characteristics with interval variables.
 
Literature
Artstein, R., Poesio, M.: Inter-coder agreement for computational linguistics. Comput. Linguist. 34(4), 555–596 (2008)
Brennan, R., Prediger, D.: Coefficient kappa: some uses, misuses, and alternatives. Educ. Psychol. Meas. 41(3), 687 (1981)
Cicchetti, D., Feinstein, A.: High agreement but low kappa: II. Resolving the paradoxes. J. Clin. Epidemiol. 43(6), 551–558 (1990)
Cronbach, L.: Coefficient alpha and the internal structure of tests. Psychometrika 16(3), 297–334 (1951)
Fleiss, J.: Measuring nominal scale agreement among many raters. Psychol. Bull. 76(5), 378–382 (1971)
Gwet, K.: Inter-rater reliability: dependency on trait prevalence and marginal homogeneity. Stat. Methods Inter-Rater Reliab. Assess. Ser. 2, 1–9 (2002)
Gwet, K.: Computing inter-rater reliability and its variance in the presence of high agreement. Br. J. Math. Stat. Psychol. 61(1), 29–48 (2008)
Gwet, K.: Handbook of Inter-Rater Reliability: A Definitive Guide to Measuring the Extent of Agreement Among Multiple Raters. Advanced Analytics LLC, Gaithersburg (2010)
Holsti, O.: Content Analysis for the Social Sciences and Humanities. Addison-Wesley, Reading (1969)
Krippendorff, K.: Content Analysis: An Introduction to Its Methodology, 2nd edn. Sage, Thousand Oaks (2004a)
Krippendorff, K.: A dissenting view on so-called paradoxes of reliability coefficients. In: Salmon, C.T. (ed.) Communication Yearbook, vol. 36, pp. 481–500. Routledge, New York (2012)
Light, R.J.: Measures of response agreement for qualitative data: some generalizations and alternatives. Psychol. Bull. 76(5), 365–377 (1971)
Lin, L.: A concordance correlation coefficient to evaluate reproducibility. Biometrics 45(1), 255 (1989)
Lombard, M., Snyder-Duch, J.: Content analysis in mass communication: assessment and reporting of intercoder reliability. Hum. Commun. Res. 28(4), 587–604 (2002)
Osgood, C.: The representational model and relevant research methods. In: de Sola Pool, I. (ed.) Trends in Content Analysis, pp. 33–88. University of Illinois Press, Champaign (1959)
Riffe, D., Lacy, S., Fico, F.: Analyzing Media Messages: Using Quantitative Content Analysis in Research. Lawrence Erlbaum Associates, Mahwah (2005)
Scott, W.: Reliability of content analysis: the case of nominal scale coding. Public Opin. Q. 19, 321–325 (1955). doi:10.1086/266577
Spiegelman, M., Terwilliger, C., Fearing, F.: The reliability of agreement in content analysis. J. Soc. Psychol. 37, 175–187 (1953)
Zhao, X.: A Reliability Index (ai) that Assumes Honest Coders and Variable Randomness. Association for Education in Journalism and Mass Communication, Chicago (2012)
Zhao, X., Liu, J.S., Deng, K.: Assumptions behind inter-coder reliability indices. In: Salmon, C.T. (ed.) Communication Yearbook, vol. 36, pp. 419–480. Routledge, New York (2012)
Metadata
Title
Intercoder reliability indices: disuse, misuse, and abuse
Author
Guangchao Charles Feng
Publication date
01-05-2014
Publisher
Springer Netherlands
Published in
Quality & Quantity / Issue 3/2014
Print ISSN: 0033-5177
Electronic ISSN: 1573-7845
DOI
https://doi.org/10.1007/s11135-013-9956-8
