Published in: Quality & Quantity 3/2014

01.05.2014

Intercoder reliability indices: disuse, misuse, and abuse

Author: Guangchao Charles Feng

Abstract

Although intercoder reliability has been considered crucial to the validity of a content study, the choice among the many available indices has been controversial. This study analyzed all content studies published in two major communication journals that reported intercoder reliability, to examine how scholars conduct intercoder reliability tests. The results revealed that over the past 30 years some intercoder reliability indices have been persistently misused with respect to the levels of measurement, the number of coders, and the means of reporting reliability. Implications of misuse, disuse, and abuse are discussed, and suggestions regarding the proper choice of indices in various situations are offered.


Footnotes
1
Coders may also be called annotators, judges, raters, observers, classifiers, and so on, depending on the research field. Intercoder and interrater are used interchangeably throughout the paper.
 
2
When the reliability value is far lower than percent agreement, e.g., percent agreement is above 0.8 while the reliability value is close to or below 0, this may indicate that the marginal distribution is too skewed.
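A minimal worked example (not from the paper, but illustrating this footnote): suppose two coders each code 100 units into two categories, agree on 90 of them, and each assigns the rare category to only 5 units. Percent agreement is \(p_{o}=0.90\), yet a chance-corrected index such as Cohen's \(\kappa\) gives
\[
p_{e} = 0.95^{2} + 0.05^{2} = 0.905, \qquad \kappa = \frac{p_{o}-p_{e}}{1-p_{e}} = \frac{0.90-0.905}{1-0.905} \approx -0.05,
\]
because the heavily skewed marginals push expected agreement above observed agreement.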
 
3
It is identical to the \(S\) coefficient of Bennett et al. (1954).
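For orientation (a standard statement of the coefficient, not quoted from the paper): with \(q\) categories and observed agreement \(p_{o}\), \(S\) corrects for chance under a uniform distribution over categories,
\[
S = \frac{p_{o} - 1/q}{1 - 1/q} = \frac{q\,p_{o} - 1}{q - 1}.
\]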
 
4
As Lombard et al. (2002) argued, the proportion of studies using percent agreement was probably underestimated, because most of the studies recorded as "NA" likely used percent agreement.
 
5
Both have multiple-coder versions proposed by other scholars: Fleiss (1971) extended \(\pi \), while Light (1971) and Conger (1980) proposed multiple-coder versions of \(\kappa \).
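As a reference point (the standard formulation, not reproduced from the paper), Fleiss's extension for \(N\) units, \(n\) coders per unit, and \(n_{ij}\) coders assigning unit \(i\) to category \(j\) is
\[
\kappa_{F} = \frac{\bar{P}-\bar{P}_{e}}{1-\bar{P}_{e}}, \qquad \bar{P} = \frac{1}{N}\sum_{i=1}^{N}\frac{\sum_{j} n_{ij}^{2}-n}{n(n-1)}, \qquad \bar{P}_{e} = \sum_{j} p_{j}^{2}, \quad p_{j} = \frac{1}{Nn}\sum_{i=1}^{N} n_{ij}.
\]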
 
6
Cohen (1968) later proposed weighted \(\kappa \) for ordinal ratings. Krippendorff's (2004a) \(\alpha \) can be applied to all levels of measurement. Some indices, such as ICCs, are applicable only to interval ratings, while others, such as \(I_{r}\), Brennan and Prediger's (1981) \(\kappa \), and \(\pi \), have no higher-level counterparts.
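A sketch of why \(\alpha \) spans all levels of measurement (standard definitions, given here for orientation): \(\alpha = 1 - D_{o}/D_{e}\), where observed disagreement \(D_{o}\) and expected disagreement \(D_{e}\) are computed from a difference function matched to the level of measurement, e.g.,
\[
\delta^{2}_{ck} = \begin{cases} 0 & c = k\\ 1 & c \ne k \end{cases} \quad \text{(nominal)}, \qquad \delta^{2}_{ck} = (c-k)^{2} \quad \text{(interval)}.
\]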
 
7
Although there is a consensus that percent agreement, including Holsti's method, generally overestimates reliability because it makes no allowance for chance agreement, its use for nominal-scale codings is not considered misuse. The rationale is explained below.
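The overestimation follows from the common structure of chance-corrected indices (a standard formulation, shown for orientation): with observed agreement \(A_{o}\) and expected chance agreement \(A_{e}\),
\[
\text{reliability} = \frac{A_{o}-A_{e}}{1-A_{e}},
\]
whereas percent agreement reports \(A_{o}\) alone; whenever \(A_{e}>0\) and \(A_{o}<1\), the uncorrected \(A_{o}\) exceeds the chance-corrected value.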
 
8
Whether standard errors should be reported for obtained reliability values is still debated in the literature; therefore, not reporting standard errors is not treated as a problem here.
 
9
There are many modeling approaches, such as log-linear, IRT (item response theory), latent class, and mixture modeling. In a separate study by the author, log-linear modeling was found to be no better than most indices.
 
10
Although variables with binary outcomes belong to the nominal level, most indices behave more alike on binary and interval variables than on multi-category nominal ones.
 
References
Artstein, R., Poesio, M.: Inter-coder agreement for computational linguistics. Comput. Linguist. 34(4), 555–596 (2008)
Brennan, R., Prediger, D.: Coefficient kappa: some uses, misuses, and alternatives. Educ. Psychol. Meas. 41(3), 687 (1981)
Cicchetti, D., Feinstein, A.: High agreement but low kappa: II. Resolving the paradoxes. J. Clin. Epidemiol. 43(6), 551–558 (1990)
Cronbach, L.: Coefficient alpha and the internal structure of tests. Psychometrika 16(3), 297–334 (1951)
Fleiss, J.: Measuring nominal scale agreement among many raters. Psychol. Bull. 76(5), 378–382 (1971)
Gwet, K.: Inter-rater reliability: dependency on trait prevalence and marginal homogeneity. Stat. Methods Inter-Rater Reliab. Assess. Ser. 2, 1–9 (2002)
Gwet, K.: Computing inter-rater reliability and its variance in the presence of high agreement. Br. J. Math. Stat. Psychol. 61(1), 29–48 (2008)
Gwet, K.: Handbook of Inter-Rater Reliability: A Definitive Guide to Measuring the Extent of Agreement Among Multiple Raters. Advanced Analytics LLC, Gaithersburg (2010)
Holsti, O.: Content Analysis for the Social Sciences and Humanities. Addison-Wesley, Reading, MA (1969)
Kolbe, R.H., Burnett, M.S.: Content-analysis research: an examination of applications with directives for improving research reliability and objectivity. J. Consum. Res. 18(2), 243–250 (1991). http://www.jstor.org/stable/2489559
Krippendorff, K.: Content Analysis: An Introduction to Its Methodology, 2nd edn. Sage, Thousand Oaks (2004a)
Krippendorff, K.: A dissenting view on so-called paradoxes of reliability coefficients. In: Salmon, C.T. (ed.) Communication Yearbook, vol. 36, pp. 481–500. Routledge, New York (2012)
Light, R.J.: Measures of response agreement for qualitative data: some generalizations and alternatives. Psychol. Bull. 76(5), 365–377 (1971)
Lin, L.: A concordance correlation coefficient to evaluate reproducibility. Biometrics 45(1), 255 (1989)
Lombard, M., Snyder-Duch, J.: Content analysis in mass communication: assessment and reporting of intercoder reliability. Hum. Commun. Res. 28(4), 587–604 (2002)
Osgood, C.: The representational model and relevant research methods. In: de Sola Pool, I. (ed.) Trends in Content Analysis, pp. 33–88. University of Illinois Press, Champaign (1959)
Riffe, D., Lacy, S., Fico, F.: Analyzing Media Messages: Using Quantitative Content Analysis in Research. Lawrence Erlbaum Associates, New Jersey (2005)
Scott, W.: Reliability of content analysis: the case of nominal scale coding. Public Opin. Q. 19, 321–325 (1955). doi:10.1086/266577
Spiegelman, M., Terwilliger, C., Fearing, F.: The reliability of agreement in content analysis. J. Soc. Psychol. 37, 175–187 (1953)
Zhao, X.: A Reliability Index (ai) that Assumes Honest Coders and Variable Randomness. Association for Education in Journalism and Mass Communication, Chicago (2012)
Zhao, X., Liu, J.S., Deng, K.: Assumptions behind inter-coder reliability indices. In: Salmon, C.T. (ed.) Communication Yearbook, vol. 36, pp. 419–480. Routledge, New York (2012)
Metadata
Title
Intercoder reliability indices: disuse, misuse, and abuse
Author
Guangchao Charles Feng
Publication date
01.05.2014
Publisher
Springer Netherlands
Published in
Quality & Quantity / Issue 3/2014
Print ISSN: 0033-5177
Electronic ISSN: 1573-7845
DOI
https://doi.org/10.1007/s11135-013-9956-8
