Published in: Discover Computing 2/2013

01.04.2013 | Crowd Sourcing

An analysis of human factors and label accuracy in crowdsourcing relevance judgments

Authors: Gabriella Kazai, Jaap Kamps, Natasa Milic-Frayling


Abstract

Crowdsourcing relevance judgments for the evaluation of search engines is used increasingly to overcome the issue of scalability that hinders traditional approaches relying on a fixed group of trusted expert judges. However, the benefits of crowdsourcing come with risks due to the engagement of a self-forming group of individuals—the crowd, motivated by different incentives, who complete the tasks with varying levels of attention and success. This increases the need for a careful design of crowdsourcing tasks that attracts the right crowd for the given task and promotes quality work. In this paper, we describe a series of experiments using Amazon’s Mechanical Turk, conducted to explore the ‘human’ characteristics of the crowds involved in a relevance assessment task. In the experiments, we vary the level of pay offered, the effort required to complete a task and the qualifications required of the workers. We observe the effects of these variables on the quality of the resulting relevance labels, measured based on agreement with a gold set, and correlate them with self-reported measures of various human factors. We elicit information from the workers about their motivations, interest and familiarity with the topic, perceived task difficulty, and satisfaction with the offered pay. We investigate how these factors combine with aspects of the task design and how they affect the accuracy of the resulting relevance labels. Based on the analysis of 960 HITs and 2,880 HIT assignments resulting in 19,200 relevance labels, we arrive at insights into the complex interaction of the observed factors and provide practical guidelines to crowdsourcing practitioners. In addition, we highlight challenges in the data analysis that stem from the peculiarity of the crowdsourcing environment, where the sample of individuals engaged in specific work conditions is inherently influenced by the conditions themselves.
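As a concrete illustration of the quality measure used throughout the paper (agreement of crowd labels with a gold set), the following minimal sketch computes per-worker accuracy and the accuracy of majority-vote consensus labels. The data layout, field names and the majority-vote aggregation are illustrative assumptions, not the authors' actual pipeline.

```python
# Illustrative sketch (not the authors' code): label accuracy measured as
# agreement with a gold set, per worker and after majority voting.
from collections import Counter, defaultdict

# Hypothetical crowd labels: (worker_id, document_id, relevance_label)
crowd_labels = [
    ("w1", "d1", 1), ("w2", "d1", 1), ("w3", "d1", 0),
    ("w1", "d2", 0), ("w2", "d2", 1), ("w3", "d2", 0),
]
# Hypothetical gold (expert) labels per document.
gold = {"d1": 1, "d2": 0}

# Per-worker accuracy: fraction of a worker's labels that match the gold set.
per_worker = defaultdict(lambda: [0, 0])  # worker -> [correct, total]
for worker, doc, label in crowd_labels:
    per_worker[worker][1] += 1
    per_worker[worker][0] += int(label == gold[doc])
for worker, (correct, total) in sorted(per_worker.items()):
    print(f"{worker}: accuracy {correct / total:.2f}")

# Consensus accuracy: majority vote over the labels collected per document.
votes = defaultdict(list)
for _, doc, label in crowd_labels:
    votes[doc].append(label)
majority = {doc: Counter(ls).most_common(1)[0][0] for doc, ls in votes.items()}
consensus_acc = sum(majority[d] == g for d, g in gold.items()) / len(gold)
print(f"majority-vote accuracy: {consensus_acc:.2f}")
```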


Footnotes
5
These pay levels are based on preliminary experiments in which pay was increased until a satisfactory uptake was reached.
 
6
Some implementations test significance against the null hypothesis that the ratings are completely random (kappa = 0); our data pass this test in all cases.
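As an illustration of the agreement statistic behind this footnote, the sketch below computes Fleiss' kappa (Fleiss 1971) from a subjects-by-categories count table and checks the kappa = 0 null with a simple permutation test. The table values are made up, and the permutation test is an assumed stand-in for whatever significance test a given implementation uses.

```python
# Illustrative sketch: Fleiss' kappa for multiple raters (Fleiss 1971) and a
# simple permutation check of the null hypothesis that ratings are completely
# random (kappa = 0). The permutation test is an assumption for illustration,
# not necessarily the test used by the implementations the footnote refers to.
import numpy as np

def fleiss_kappa(counts):
    """counts: (n_subjects, n_categories) array; counts[i, j] = number of
    raters assigning subject i to category j (equal rater total per subject)."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=1)[0]                      # raters per subject
    p_j = counts.sum(axis=0) / counts.sum()        # overall category proportions
    P_i = (np.square(counts).sum(axis=1) - n) / (n * (n - 1))
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()
    return (P_bar - P_e) / (1 - P_e)

rng = np.random.default_rng(0)
# Hypothetical table: 8 documents, 3 raters each, 2 categories (non-rel, rel).
table = np.array([[3, 0], [0, 3], [1, 2], [0, 3],
                  [3, 0], [2, 1], [0, 3], [3, 0]])
observed = fleiss_kappa(table)

# Permutation check: redistribute each subject's ratings uniformly at random.
n_subjects, n_cat = table.shape
n_raters = int(table.sum(axis=1)[0])
null = [fleiss_kappa(np.array([np.bincount(rng.integers(0, n_cat, n_raters),
                                           minlength=n_cat)
                               for _ in range(n_subjects)]))
        for _ in range(2000)]
p_value = float(np.mean(np.asarray(null) >= observed))
print(f"kappa = {observed:.3f}, permutation p-value vs. random = {p_value:.3f}")
```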
 
7
We do not attempt to model the exact distribution. Instead, we look at the appropriate discrete power-law fit based on Clauset et al. (2009). This power-law fit for all data is from 31 HITs onwards (i.e., x_min = 31, α = 3.08, D = 0.07229) due to the discontinuity at the tail (spammers doing larger numbers of HITs than expected). For cleaned data, the fit is much better, starting from 4 HITs onwards (i.e., x_min = 4, α = 1.93, D = 0.08019).
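One way to reproduce such a discrete power-law fit, following Clauset et al. (2009), is the third-party Python powerlaw package; the sketch below estimates x_min, α and the Kolmogorov-Smirnov distance D for a placeholder list of per-worker HIT counts (the values shown are not the study's data).

```python
# Sketch of a discrete power-law fit in the style of Clauset et al. (2009),
# using the third-party `powerlaw` package (pip install powerlaw).
# hits_per_worker holds placeholder counts, not the study's data.
import powerlaw

hits_per_worker = [1, 1, 1, 2, 2, 3, 4, 4, 5, 7, 9, 12, 18, 31, 47, 80]

# discrete=True treats the observations as integer counts; x_min is chosen by
# minimising the Kolmogorov-Smirnov distance D between the data and the fit.
fit = powerlaw.Fit(hits_per_worker, discrete=True)
print(f"xmin = {fit.power_law.xmin}, alpha = {fit.power_law.alpha:.2f}, "
      f"D = {fit.power_law.D:.4f}")

# Clauset et al. also recommend comparing against alternative distributions;
# R > 0 with a small p favours the power law over the exponential.
R, p = fit.distribution_compare('power_law', 'exponential')
print(f"power law vs. exponential: R = {R:.2f}, p = {p:.3f}")
```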
 
8
Rejecting workers based on their gold-set agreement would have led to an artificial bias in the accuracy of the ‘cleaned set’, since the gold set would then no longer be an independent test of the resulting quality of work.
 
References
Alonso, O., & Baeza-Yates, R. A. (2011). Design and implementation of relevance assessments using crowdsourcing. In Advances in information retrieval—33rd European conference on IR research (ECIR 2011), LNCS, Vol. 6611 (pp. 153–164). Springer.
Alonso, O., & Mizzaro, S. (2009). Can we get rid of TREC assessors? Using Mechanical Turk for relevance assessment. In Proceedings of the SIGIR 2009 workshop on the future of IR evaluation (pp. 557–566).
Alonso, O., Rose, D. E., & Stewart, B. (2008). Crowdsourcing for relevance evaluation. SIGIR Forum, 42(2), 9–15.
Alonso, O., Schenkel, R., & Theobald, M. (2010). Crowdsourcing assessments for XML ranked retrieval. In Advances in information retrieval, 32nd European conference on IR research (ECIR 2010), LNCS, Vol. 5993 (pp. 602–606). Springer.
Bailey, P., Craswell, N., Soboroff, I., Thomas, P., de Vries, A. P., & Yilmaz, E. (2008). Relevance assessment: Are judges exchangeable and does it matter. In SIGIR ’08: Proceedings of the 31st annual international ACM SIGIR conference (pp. 667–674). New York, NY: ACM.
Behrend, T. S., Sharek, D. J., Meade, A. W., & Wiebe, E. N. (2011). The viability of crowdsourcing for survey research. Behavior Research Methods.
Carterette, B., & Soboroff, I. (2010). The effect of assessor error on IR system evaluation. In Proceedings of the 33rd international ACM SIGIR conference on research and development in information retrieval, SIGIR ’10 (pp. 539–546). New York, NY: ACM.
Case, K. E., Fair, R. C., & Oster, S. C. (2011). Principles of economics (10th ed.). Englewood Cliffs, NJ: Prentice-Hall.
Clauset, A., Shalizi, C. R., & Newman, M. E. J. (2009). Power-law distributions in empirical data. SIAM Review, 51(4), 661–703.
Cleverdon, C. W. (1967). The Cranfield tests on index language devices. Aslib, 19, 173–192.
Cormack, G. V., Palmer, C. R., & Clarke, C. L. A. (1998). Efficient construction of large test collections. In Proceedings of the 21st annual international ACM SIGIR conference on research and development in information retrieval, SIGIR ’98 (pp. 282–289). New York, NY: ACM.
Doan, A., Ramakrishnan, R., & Halevy, A. Y. (2011). Crowdsourcing systems on the World-Wide Web. Communications of the ACM, 54, 86–96.
Downs, J. S., Holbrook, M. B., Sheng, S., & Cranor, L. F. (2010). Are your participants gaming the system? Screening Mechanical Turk workers. In Proceedings of the 28th international conference on human factors in computing systems (CHI ’10) (pp. 2399–2402). ACM.
Eickhoff, C., & de Vries, A. P. (2011). How crowdsourcable is your task? In Proceedings of the workshop on crowdsourcing for search and data mining (CSDM 2011) (pp. 11–14). ACM.
Feild, H., Jones, R., Miller, R. C., Nayak, R., Churchill, E. F., & Velipasaoglu, E. (2010). Logging the search self-efficacy of Amazon Mechanical Turkers. In M. Lease, V. Carvalho, & E. Yilmaz (Eds.), Proceedings of the ACM SIGIR 2010 workshop on crowdsourcing for search evaluation (CSE 2010) (pp. 27–30). Geneva, Switzerland.
Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76, 378–382.
Grady, C., & Lease, M. (2010). Crowdsourcing document relevance assessment with Mechanical Turk. In Proceedings of the NAACL HLT 2010 workshop on creating speech and language data with Amazon’s Mechanical Turk (pp. 172–179).
Grimes, C., Tang, D., & Russell, D. M. (2007). Query logs alone are not enough. In E. Amitay, C. G. Murray, & J. Teevan (Eds.), Query log analysis: Social and technological challenges. A workshop at the 16th International World Wide Web Conference (WWW 2007).
Hirth, M., Hoßfeld, T., & Tran-Gia, P. (2011). Anatomy of a crowdsourcing platform—using the example of microworkers.com. In Workshop on future internet and next generation networks (FINGNet). Seoul, Korea.
Howe, J. (2008). Crowdsourcing: Why the power of the crowd is driving the future of business. New York, NY: Crown Publishing Group.
Ipeirotis, P. G. (2010b). Analyzing the Amazon Mechanical Turk marketplace. XRDS, 17, 16–21.
Ipeirotis, P. G., Provost, F., & Wang, J. (2010). Quality management on Amazon Mechanical Turk. In Proceedings of the ACM SIGKDD workshop on human computation, HCOMP ’10 (pp. 64–67). New York, NY: ACM.
Jain, S., & Parkes, D. C. (2009). The role of game theory in human computation systems. In Proceedings of the ACM SIGKDD workshop on human computation, HCOMP ’09 (pp. 58–61). New York, NY: ACM.
Kamps, J., Koolen, M., & Trotman, A. (2009). Comparative analysis of clicks and judgments for IR evaluation. In Proceedings of the workshop on web search click data (WSCD 2009) (pp. 80–87). New York, NY: ACM Press.
Kapelner, A., & Chandler, D. (2010). Preventing satisficing in online surveys: A ‘kapcha’ to ensure higher quality data. In The world’s first conference on the future of distributed work (CrowdConf 2010).
Kasneci, G., Van Gael, J., Herbrich, R., & Graepel, T. (2010). Bayesian knowledge corroboration with logical rules and user feedback. In Proceedings of the 2010 European conference on machine learning and knowledge discovery in databases: Part II, ECML PKDD ’10 (pp. 1–18). Berlin: Springer.
Kazai, G. (2011). In search of quality in crowdsourcing for search engine evaluation. In Advances in information retrieval—33rd European conference on IR research (ECIR 2011), LNCS, Vol. 6611 (pp. 165–176). Springer.
Kazai, G., Doucet, A., & Landoni, M. (2008). Overview of the INEX 2008 book track. In INEX (pp. 106–123).
Kazai, G., Kamps, J., Koolen, M., & Milic-Frayling, N. (2011a). Crowdsourcing for book search evaluation: Impact of quality on comparative system ranking. In Proceedings of the 34th annual international ACM SIGIR conference on research and development in information retrieval. ACM.
Kazai, G., Kamps, J., & Milic-Frayling, N. (2011b). Worker types and personality traits in crowdsourcing relevance labels. In Proceedings of the 20th ACM international conference on information and knowledge management (pp. 1941–1944). ACM.
Kazai, G., Milic-Frayling, N., & Costello, J. (2009). Towards methods for the collective gathering and quality control of relevance assessments. In Proceedings of the 32nd international ACM SIGIR conference on research and development in information retrieval, SIGIR ’09 (pp. 452–459). New York, NY: ACM.
Kittur, A., Chi, E. H., & Suh, B. (2008). Crowdsourcing user studies with Mechanical Turk. In Proceedings of the twenty-sixth annual SIGCHI conference on human factors in computing systems (CHI ’08) (pp. 453–456). ACM.
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174.
Le, J., Edmonds, A., Hester, V., & Biewald, L. (2010). Ensuring quality in crowdsourced search relevance evaluation. In V. Carvalho, M. Lease, & E. Yilmaz (Eds.), SIGIR workshop on crowdsourcing for search evaluation (pp. 17–20). New York, NY: ACM.
Lease, M. (2011). On quality control and machine learning in crowdsourcing. In Proceedings of the 3rd human computation workshop (HCOMP) at AAAI (pp. 97–102).
Lease, M., & Kazai, G. (2011). Overview of the TREC 2011 crowdsourcing track. In Proceedings of the Text REtrieval Conference (TREC).
Marsden, P. (2009). Crowdsourcing. Contagious Magazine, 18, 24–28.
Mason, W., & Suri, S. (2011). Conducting behavioral research on Amazon’s Mechanical Turk. Behavior Research Methods.
Mason, W., & Watts, D. J. (2009). Financial incentives and the “performance of crowds”. In HCOMP ’09: Proceedings of the ACM SIGKDD workshop on human computation (pp. 77–85). New York, NY: ACM.
Nowak, S., & Rüger, S. (2010). How reliable are annotations via crowdsourcing: A study about inter-annotator agreement for multi-label image annotation. In MIR ’10: Proceedings of the international conference on multimedia information retrieval (pp. 557–566). New York, NY: ACM.
Oppenheim, A. N. (1966). Questionnaire design and attitude measurement. London: Heinemann.
Quinn, A. J., & Bederson, B. B. (2009). A taxonomy of distributed human computation. Technical Report HCIL-2009-23. University of Maryland.
Quinn, A. J., & Bederson, B. B. (2011). Human computation: A survey and taxonomy of a growing field. In Proceedings of CHI 2011.
Radlinski, F., Kurup, M., & Joachims, T. (2008). How does clickthrough data reflect retrieval quality? In J. G. Shanahan, S. Amer-Yahia, I. Manolescu, Y. Zhang, D. A. Evans, A. Kolcz, K. S. Choi, & A. Chowdhury (Eds.), CIKM (pp. 43–52). ACM.
Ross, J., Irani, L., Silberman, M. S., Zaldivar, A., & Tomlinson, B. (2010). Who are the crowdworkers? Shifting demographics in Mechanical Turk. In Proceedings of the 28th international conference on human factors in computing systems, CHI 2010, extended abstracts volume (pp. 2863–2872). ACM.
Shaw, A., Horton, J., & Chen, D. (2011). Designing incentives for inexpert human raters. In Proceedings of the ACM conference on computer supported cooperative work, CSCW ’11.
Silberman, M. S., Ross, J., Irani, L., & Tomlinson, B. (2010). Sellers’ problems in human computation markets. In Proceedings of the ACM SIGKDD workshop on human computation (HCOMP ’10) (pp. 18–21). ACM.
Snow, R., O’Connor, B., Jurafsky, D., & Ng, A. Y. (2008). Cheap and fast—but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of the conference on empirical methods in natural language processing (EMNLP ’08) (pp. 254–263). ACL.
von Ahn, L., & Dabbish, L. (2004). Labeling images with a computer game. In Proceedings of the SIGCHI conference on human factors in computing systems, CHI ’04 (pp. 319–326). New York, NY: ACM.
Voorhees, E. M. (2000). Variations in relevance judgments and the measurement of retrieval effectiveness. Information Processing & Management, 36(5), 697–716.
Voorhees, E. M., & Harman, D. K. (Eds.). (2005). TREC: Experiment and evaluation in information retrieval. Cambridge, MA: MIT Press.
Vuurens, J., de Vries, A. P., & Eickhoff, C. (2011). How much spam can you take? An analysis of crowdsourcing results to increase accuracy. In M. Lease, V. Hester, A. Sorokin, & E. Yilmaz (Eds.), Proceedings of the ACM SIGIR 2011 workshop on crowdsourcing for information retrieval (CIR 2011) (pp. 48–55). Beijing, China.
Welinder, P., Branson, S., Belongie, S., & Perona, P. (2010). The multidimensional wisdom of crowds. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel, & A. Culotta (Eds.), Advances in neural information processing systems (NIPS ’10) (pp. 2424–2432).
Zhu, D., & Carterette, B. (2010). An analysis of assessor behavior in crowdsourced preference judgments. In SIGIR 2010 workshop on crowdsourcing for search evaluation.
Metadata
Title
An analysis of human factors and label accuracy in crowdsourcing relevance judgments
Authors
Gabriella Kazai
Jaap Kamps
Natasa Milic-Frayling
Publication date
01.04.2013
Publisher
Springer Netherlands
Published in
Discover Computing / Issue 2/2013
Print ISSN: 2948-2984
Electronic ISSN: 2948-2992
DOI
https://doi.org/10.1007/s10791-012-9205-0
