Published in: Discover Computing 2/2013

01.04.2013 | Crowd Sourcing

An analysis of human factors and label accuracy in crowdsourcing relevance judgments

Authors: Gabriella Kazai, Jaap Kamps, Natasa Milic-Frayling


Abstract

Crowdsourcing relevance judgments for the evaluation of search engines is used increasingly to overcome the issue of scalability that hinders traditional approaches relying on a fixed group of trusted expert judges. However, the benefits of crowdsourcing come with risks due to the engagement of a self-forming group of individuals—the crowd, motivated by different incentives, who complete the tasks with varying levels of attention and success. This increases the need for a careful design of crowdsourcing tasks that attracts the right crowd for the given task and promotes quality work. In this paper, we describe a series of experiments using Amazon’s Mechanical Turk, conducted to explore the ‘human’ characteristics of the crowds involved in a relevance assessment task. In the experiments, we vary the level of pay offered, the effort required to complete a task and the qualifications required of the workers. We observe the effects of these variables on the quality of the resulting relevance labels, measured based on agreement with a gold set, and correlate them with self-reported measures of various human factors. We elicit information from the workers about their motivations, interest and familiarity with the topic, perceived task difficulty, and satisfaction with the offered pay. We investigate how these factors combine with aspects of the task design and how they affect the accuracy of the resulting relevance labels. Based on the analysis of 960 HITs and 2,880 HIT assignments resulting in 19,200 relevance labels, we arrive at insights into the complex interaction of the observed factors and provide practical guidelines to crowdsourcing practitioners. In addition, we highlight challenges in the data analysis that stem from the peculiarity of the crowdsourcing environment, where the sample of individuals engaged in specific work conditions is inherently influenced by the conditions themselves.
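As a concrete illustration of the quality measure used throughout the paper (agreement of crowd labels with a gold set), the following minimal sketch computes per-worker accuracy and the accuracy of majority-vote consensus labels. The data layout, field names and the majority-vote aggregation are illustrative assumptions, not the authors' actual pipeline.

```python
# Illustrative sketch (not the authors' code): label accuracy measured as
# agreement with a gold set, per worker and after majority voting.
from collections import Counter, defaultdict

# Hypothetical crowd labels: (worker_id, document_id, relevance_label)
crowd_labels = [
    ("w1", "d1", 1), ("w2", "d1", 1), ("w3", "d1", 0),
    ("w1", "d2", 0), ("w2", "d2", 1), ("w3", "d2", 0),
]
# Hypothetical gold (expert) labels per document.
gold = {"d1": 1, "d2": 0}

# Per-worker accuracy: fraction of a worker's labels that match the gold set.
per_worker = defaultdict(lambda: [0, 0])  # worker -> [correct, total]
for worker, doc, label in crowd_labels:
    per_worker[worker][1] += 1
    per_worker[worker][0] += int(label == gold[doc])
for worker, (correct, total) in sorted(per_worker.items()):
    print(f"{worker}: accuracy {correct / total:.2f}")

# Consensus accuracy: majority vote over the labels collected per document.
votes = defaultdict(list)
for _, doc, label in crowd_labels:
    votes[doc].append(label)
majority = {doc: Counter(ls).most_common(1)[0][0] for doc, ls in votes.items()}
consensus_acc = sum(majority[d] == g for d, g in gold.items()) / len(gold)
print(f"majority-vote accuracy: {consensus_acc:.2f}")
```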


Footnotes
5
These pay levels are based on preliminary experiments in which pay was increased until a satisfactory uptake was reached.
 
6
Some implementations test significance against the null hypothesis that the ratings are completely random (kappa = 0); our data pass this test in all cases.
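As an illustration of the agreement statistic behind this footnote, the sketch below computes Fleiss' kappa (Fleiss 1971) from a subjects-by-categories count table and checks the kappa = 0 null with a simple permutation test. The table values are made up, and the permutation test is an assumed stand-in for whatever significance test a given implementation uses.

```python
# Illustrative sketch: Fleiss' kappa for multiple raters (Fleiss 1971) and a
# simple permutation check of the null hypothesis that ratings are completely
# random (kappa = 0). The permutation test is an assumption for illustration,
# not necessarily the test used by the implementations the footnote refers to.
import numpy as np

def fleiss_kappa(counts):
    """counts: (n_subjects, n_categories) array; counts[i, j] = number of
    raters assigning subject i to category j (equal rater total per subject)."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=1)[0]                      # raters per subject
    p_j = counts.sum(axis=0) / counts.sum()        # overall category proportions
    P_i = (np.square(counts).sum(axis=1) - n) / (n * (n - 1))
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()
    return (P_bar - P_e) / (1 - P_e)

rng = np.random.default_rng(0)
# Hypothetical table: 8 documents, 3 raters each, 2 categories (non-rel, rel).
table = np.array([[3, 0], [0, 3], [1, 2], [0, 3],
                  [3, 0], [2, 1], [0, 3], [3, 0]])
observed = fleiss_kappa(table)

# Permutation check: redistribute each subject's ratings uniformly at random.
n_subjects, n_cat = table.shape
n_raters = int(table.sum(axis=1)[0])
null = [fleiss_kappa(np.array([np.bincount(rng.integers(0, n_cat, n_raters),
                                           minlength=n_cat)
                               for _ in range(n_subjects)]))
        for _ in range(2000)]
p_value = float(np.mean(np.asarray(null) >= observed))
print(f"kappa = {observed:.3f}, permutation p-value vs. random = {p_value:.3f}")
```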
 
7
We do not attempt to model the exact distribution. Instead, we look at the appropriate discrete power-law fit based on Clauset et al. (2009). This power-law fit for all data is from 31 HITs onwards (i.e., x_min = 31, α = 3.08, D = 0.07229) due to the discontinuity at the tail (spammers doing larger numbers of HITs than expected). For cleaned data, the fit is much better, starting from 4 HITs onwards (i.e., x_min = 4, α = 1.93, D = 0.08019).
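One way to reproduce such a discrete power-law fit, following Clauset et al. (2009), is the third-party Python powerlaw package; the sketch below estimates x_min, α and the Kolmogorov-Smirnov distance D for a placeholder list of per-worker HIT counts (the values shown are not the study's data).

```python
# Sketch of a discrete power-law fit in the style of Clauset et al. (2009),
# using the third-party `powerlaw` package (pip install powerlaw).
# hits_per_worker holds placeholder counts, not the study's data.
import powerlaw

hits_per_worker = [1, 1, 1, 2, 2, 3, 4, 4, 5, 7, 9, 12, 18, 31, 47, 80]

# discrete=True treats the observations as integer counts; x_min is chosen by
# minimising the Kolmogorov-Smirnov distance D between the data and the fit.
fit = powerlaw.Fit(hits_per_worker, discrete=True)
print(f"xmin = {fit.power_law.xmin}, alpha = {fit.power_law.alpha:.2f}, "
      f"D = {fit.power_law.D:.4f}")

# Clauset et al. also recommend comparing against alternative distributions;
# R > 0 with a small p favours the power law over the exponential.
R, p = fit.distribution_compare('power_law', 'exponential')
print(f"power law vs. exponential: R = {R:.2f}, p = {p:.3f}")
```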
 
8
Rejecting workers based on their gold-set agreement would have led to an artificial bias in the accuracy of the ‘cleaned set’, since the gold set would then no longer be an independent test of the resulting quality of work.
 
References
Alonso, O., & Baeza-Yates, R. A. (2011). Design and implementation of relevance assessments using crowdsourcing. In Advances in information retrieval—33rd European conference on IR research (ECIR 2011), LNCS, Vol. 6611 (pp. 153–164). Springer.
Alonso, O., & Mizzaro, S. (2009). Can we get rid of TREC assessors? Using Mechanical Turk for relevance assessment. In Proceedings of the SIGIR 2009 workshop on the future of IR evaluation (pp. 557–566).
Alonso, O., Rose, D. E., & Stewart, B. (2008). Crowdsourcing for relevance evaluation. SIGIR Forum, 42(2), 9–15.
Alonso, O., Schenkel, R., & Theobald, M. (2010). Crowdsourcing assessments for XML ranked retrieval. In Advances in information retrieval, 32nd European conference on IR research (ECIR 2010), LNCS, Vol. 5993 (pp. 602–606). Springer.
Bailey, P., Craswell, N., Soboroff, I., Thomas, P., de Vries, A. P., & Yilmaz, E. (2008). Relevance assessment: Are judges exchangeable and does it matter. In SIGIR ’08: Proceedings of the 31st annual international ACM SIGIR conference (pp. 667–674). New York, NY: ACM.
Behrend, T. S., Sharek, D. J., Meade, A. W., & Wiebe, E. N. (2011). The viability of crowdsourcing for survey research. Behavior Research Methods.
Carterette, B., & Soboroff, I. (2010). The effect of assessor error on IR system evaluation. In Proceedings of the 33rd international ACM SIGIR conference on research and development in information retrieval, SIGIR ’10 (pp. 539–546). New York, NY: ACM.
Case, K. E., Fair, R. C., & Oster, S. C. (2011). Principles of economics (10th ed.). Englewood Cliffs, NJ: Prentice-Hall.
Clauset, A., Shalizi, C. R., & Newman, M. E. J. (2009). Power-law distributions in empirical data. SIAM Review, 51(4), 661–703.
Cleverdon, C. W. (1967). The Cranfield tests on index language devices. Aslib, 19, 173–192.
Cormack, G. V., Palmer, C. R., & Clarke, C. L. A. (1998). Efficient construction of large test collections. In Proceedings of the 21st annual international ACM SIGIR conference on research and development in information retrieval, SIGIR ’98 (pp. 282–289). New York, NY: ACM.
Doan, A., Ramakrishnan, R., & Halevy, A. Y. (2011). Crowdsourcing systems on the World-Wide Web. Communications of the ACM, 54, 86–96.
Downs, J. S., Holbrook, M. B., Sheng, S., & Cranor, L. F. (2010). Are your participants gaming the system? Screening Mechanical Turk workers. In Proceedings of the 28th international conference on human factors in computing systems (CHI ’10) (pp. 2399–2402). ACM.
Eickhoff, C., & de Vries, A. P. (2011). How crowdsourcable is your task? In Proceedings of the workshop on crowdsourcing for search and data mining (CSDM 2011) (pp. 11–14). ACM.
Feild, H., Jones, R., Miller, R. C., Nayak, R., Churchill, E. F., & Velipasaoglu, E. (2010). Logging the search self-efficacy of Amazon Mechanical Turkers. In M. Lease, V. Carvalho, & E. Yilmaz (Eds.), Proceedings of the ACM SIGIR 2010 workshop on crowdsourcing for search evaluation (CSE 2010) (pp. 27–30). Geneva, Switzerland.
Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76, 378–382.
Grady, C., & Lease, M. (2010). Crowdsourcing document relevance assessment with Mechanical Turk. In Proceedings of the NAACL HLT 2010 workshop on creating speech and language data with Amazon’s Mechanical Turk (pp. 172–179).
Grimes, C., Tang, D., & Russell, D. M. (2007). Query logs alone are not enough. In E. Amitay, C. G. Murray, & J. Teevan (Eds.), Query log analysis: Social and technological challenges. A workshop at the 16th International World Wide Web Conference (WWW 2007).
Hirth, M., Hoßfeld, T., & Tran-Gia, P. (2011). Anatomy of a crowdsourcing platform—using the example of microworkers.com. In Workshop on future internet and next generation networks (FINGNet). Seoul, Korea.
Howe, J. (2008). Crowdsourcing: Why the power of the crowd is driving the future of business. New York, NY: Crown Publishing Group.
Ipeirotis, P. G. (2010b). Analyzing the Amazon Mechanical Turk marketplace. XRDS, 17, 16–21.
Ipeirotis, P. G., Provost, F., & Wang, J. (2010). Quality management on Amazon Mechanical Turk. In Proceedings of the ACM SIGKDD workshop on human computation, HCOMP ’10 (pp. 64–67). New York, NY: ACM.
Jain, S., & Parkes, D. C. (2009). The role of game theory in human computation systems. In Proceedings of the ACM SIGKDD workshop on human computation, HCOMP ’09 (pp. 58–61). New York, NY: ACM.
Kamps, J., Koolen, M., & Trotman, A. (2009). Comparative analysis of clicks and judgments for IR evaluation. In Proceedings of the workshop on web search click data (WSCD 2009) (pp. 80–87). New York, NY: ACM Press.
Kapelner, A., & Chandler, D. (2010). Preventing satisficing in online surveys: A ‘kapcha’ to ensure higher quality data. In The world’s first conference on the future of distributed work (CrowdConf 2010).
Kasneci, G., Van Gael, J., Herbrich, R., & Graepel, T. (2010). Bayesian knowledge corroboration with logical rules and user feedback. In Proceedings of the 2010 European conference on machine learning and knowledge discovery in databases: Part II, ECML PKDD ’10 (pp. 1–18). Berlin: Springer.
Kazai, G. (2011). In search of quality in crowdsourcing for search engine evaluation. In Advances in information retrieval—33rd European conference on IR research (ECIR 2011), LNCS, Vol. 6611 (pp. 165–176). Springer.
Kazai, G., Doucet, A., & Landoni, M. (2008). Overview of the INEX 2008 book track. In INEX (pp. 106–123).
Kazai, G., Kamps, J., Koolen, M., & Milic-Frayling, N. (2011a). Crowdsourcing for book search evaluation: Impact of quality on comparative system ranking. In Proceedings of the 34th annual international ACM SIGIR conference on research and development in information retrieval. ACM.
Kazai, G., Kamps, J., & Milic-Frayling, N. (2011b). Worker types and personality traits in crowdsourcing relevance labels. In Proceedings of the 20th ACM international conference on information and knowledge management (pp. 1941–1944). ACM.
Kazai, G., Milic-Frayling, N., & Costello, J. (2009). Towards methods for the collective gathering and quality control of relevance assessments. In Proceedings of the 32nd international ACM SIGIR conference on research and development in information retrieval, SIGIR ’09 (pp. 452–459). New York, NY: ACM.
Kittur, A., Chi, E. H., & Suh, B. (2008). Crowdsourcing user studies with Mechanical Turk. In Proceedings of the twenty-sixth annual SIGCHI conference on human factors in computing systems (CHI ’08) (pp. 453–456). ACM.
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174.
Le, J., Edmonds, A., Hester, V., & Biewald, L. (2010). Ensuring quality in crowdsourced search relevance evaluation. In V. Carvalho, M. Lease, & E. Yilmaz (Eds.), SIGIR workshop on crowdsourcing for search evaluation (pp. 17–20). New York, NY: ACM.
Lease, M. (2011). On quality control and machine learning in crowdsourcing. In Proceedings of the 3rd human computation workshop (HCOMP) at AAAI (pp. 97–102).
Lease, M., & Kazai, G. (2011). Overview of the TREC 2011 crowdsourcing track. In Proceedings of the Text REtrieval Conference (TREC).
Marsden, P. (2009). Crowdsourcing. Contagious Magazine, 18, 24–28.
Mason, W., & Suri, S. (2011). Conducting behavioral research on Amazon’s Mechanical Turk. Behavior Research Methods.
Mason, W., & Watts, D. J. (2009). Financial incentives and the “performance of crowds”. In HCOMP ’09: Proceedings of the ACM SIGKDD workshop on human computation (pp. 77–85). New York, NY: ACM.
Nowak, S., & Rüger, S. (2010). How reliable are annotations via crowdsourcing: A study about inter-annotator agreement for multi-label image annotation. In MIR ’10: Proceedings of the international conference on multimedia information retrieval (pp. 557–566). New York, NY: ACM.
Oppenheim, A. N. (1966). Questionnaire design and attitude measurement. London: Heinemann.
Quinn, A. J., & Bederson, B. B. (2009). A taxonomy of distributed human computation. Technical Report HCIL-2009-23. University of Maryland.
Quinn, A. J., & Bederson, B. B. (2011). Human computation: A survey and taxonomy of a growing field. In Proceedings of CHI 2011.
Radlinski, F., Kurup, M., & Joachims, T. (2008). How does clickthrough data reflect retrieval quality? In J. G. Shanahan, S. Amer-Yahia, I. Manolescu, Y. Zhang, D. A. Evans, A. Kolcz, K. S. Choi, & A. Chowdhury (Eds.), CIKM (pp. 43–52). ACM.
Ross, J., Irani, L., Silberman, M. S., Zaldivar, A., & Tomlinson, B. (2010). Who are the crowdworkers? Shifting demographics in Mechanical Turk. In Proceedings of the 28th international conference on human factors in computing systems, CHI 2010, extended abstracts volume (pp. 2863–2872). ACM.
Shaw, A., Horton, J., & Chen, D. (2011). Designing incentives for inexpert human raters. In Proceedings of the ACM conference on computer supported cooperative work, CSCW ’11.
Silberman, M. S., Ross, J., Irani, L., & Tomlinson, B. (2010). Sellers’ problems in human computation markets. In Proceedings of the ACM SIGKDD workshop on human computation (HCOMP ’10) (pp. 18–21). ACM.
Snow, R., O’Connor, B., Jurafsky, D., & Ng, A. Y. (2008). Cheap and fast—but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of the conference on empirical methods in natural language processing (EMNLP ’08) (pp. 254–263). ACL.
von Ahn, L., & Dabbish, L. (2004). Labeling images with a computer game. In Proceedings of the SIGCHI conference on human factors in computing systems, CHI ’04 (pp. 319–326). New York, NY: ACM.
Voorhees, E. M. (2000). Variations in relevance judgments and the measurement of retrieval effectiveness. Information Processing & Management, 36(5), 697–716.
Voorhees, E. M., & Harman, D. K. (Eds.). (2005). TREC: Experiment and evaluation in information retrieval. Cambridge, MA: MIT Press.
Vuurens, J., de Vries, A. P., & Eickhoff, C. (2011). How much spam can you take? An analysis of crowdsourcing results to increase accuracy. In M. Lease, V. Hester, A. Sorokin, & E. Yilmaz (Eds.), Proceedings of the ACM SIGIR 2011 workshop on crowdsourcing for information retrieval (CIR 2011) (pp. 48–55). Beijing, China.
Welinder, P., Branson, S., Belongie, S., & Perona, P. (2010). The multidimensional wisdom of crowds. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel, & A. Culotta (Eds.), Advances in neural information processing systems (NIPS ’10) (pp. 2424–2432).
Zhu, D., & Carterette, B. (2010). An analysis of assessor behavior in crowdsourced preference judgments. In SIGIR 2010 workshop on crowdsourcing for search evaluation.
Metadata
Title
An analysis of human factors and label accuracy in crowdsourcing relevance judgments
Authors
Gabriella Kazai
Jaap Kamps
Natasa Milic-Frayling
Publication date
01.04.2013
Publisher
Springer Netherlands
Published in
Discover Computing / Issue 2/2013
Print ISSN: 2948-2984
Electronic ISSN: 2948-2992
DOI
https://doi.org/10.1007/s10791-012-9205-0
