ABSTRACT
Information retrieval systems require human-contributed relevance labels for their training and evaluation. Increasingly, such labels are collected under the anonymous, uncontrolled conditions of crowdsourcing, leading to output of varied quality. While a range of quality assurance and control techniques has been developed to reduce noise during or after task completion, little is known about the workers themselves and about possible relationships between workers' characteristics and the quality of their work. In this paper, we ask what the relatively well- or poorly-performing crowds, working under specific task conditions, actually look like in terms of worker characteristics such as demographics or personality traits. Our findings show that the face of a crowd is in fact indicative of the quality of its work.
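The kind of analysis the abstract alludes to, relating worker characteristics to labeling accuracy, can be pictured with a minimal sketch. This is not the authors' code: it assumes gold relevance labels are available, scores each worker's accuracy against them, and aggregates accuracy by a self-reported attribute (the `age_group` field and all data names here are hypothetical).

```python
from collections import defaultdict

# Hypothetical records: (worker_id, doc_id, label) from a crowdsourcing task,
# plus gold labels and self-reported worker attributes. All names are
# illustrative assumptions, not the paper's actual data schema.
judgments = [
    ("w1", "d1", 1), ("w1", "d2", 0), ("w2", "d1", 0), ("w2", "d2", 0),
]
gold = {"d1": 1, "d2": 0}
worker_profile = {"w1": {"age_group": "25-34"}, "w2": {"age_group": "18-24"}}

def worker_accuracy(judgments, gold):
    """Fraction of a worker's labels that agree with the gold labels."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for worker, doc, label in judgments:
        if doc in gold:
            total[worker] += 1
            correct[worker] += int(label == gold[doc])
    return {w: correct[w] / total[w] for w in total}

def accuracy_by_trait(acc, profiles, trait):
    """Mean labeling accuracy per value of a worker characteristic."""
    buckets = defaultdict(list)
    for worker, a in acc.items():
        buckets[profiles[worker][trait]].append(a)
    return {value: sum(v) / len(v) for value, v in buckets.items()}

acc = worker_accuracy(judgments, gold)
print(accuracy_by_trait(acc, worker_profile, "age_group"))
```

On this toy data the script prints a mean accuracy per age bracket; the paper's question is whether such per-group differences are systematic enough that a crowd's composition predicts the quality of its labels.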