ABSTRACT
Crowdsourcing relevance labels has become an accepted practice in the evaluation of IR systems, where the task of constructing a test collection is distributed over a large population of unknown workers with widely varying skills and motivations. A typical method for checking and ensuring the quality of the crowd's output is to inject work tasks with known answers (gold tasks) against which workers' performance can be measured. However, gold tasks are expensive to create and limited in their application. A more recent trend is to monitor workers' interactions during a task and estimate their work quality from their behavior. In this paper, we show that without gold behavior signals that reflect trusted interaction patterns, such classifiers can perform poorly, especially on complex tasks, leading to high-quality crowd workers being blocked while poorly performing workers remain undetected. Through a series of crowdsourcing experiments, we compare the behaviors of trained professional judges and crowd workers, and then use the trained judges' behavior signals as gold behavior to train a classifier that detects poorly performing crowd workers. Our experiments show that, with gold behavior data, classification accuracy almost doubles on some tasks.
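To make the approach concrete, below is a minimal sketch of a behavior-based quality classifier of the kind the abstract describes. It is not the paper's implementation: the behavior features (dwell time, clicks, scrolls), the use of scikit-learn's GradientBoostingClassifier, and the simulated session data are all illustrative assumptions. Trained judges' sessions supply the trusted (gold) behavior labels; the fitted model then scores incoming crowd sessions.

```python
# Minimal sketch (NOT the paper's implementation): train a behavior-based
# quality classifier using trained judges' interaction logs as "gold behavior".
# Feature names (dwell, clicks, scrolls) are illustrative assumptions.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def make_sessions(n, good):
    """Simulate per-task behavior features for trusted vs. poor workers."""
    dwell = rng.normal(30 if good else 8, 5, n)   # seconds spent on the task
    clicks = rng.poisson(6 if good else 2, n)     # in-task interactions
    scrolls = rng.poisson(4 if good else 1, n)    # document scrolling events
    return np.column_stack([dwell, clicks, scrolls])

# Gold behavior: trained judges' sessions (label 1) vs. known poor
# workers' sessions (label 0). In practice these would be logged, not simulated.
X = np.vstack([make_sessions(500, True), make_sessions(500, False)])
y = np.concatenate([np.ones(500), np.zeros(500)])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = GradientBoostingClassifier().fit(X_tr, y_tr)
print("held-out accuracy:", accuracy_score(y_te, clf.predict(X_te)))

# At run time, score incoming crowd sessions; a 0 prediction flags a
# likely poor performer for review rather than automatic blocking.
flags = clf.predict(make_sessions(10, good=False))
```

In a real deployment the positive class would come from logged judge interactions rather than simulation, and flagged sessions would feed a review or blocking workflow instead of being acted on directly.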