ABSTRACT
Crowdsourcing relevance labels has become an accepted practice in the evaluation of IR systems, where the task of constructing a test collection is distributed over a large population of unknown workers with widely varying skills and motivations. A typical method for checking and ensuring the quality of the crowd's output is to inject work tasks with known answers (gold tasks) against which workers' performance can be measured. However, gold tasks are expensive to create and limited in their application. A more recent trend is to monitor workers' interactions during a task and estimate their work quality from their behavior. In this paper, we show that without gold behavior signals that reflect trusted interaction patterns, such classifiers can perform poorly, especially on complex tasks, leading to high-quality crowd workers being blocked while poorly performing workers remain undetected. Through a series of crowdsourcing experiments, we compare the behaviors of trained professional judges and crowd workers, and then use the trained judges' behavior signals as gold behavior to train a classifier that detects poorly performing crowd workers. Our experiments show that, with gold behavior data, classification accuracy almost doubles on some tasks.
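To make the approach concrete, below is a minimal sketch of a behavior-based quality classifier of the kind the abstract describes. It is not the paper's implementation: the behavior features (dwell time, clicks, scrolls), the use of scikit-learn's GradientBoostingClassifier, and the simulated session data are all illustrative assumptions. Trained judges' sessions supply the trusted (gold) behavior labels; the fitted model then scores incoming crowd sessions.

```python
# Minimal sketch (NOT the paper's implementation): train a behavior-based
# quality classifier using trained judges' interaction logs as "gold behavior".
# Feature names (dwell, clicks, scrolls) are illustrative assumptions.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def make_sessions(n, good):
    """Simulate per-task behavior features for trusted vs. poor workers."""
    dwell = rng.normal(30 if good else 8, 5, n)   # seconds spent on the task
    clicks = rng.poisson(6 if good else 2, n)     # in-task interactions
    scrolls = rng.poisson(4 if good else 1, n)    # document scrolling events
    return np.column_stack([dwell, clicks, scrolls])

# Gold behavior: trained judges' sessions (label 1) vs. known poor
# workers' sessions (label 0). In practice these would be logged, not simulated.
X = np.vstack([make_sessions(500, True), make_sessions(500, False)])
y = np.concatenate([np.ones(500), np.zeros(500)])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = GradientBoostingClassifier().fit(X_tr, y_tr)
print("held-out accuracy:", accuracy_score(y_te, clf.predict(X_te)))

# At run time, score incoming crowd sessions; a 0 prediction flags a
# likely poor performer for review rather than automatic blocking.
flags = clf.predict(make_sessions(10, good=False))
```

In a real deployment the positive class would come from logged judge interactions rather than simulation, and flagged sessions would feed a review or blocking workflow instead of being acted on directly.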