research-article
DOI: 10.1145/2835776.2835835

Quality Management in Crowdsourcing using Gold Judges Behavior

Published: 08 February 2016

ABSTRACT

Crowdsourcing relevance labels has become an accepted practice for the evaluation of IR systems, where the task of constructing a test collection is distributed over large populations of unknown users with widely varied skills and motivations. A typical method to check and ensure the quality of the crowd's output is to inject work tasks with known answers (gold tasks) on which workers' performance can be measured. However, gold tasks are expensive to create and have limited application. A more recent trend is to monitor the workers' interactions during a task and estimate their work quality based on their behavior. In this paper, we show that without gold behavior signals that reflect trusted interaction patterns, classifiers can perform poorly, especially for complex tasks, which can lead to high-quality crowd workers being blocked while poorly performing workers remain undetected. Through a series of crowdsourcing experiments, we compare the behaviors of trained professional judges and crowd workers and then use the trained judges' behavior signals as gold behavior to train a classifier to detect poorly performing crowd workers. Our experiments show that classification accuracy almost doubles in some tasks with the use of gold behavior data.
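
The sketch below is not the authors' implementation; it is a minimal, hypothetical illustration of the idea described in the abstract, assuming a scikit-learn style workflow. The behavioral features (dwell time, clicks, scroll events), the synthetic data, and the choice of a gradient-boosted classifier are all placeholder assumptions: interaction logs from trained professional judges provide the trusted ("gold") behavior class, logs from workers who failed conventional quality checks provide the untrusted class, and a classifier trained on these signals can then flag new crowd workers whose behavior looks untrusted.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def behavior_features(n, dwell_mean, clicks_mean, scroll_mean):
    # Per-task behavioral signals (illustrative only): dwell time in seconds,
    # click count, and scroll events recorded while judging.
    return np.column_stack([
        rng.normal(dwell_mean, 5.0, n).clip(min=1.0),
        rng.poisson(clicks_mean, n),
        rng.poisson(scroll_mean, n),
    ])

# Trusted "gold behavior" from trained professional judges (label 1).
gold = behavior_features(200, dwell_mean=45.0, clicks_mean=6, scroll_mean=10)
# Behavior of workers whose answers failed conventional quality checks (label 0).
poor = behavior_features(200, dwell_mean=8.0, clicks_mean=1, scroll_mean=1)

X = np.vstack([gold, poor])
y = np.concatenate([np.ones(len(gold)), np.zeros(len(poor))])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

clf = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
print("held-out accuracy:", accuracy_score(y_test, clf.predict(X_test)))

# New crowd workers whose interaction logs resemble the untrusted class can
# be flagged for review rather than relying on expensive gold tasks alone.
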


Published in

WSDM '16: Proceedings of the Ninth ACM International Conference on Web Search and Data Mining
February 2016
746 pages
ISBN: 9781450337168
DOI: 10.1145/2835776

      Copyright © 2016 ACM


      Publisher

      Association for Computing Machinery

      New York, NY, United States


Acceptance Rates

WSDM '16 paper acceptance rate: 67 of 368 submissions, 18%
Overall acceptance rate: 498 of 2,863 submissions, 17%
