ABSTRACT
Crowdsourcing has recently become popular among machine learning researchers and social scientists as an effective way to collect large-scale experimental data from distributed workers. A key problem in extracting useful information from these cheap but potentially unreliable answers is identifying reliable workers and unambiguous tasks. For objective tasks with a single correct answer, previous work can estimate worker reliability and task clarity under the single gold standard assumption. For subjective tasks that admit multiple reasonable answers around which workers cluster, a phenomenon called schools of thought, existing models cannot be applied directly. In this work, we present a statistical model that estimates worker reliability and task clarity without resorting to the single gold standard assumption. The model explicitly characterizes the grouping behavior that forms schools of thought through a rank-1 factorization of a worker-task group-size matrix. Instead of performing an intermediate inference step, which can be expensive and unstable, we present an algorithm that analytically computes the sizes of the different groups. We perform extensive empirical studies on real data collected from Amazon Mechanical Turk. Our method discovers the schools of thought, produces reasonable estimates of worker reliability and task clarity, and is robust to hyperparameter changes. Furthermore, the estimated worker reliability can be used to improve gold standard prediction for objective tasks.
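To make the rank-1 idea concrete, below is a minimal sketch (not the paper's estimator) of factorizing a worker-task group-size matrix with a truncated SVD: each entry is assumed to hold the size of the group a worker joined on a task, and the leading left and right singular vectors serve as rough per-worker and per-task scores. The function and variable names (`rank1_factorize`, `worker_score`, `task_score`) are illustrative, not from the paper.

```python
import numpy as np

def rank1_factorize(group_size):
    """Best rank-1 approximation of a worker-by-task group-size matrix.

    group_size[i, j] is assumed to be the size of the group (school of
    thought) that worker i joined on task j. The leading singular vectors
    give per-worker and per-task scores whose outer product is the best
    rank-1 reconstruction of the matrix in the least-squares sense.
    """
    U, s, Vt = np.linalg.svd(group_size, full_matrices=False)
    worker_score = np.abs(U[:, 0]) * np.sqrt(s[0])  # one score per worker
    task_score = np.abs(Vt[0, :]) * np.sqrt(s[0])   # one score per task
    return worker_score, task_score

# Toy usage: 4 workers x 3 tasks; larger entries mean the worker's answer
# agreed with a larger group on that task.
if __name__ == "__main__":
    G = np.array([[5., 4., 6.],
                  [5., 3., 6.],
                  [2., 1., 3.],
                  [4., 4., 5.]])
    workers, tasks = rank1_factorize(G)
    print("worker scores:", np.round(workers, 2))
    print("task scores:  ", np.round(tasks, 2))
```

In this reading, a consistently high worker score suggests a worker who tends to side with large groups, and a high task score suggests a task on which workers largely agree; the paper's actual model couples these quantities statistically rather than via a plain SVD.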