Abstract
In this article, we propose an approach to improve quality in crowdsourcing (CS) tasks using Task Completion Time (TCT) as a source of information about the reliability of workers in a game-theoretical competitive scenario. Our approach is based on the hypothesis that some workers are more risk-inclined and tend to gamble with their use of time when competing with other workers. This hypothesis is supported by our previous simulation study. We test our approach on 35 topics from experiments on the TREC-8 collection, with documents assessed as relevant or non-relevant by crowdsourced workers in both a competitive (referred to as “Game”) and a non-competitive (referred to as “Base”) scenario. We find that competition changes the distributions of TCT, making them sensitive to the quality (i.e., wrong or right) and outcome (i.e., relevant or non-relevant) of the assessments. We also test an optimal function of TCT as a weight in a weighted majority voting scheme. From probabilistic considerations, we derive a theoretical upper bound on the weighted majority performance of cohorts of 2, 3, 4, and 5 workers, which we use as a criterion to evaluate our weighting scheme. We find that our approach achieves remarkable performance, significantly closing the gap between the accuracy of the obtained relevance judgements and the upper bound. Since it exploits TCT, a quantity available in any CS task, we believe our approach is cost-effective and can therefore be applied for quality assurance in crowdsourcing micro-tasks.
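The two ingredients described above can be sketched in a few lines of code. This is a minimal illustration only: the logistic `tct_weight` mapping and its `scale` parameter are assumptions for the example, not the optimal function of TCT derived in the article, and `majority_accuracy` is the standard binomial probability that a simple majority of independent workers is correct, shown as the kind of probabilistic reference point against which a weighting scheme can be compared.

```python
import math
from math import comb

def tct_weight(tct_seconds, scale=10.0):
    # Hypothetical logistic mapping from Task Completion Time to a
    # reliability weight; the article derives its own optimal function.
    return 1.0 / (1.0 + math.exp(-tct_seconds / scale))

def weighted_majority(votes):
    # votes: list of (label, tct_seconds), label 1 = relevant, 0 = non-relevant.
    # Each vote contributes its TCT-derived weight to its label's total.
    totals = {0: 0.0, 1: 0.0}
    for label, tct in votes:
        totals[label] += tct_weight(tct)
    return max(totals, key=totals.get)

def majority_accuracy(p, n):
    # Probability that a simple majority of n independent workers, each
    # correct with probability p, returns the right label (ties split 50/50).
    acc = 0.0
    for k in range(n + 1):
        prob = comb(n, k) * p ** k * (1 - p) ** (n - k)
        if 2 * k > n:
            acc += prob
        elif 2 * k == n:
            acc += 0.5 * prob
    return acc

# Three workers judge one document: the two slower votes outweigh the quick one.
print(weighted_majority([(1, 12.0), (1, 30.0), (0, 4.0)]))  # -> 1
print(round(majority_accuracy(0.7, 3), 3))                  # -> 0.784
```

Under these assumptions, a single fast (and possibly careless) vote carries less weight than slower votes, and the binomial curve gives an unweighted-cohort baseline for cohorts of 2 to 5 workers.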
A Game Theory Approach for Estimating Reliability of Crowdsourced Relevance Assessments