Published in: Discover Computing 2/2013

01.04.2013 | Crowd Sourcing

Crowdsourcing interactions: using crowdsourcing for evaluating interactive information retrieval systems

Authors: Guido Zuccon, Teerapong Leelanupab, Stewart Whiting, Emine Yilmaz, Joemon M. Jose, Leif Azzopardi

Abstract

In the field of information retrieval (IR), researchers and practitioners are often faced with a demand for valid approaches to evaluate the performance of retrieval systems. The Cranfield experiment paradigm has been dominant for the in-vitro evaluation of IR systems. As an alternative to this paradigm, laboratory-based user studies have been widely used to evaluate interactive information retrieval (IIR) systems, and at the same time to investigate users’ information searching behaviours. Major drawbacks of laboratory-based user studies for evaluating IIR systems include the high monetary and temporal costs involved in setting up and running the experiments, the lack of heterogeneity in the user population, and the limited scale of the experiments, which usually involve a relatively small set of users. In this paper, we propose an alternative experimental methodology to laboratory-based user studies. Our novel experimental methodology uses a crowdsourcing platform as a means of engaging study participants. Through crowdsourcing, our experimental methodology can capture user interactions and searching behaviours at a lower cost, with more data, and within a shorter period than traditional laboratory-based user studies, and can therefore be used to assess the performance of IIR systems. In this article, we show the characteristic differences of our approach with respect to traditional IIR experimental and evaluation procedures. We also present a use case study comparing crowdsourcing-based evaluation with laboratory-based evaluation of IIR systems, which can serve as a tutorial for setting up crowdsourcing-based IIR evaluations.
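The interaction-capture idea at the heart of this methodology can be illustrated with a small example. The sketch below is our own illustration, not the authors’ system; the endpoint name and event fields are assumptions. It shows a minimal logging endpoint to which a crowdsourced search interface could POST interaction events such as queries and clicks, tagged with worker and task identifiers so that search sessions can later be reconstructed and analysed.

```python
# Minimal sketch of server-side interaction logging for a crowdsourced IIR study.
# Assumptions: a "/log" endpoint and JSON events carrying worker_id/task_id fields;
# these names are illustrative, not taken from the paper.
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

LOG_FILE = "interactions.jsonl"  # one JSON event per line


class LogHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/log":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        event = json.loads(self.rfile.read(length))
        # e.g. {"worker_id": "...", "task_id": "...", "system": "A",
        #       "type": "query" or "click", "payload": "..."}
        event["server_time"] = time.time()
        with open(LOG_FILE, "a") as log:
            log.write(json.dumps(event) + "\n")
        self.send_response(204)  # no content; the search interface keeps running
        self.end_headers()


if __name__ == "__main__":
    HTTPServer(("", 8080), LogHandler).serve_forever()
```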


Footnotes
1
The TREC Interactive Track (e.g. see Over 1997, 2001) represents a notable exception.
 
4
We ignore the possibility of interviewing workers, given the remote and asymmetric nature of crowdsourcing.
 
5
Although similar considerations may also apply to laboratory-based user studies.
 
6
http://www.utest.com/ allows requesters to access a large population for testing software applications.
 
7
It can be argued that the average user of crowdsourcing platforms is reasonably well educated, knows English, and understands how computers and crowdsourcing platforms work; furthermore, such users have sufficient economic means and geographic access to use a computer and the Internet.
 
8
Researchers select a group of qualified subjects and ask for their personal information.
 
10
The two systems employed in our experiments are described in Sect. 4.4.
 
11
See, for example, the work of Ipeirotis et al., which presents an algorithm for identifying bias and errors in labelling tasks by assigning each worker a score that represents the quality of their work (Ipeirotis et al. 2010).
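To make the idea behind such quality scoring concrete, the following is a simplified sketch based on agreement with a majority-vote consensus; it is not the algorithm of Ipeirotis et al. (2010), which jointly estimates label quality and worker error rates, and the function and field names are illustrative.

```python
# Simplified illustration of assigning each worker a quality score from
# redundant labels (majority-vote agreement). Not the Ipeirotis et al. (2010)
# algorithm; only the general idea of scoring workers is conveyed here.
from collections import Counter, defaultdict


def worker_quality(labels):
    """labels: iterable of (worker_id, item_id, label) triples with redundant labelling."""
    by_item = defaultdict(list)
    for worker, item, label in labels:
        by_item[item].append(label)

    # Consensus label per item = simple majority vote.
    consensus = {item: Counter(votes).most_common(1)[0][0]
                 for item, votes in by_item.items()}

    agree, total = defaultdict(int), defaultdict(int)
    for worker, item, label in labels:
        total[worker] += 1
        agree[worker] += int(label == consensus[item])
    # Score = fraction of a worker's labels that agree with the consensus.
    return {worker: agree[worker] / total[worker] for worker in total}


# Example: worker w3 disagrees with the majority on document d1.
print(worker_quality([("w1", "d1", "rel"), ("w2", "d1", "rel"), ("w3", "d1", "nonrel")]))
```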
 
13
A Graeco-Latin square is formed by superimposing two orthogonal Latin squares in an n × n arrangement over two sets of variables, e.g. systems and tasks.
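A small construction example may help. The sketch below is an illustration, not the authors’ design-generation code: it builds a Graeco-Latin square via the standard cyclic construction, which works for odd n, and each cell pairs a system index with a task index so that every combination occurs exactly once across the design.

```python
# Sketch: build a Graeco-Latin square by superimposing two orthogonal Latin
# squares. The cyclic construction used here only works for odd n; function and
# variable names are illustrative, not taken from the paper.

def graeco_latin_square(n):
    if n % 2 == 0:
        raise ValueError("this simple cyclic construction requires odd n")
    systems = [[(i + j) % n for j in range(n)] for i in range(n)]       # first Latin square
    tasks = [[(i + 2 * j) % n for j in range(n)] for i in range(n)]     # orthogonal Latin square
    # Orthogonality check: every (system, task) pair occurs exactly once.
    pairs = {(systems[i][j], tasks[i][j]) for i in range(n) for j in range(n)}
    assert len(pairs) == n * n
    return [[(systems[i][j], tasks[i][j]) for j in range(n)] for i in range(n)]


# Rows could be participant groups, columns session positions, and each cell the
# (system, task) combination presented at that position.
for row in graeco_latin_square(3):
    print(row)
```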
 
14
This is due to several constraints of laboratory-based user experiments, such as the limited number of experimenters and participants, as well as limited time and budget.
 
15
The observation that qualification tests slow down batch completion is consistent with the findings reported by Alonso and Baeza-Yates (2011).
 
17
Recall that, due to the Bing API’s limitations, we retrieved a maximum of 50 results per query.
 
18
Workers were contacted using the API service made available by AMT.
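As an illustration of such programmatic contact, the hypothetical snippet below uses the present-day boto3 MTurk client, which is an assumption about tooling (the study used the AMT API available at the time); the worker IDs and message text are placeholders.

```python
# Hypothetical sketch: messaging workers through the AMT requester API using
# boto3. The sandbox endpoint, worker IDs, and message text are placeholders.
import boto3

mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

response = mturk.notify_workers(
    Subject="Follow-up on our search task",
    MessageText="Thank you for completing our search HIT.",
    WorkerIds=["A1EXAMPLEWORKERID"],  # up to 100 worker IDs per call
)
print(response["NotifyWorkersFailureStatuses"])  # empty if all messages were sent
```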
 
19
Recall that we did not implement filters to exclude this kind of behaviour.
 
20
As judged by the workers themselves.
 
References
Alonso, O., & Baeza-Yates, R. (2011). Design and implementation of relevance assessments using crowdsourcing. In P. Clough, C. Foley, C. Gurrin, G. Jones, W. Kraaij, H. Lee, & V. Murdock (Eds.), Advances in information retrieval, volume 6611 of Lecture Notes in Computer Science (pp. 153–164). New York: Springer.
Alonso, O., & Mizzaro, S. (2009). Can we get rid of TREC assessors? Using Mechanical Turk for relevance assessment. In SIGIR ’09 workshop on the future of IR evaluation.
Alonso, O., Rose, D. E., & Stewart, B. (2008). Crowdsourcing for relevance evaluation. SIGIR Forum, 42, 9–15.
Arguello, J., Diaz, F., Callan, J., & Carterette, B. (2011). A methodology for evaluating aggregated search results. In P. Clough, C. Foley, C. Gurrin, G. Jones, W. Kraaij, H. Lee, & V. Murdock (Eds.), Advances in information retrieval, volume 6611 of Lecture Notes in Computer Science (pp. 141–152). New York: Springer.
Carter, P. J. (2007). IQ and psychometric tests. London: Kogan Page.
Dang, H. T., Kelly, D., & Lin, J. (2007). Overview of the TREC 2007 question answering track. In Proceedings of the Text REtrieval Conference.
Dang, H. T., Lin, J., & Kelly, D. (2006). Overview of the TREC 2006 question answering track. In Proceedings of the Text REtrieval Conference.
Feild, H., Jones, R., Miller, R. C., Nayak, R., Churchill, E. F., & Velipasaoglu, E. (2009). Logging the search self-efficacy of Amazon Mechanical Turkers. In SIGIR 2009 workshop on crowdsourcing for search evaluation.
Grady, C., & Lease, M. (2010). Crowdsourcing document relevance assessment with Mechanical Turk. In Proceedings of the NAACL HLT 2010 workshop on creating speech and language data with Amazon’s Mechanical Turk, CSLDAMT ’10 (pp. 172–179). Stroudsburg, PA, USA: Association for Computational Linguistics.
Grimes, C., Tang, D., & Russell, D. (2007). Query logs alone are not enough. In Workshop on query log analysis at WWW.
Ipeirotis, P. G. (2010a). Analyzing the Amazon Mechanical Turk marketplace. XRDS, 17, 16–21.
Ipeirotis, P. G., Provost, F., & Wang, J. (2010). Quality management on Amazon Mechanical Turk. In Proceedings of the ACM SIGKDD workshop on human computation, HCOMP ’10 (pp. 64–67). New York, NY, USA: ACM.
Kazai, G. (2011). In search of quality in crowdsourcing for search engine evaluation. In P. Clough, C. Foley, C. Gurrin, G. Jones, W. Kraaij, H. Lee, & V. Murdock (Eds.), Advances in information retrieval, volume 6611 of Lecture Notes in Computer Science (pp. 165–176). UK: Springer.
Kelly, D. (2009). Methods for evaluating interactive information retrieval systems with users. Foundations and Trends in Information Retrieval, 3(1–2), 1–224.
Kelly, D., Dumais, S., & Pedersen, J. (2009). Evaluation challenges and directions for information-seeking support systems. Computer, 42(3), 60–66.
Leelanupab, T. (2012). A ranking framework and evaluation for diversity-based retrieval. PhD thesis, University of Glasgow.
Leelanupab, T., Hopfgartner, F., & Jose, J. (2009). User centred evaluation of a recommendation based image browsing system. In Proceedings of the 4th Indian international conference on artificial intelligence (pp. 558–573). Citeseer.
Lin, C. Y. (2004). ROUGE: A package for automatic evaluation of summaries. In Proceedings of the workshop on text summarization, ACL 2004, Barcelona, Spain.
Mason, W., & Watts, D. J. (2009). Financial incentives and the performance of crowds. In Proceedings of the ACM SIGKDD workshop on human computation, HCOMP ’09 (pp. 77–85). New York, NY, USA: ACM.
McCreadie, R., Macdonald, C., & Ounis, I. (2011). Crowdsourcing blog track top news judgments at TREC. In M. Lease, V. Carvalho, & E. Yilmaz (Eds.), Proceedings of the workshop on crowdsourcing for search and data mining (CSDM) at the 4th ACM international conference on web search and data mining (WSDM) (pp. 23–26). Hong Kong, China, February 2011.
Over, P. (1997). TREC-6 interactive track report. In Proceedings of the Text REtrieval Conference (pp. 57–64).
Over, P. (2001). The TREC interactive track: An annotated bibliography. Information Processing & Management, 37(3), 369–381.
Potthast, M., Stein, B., Barrón-Cedeño, A., & Rosso, P. (2010). An evaluation framework for plagiarism detection. In Proceedings of the 23rd international conference on computational linguistics: Posters, COLING ’10 (pp. 997–1005). Stroudsburg, PA, USA: Association for Computational Linguistics.
Ross, J., Zaldivar, A., Irani, L., Tomlinson, B., & Silberman, M. S. (2010). Who are the crowdworkers? Shifting demographics in Mechanical Turk. In Proceedings of CHI 2010 (pp. 2863–2872).
Santos, R., Peng, J., Macdonald, C., & Ounis, I. (2010). Explicit search result diversification through sub-queries. In C. Gurrin, Y. He, G. Kazai, U. Kruschwitz, S. Little, T. Roelleke, S. Rüger, & K. van Rijsbergen (Eds.), Advances in information retrieval, volume 5993 of Lecture Notes in Computer Science (pp. 87–99). UK: Springer.
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2001). Experimental and quasi-experimental designs for generalized causal inference (2nd ed.). Boston: Houghton Mifflin.
Voorhees, E. M. (2005). TREC: Improving information access through evaluation. Bulletin of the American Society for Information Science and Technology, 32(1), 16–21.
Voorhees, E. M., & Harman, D. (2005). TREC: Experiment and evaluation in information retrieval. Digital Libraries and Electronic Publishing. Cambridge, MA: MIT Press.
Zuccon, G., Leelanupab, T., Whiting, S., Yilmaz, E., Jose, J., & Azzopardi, L. (2011a). Crowdsourcing interactions—Capturing query sessions through crowdsourcing. In B. Carterette, E. Kanoulas, P. Clough, & M. Sanderson (Eds.), Proceedings of the workshop on information retrieval over query sessions at the European conference on information retrieval (ECIR). Dublin, Ireland, April 2011.
Zuccon, G., Leelanupab, T., Whiting, S., Jose, J., & Azzopardi, L. (2011b). Crowdsourcing interactions—A proposal for capturing user interactions through crowdsourcing. In M. Lease, V. Carvalho, & E. Yilmaz (Eds.), Proceedings of the workshop on crowdsourcing for search and data mining (CSDM) at the 4th ACM international conference on web search and data mining (WSDM) (pp. 35–38). Hong Kong, China, February 2011.
Metadata
Title
Crowdsourcing interactions: using crowdsourcing for evaluating interactive information retrieval systems
Authors
Guido Zuccon
Teerapong Leelanupab
Stewart Whiting
Emine Yilmaz
Joemon M. Jose
Leif Azzopardi
Publication date
01.04.2013
Publisher
Springer Netherlands
Published in
Discover Computing / Issue 2/2013
Print ISSN: 2948-2984
Electronic ISSN: 2948-2992
DOI
https://doi.org/10.1007/s10791-012-9206-z
