Published in: Discover Computing 2/2013

01.04.2013 | Crowd Sourcing

Crowdsourcing interactions: using crowdsourcing for evaluating interactive information retrieval systems

Authors: Guido Zuccon, Teerapong Leelanupab, Stewart Whiting, Emine Yilmaz, Joemon M. Jose, Leif Azzopardi

Abstract

In the field of information retrieval (IR), researchers and practitioners are often faced with a demand for valid approaches to evaluate the performance of retrieval systems. The Cranfield experiment paradigm has been dominant for the in-vitro evaluation of IR systems. As an alternative to this paradigm, laboratory-based user studies have been widely used to evaluate interactive information retrieval (IIR) systems, and at the same time to investigate users’ information searching behaviours. Major drawbacks of laboratory-based user studies for evaluating IIR systems include the high monetary and temporal costs involved in setting up and running the experiments, the lack of heterogeneity in the user population, and the limited scale of the experiments, which usually involve a relatively small set of users. In this paper, we propose an alternative experimental methodology to laboratory-based user studies. Our novel experimental methodology uses a crowdsourcing platform as a means of engaging study participants. Through crowdsourcing, our experimental methodology can capture user interactions and searching behaviours at a lower cost, with more data, and within a shorter period than traditional laboratory-based user studies, and can therefore be used to assess the performance of IIR systems. In this article, we show the characteristic differences of our approach with respect to traditional IIR experimental and evaluation procedures. We also present a use case study comparing crowdsourcing-based evaluation with laboratory-based evaluation of IIR systems, which can serve as a tutorial for setting up crowdsourcing-based IIR evaluations.
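The interaction-capture idea at the heart of this methodology can be illustrated with a small example. The sketch below is our own illustration, not the authors’ system; the endpoint name and event fields are assumptions. It shows a minimal logging endpoint to which a crowdsourced search interface could POST interaction events such as queries and clicks, tagged with worker and task identifiers so that search sessions can later be reconstructed and analysed.

```python
# Minimal sketch of server-side interaction logging for a crowdsourced IIR study.
# Assumptions: a "/log" endpoint and JSON events carrying worker_id/task_id fields;
# these names are illustrative, not taken from the paper.
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

LOG_FILE = "interactions.jsonl"  # one JSON event per line


class LogHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/log":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        event = json.loads(self.rfile.read(length))
        # e.g. {"worker_id": "...", "task_id": "...", "system": "A",
        #       "type": "query" or "click", "payload": "..."}
        event["server_time"] = time.time()
        with open(LOG_FILE, "a") as log:
            log.write(json.dumps(event) + "\n")
        self.send_response(204)  # no content; the search interface keeps running
        self.end_headers()


if __name__ == "__main__":
    HTTPServer(("", 8080), LogHandler).serve_forever()
```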


Footnotes
1
The TREC Interactive Track (e.g. see Over 1997, 2001) represents a notable exception.
 
4
We ignore the possibility of interviewing workers, given the remote and asymmetric nature of crowdsourcing.
 
5
Although similar considerations may also apply to laboratory-based user studies.
 
6
http://www.utest.com/ allows requesters to access a large population for testing software applications.
 
7
It can be argued that the average user of crowdsourcing platforms is reasonably well educated, knows English, and understands how computers and crowdsourcing platforms work; furthermore, such users have sufficient economic means and geographic access to use a computer and the Internet.
 
8
Researchers select a group of qualified subjects and ask for their personal information.
 
10
The two systems employed in our experiments are described in Sect. 4.4.
 
11
See, for example, the work of Ipeirotis et al., which presents an algorithm for identifying bias and errors in labelling tasks by assigning each worker a score that represents the quality of their work (Ipeirotis et al. 2010).
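To make the idea behind such quality scoring concrete, the following is a simplified sketch based on agreement with a majority-vote consensus; it is not the algorithm of Ipeirotis et al. (2010), which jointly estimates label quality and worker error rates, and the function and field names are illustrative.

```python
# Simplified illustration of assigning each worker a quality score from
# redundant labels (majority-vote agreement). Not the Ipeirotis et al. (2010)
# algorithm; only the general idea of scoring workers is conveyed here.
from collections import Counter, defaultdict


def worker_quality(labels):
    """labels: iterable of (worker_id, item_id, label) triples with redundant labelling."""
    by_item = defaultdict(list)
    for worker, item, label in labels:
        by_item[item].append(label)

    # Consensus label per item = simple majority vote.
    consensus = {item: Counter(votes).most_common(1)[0][0]
                 for item, votes in by_item.items()}

    agree, total = defaultdict(int), defaultdict(int)
    for worker, item, label in labels:
        total[worker] += 1
        agree[worker] += int(label == consensus[item])
    # Score = fraction of a worker's labels that agree with the consensus.
    return {worker: agree[worker] / total[worker] for worker in total}


# Example: worker w3 disagrees with the majority on document d1.
print(worker_quality([("w1", "d1", "rel"), ("w2", "d1", "rel"), ("w3", "d1", "nonrel")]))
```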
 
13
A Graeco-Latin square is formed by superimposing two orthogonal Latin squares in an n × n arrangement over two sets of variables, e.g. systems and tasks.
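A small construction example may help. The sketch below is an illustration, not the authors’ design-generation code: it builds a Graeco-Latin square via the standard cyclic construction, which works for odd n, and each cell pairs a system index with a task index so that every combination occurs exactly once across the design.

```python
# Sketch: build a Graeco-Latin square by superimposing two orthogonal Latin
# squares. The cyclic construction used here only works for odd n; function and
# variable names are illustrative, not taken from the paper.

def graeco_latin_square(n):
    if n % 2 == 0:
        raise ValueError("this simple cyclic construction requires odd n")
    systems = [[(i + j) % n for j in range(n)] for i in range(n)]       # first Latin square
    tasks = [[(i + 2 * j) % n for j in range(n)] for i in range(n)]     # orthogonal Latin square
    # Orthogonality check: every (system, task) pair occurs exactly once.
    pairs = {(systems[i][j], tasks[i][j]) for i in range(n) for j in range(n)}
    assert len(pairs) == n * n
    return [[(systems[i][j], tasks[i][j]) for j in range(n)] for i in range(n)]


# Rows could be participant groups, columns session positions, and each cell the
# (system, task) combination presented at that position.
for row in graeco_latin_square(3):
    print(row)
```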
 
14
This is due to several constraints of laboratory-based user experiments, such as the limited number of experimenters and participants, as well as limited time and budget.
 
15
The observation that qualification tests slow down batch completion is consistent with the findings reported by Alonso and Baeza-Yates (2011).
 
17
Recall that, due to the Bing API’s limitations, we retrieved a maximum of 50 results per query.
 
18
Workers were contacted using the API service made available by AMT.
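As an illustration of such programmatic contact, the hypothetical snippet below uses the present-day boto3 MTurk client, which is an assumption about tooling (the study used the AMT API available at the time); the worker IDs and message text are placeholders.

```python
# Hypothetical sketch: messaging workers through the AMT requester API using
# boto3. The sandbox endpoint, worker IDs, and message text are placeholders.
import boto3

mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

response = mturk.notify_workers(
    Subject="Follow-up on our search task",
    MessageText="Thank you for completing our search HIT.",
    WorkerIds=["A1EXAMPLEWORKERID"],  # up to 100 worker IDs per call
)
print(response["NotifyWorkersFailureStatuses"])  # empty if all messages were sent
```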
 
19
Recall that we did not implement filters to exclude this kind of behaviour.
 
20
As judged by the workers themselves.
 
References
Alonso, O., & Baeza-Yates, R. (2011). Design and implementation of relevance assessments using crowdsourcing. In P. Clough, C. Foley, C. Gurrin, G. Jones, W. Kraaij, H. Lee, & V. Murdock (Eds.), Advances in information retrieval, volume 6611 of Lecture Notes in Computer Science (pp. 153–164). New York: Springer.
Alonso, O., & Mizzaro, S. (2009). Can we get rid of TREC assessors? Using Mechanical Turk for relevance assessment. In SIGIR ’09 workshop on the future of IR evaluation.
Alonso, O., Rose, D. E., & Stewart, B. (2008). Crowdsourcing for relevance evaluation. SIGIR Forum, 42, 9–15.
Arguello, J., Diaz, F., Callan, J., & Carterette, B. (2011). A methodology for evaluating aggregated search results. In P. Clough, C. Foley, C. Gurrin, G. Jones, W. Kraaij, H. Lee, & V. Murdock (Eds.), Advances in information retrieval, volume 6611 of Lecture Notes in Computer Science (pp. 141–152). New York: Springer.
Carter, P. J. (2007). IQ and psychometric tests. London: Kogan Page.
Dang, H. T., Kelly, D., & Lin, J. (2007). Overview of the TREC 2007 question answering track. In Proceedings of the Text REtrieval Conference.
Dang, H. T., Lin, J., & Kelly, D. (2006). Overview of the TREC 2006 question answering track. In Proceedings of the Text REtrieval Conference.
Feild, H., Jones, R., Miller, R. C., Nayak, R., Churchill, E. F., & Velipasaoglu, E. (2009). Logging the search self-efficacy of Amazon Mechanical Turkers. In SIGIR 2009 workshop on crowdsourcing for search evaluation.
Grady, C., & Lease, M. (2010). Crowdsourcing document relevance assessment with Mechanical Turk. In Proceedings of the NAACL HLT 2010 workshop on creating speech and language data with Amazon’s Mechanical Turk, CSLDAMT ’10 (pp. 172–179). Stroudsburg, PA, USA: Association for Computational Linguistics.
Grimes, C., Tang, D., & Russell, D. (2007). Query logs alone are not enough. In Workshop on query log analysis at WWW.
Ipeirotis, P. G. (2010a). Analyzing the Amazon Mechanical Turk marketplace. XRDS, 17, 16–21.
Ipeirotis, P. G., Provost, F., & Wang, J. (2010). Quality management on Amazon Mechanical Turk. In Proceedings of the ACM SIGKDD workshop on human computation, HCOMP ’10 (pp. 64–67). New York, NY, USA: ACM.
Kazai, G. (2011). In search of quality in crowdsourcing for search engine evaluation. In P. Clough, C. Foley, C. Gurrin, G. Jones, W. Kraaij, H. Lee, & V. Murdock (Eds.), Advances in information retrieval, volume 6611 of Lecture Notes in Computer Science (pp. 165–176). UK: Springer.
Kelly, D. (2009). Methods for evaluating interactive information retrieval systems with users. Foundations and Trends in Information Retrieval, 3(1–2), 1–224.
Kelly, D., Dumais, S., & Pedersen, J. (2009). Evaluation challenges and directions for information-seeking support systems. Computer, 42(3), 60–66.
Leelanupab, T. (2012). A ranking framework and evaluation for diversity-based retrieval. PhD thesis, University of Glasgow.
Leelanupab, T., Hopfgartner, F., & Jose, J. (2009). User centred evaluation of a recommendation based image browsing system. In Proceedings of the 4th Indian international conference on artificial intelligence (pp. 558–573). Citeseer.
Lin, C. Y. (2004). ROUGE: A package for automatic evaluation of summaries. In Proceedings of the workshop on text summarization, ACL 2004, Barcelona, Spain.
Mason, W., & Watts, D. J. (2009). Financial incentives and the performance of crowds. In Proceedings of the ACM SIGKDD workshop on human computation, HCOMP ’09 (pp. 77–85). New York, NY, USA: ACM.
McCreadie, R., Macdonald, C., & Ounis, I. (2011). Crowdsourcing blog track top news judgments at TREC. In M. Lease, V. Carvalho, & E. Yilmaz (Eds.), Proceedings of the workshop on crowdsourcing for search and data mining (CSDM) at the 4th ACM international conference on web search and data mining (WSDM) (pp. 23–26). Hong Kong, China, February 2011.
Over, P. (1997). TREC-6 interactive track report. In Proceedings of the Text REtrieval Conference (pp. 57–64).
Over, P. (2001). The TREC interactive track: An annotated bibliography. Information Processing & Management, 37(3), 369–381.
Potthast, M., Stein, B., Barrón-Cedeño, A., & Rosso, P. (2010). An evaluation framework for plagiarism detection. In Proceedings of the 23rd international conference on computational linguistics: Posters, COLING ’10 (pp. 997–1005). Stroudsburg, PA, USA: Association for Computational Linguistics.
Ross, J., Zaldivar, A., Irani, L., Tomlinson, B., & Silberman, M. S. (2010). Who are the crowdworkers? Shifting demographics in Mechanical Turk. In Proceedings of CHI 2010 (pp. 2863–2872).
Santos, R., Peng, J., Macdonald, C., & Ounis, I. (2010). Explicit search result diversification through sub-queries. In C. Gurrin, Y. He, G. Kazai, U. Kruschwitz, S. Little, T. Roelleke, S. Rüger, & K. van Rijsbergen (Eds.), Advances in information retrieval, volume 5993 of Lecture Notes in Computer Science (pp. 87–99). UK: Springer.
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2001). Experimental and quasi-experimental designs for generalized causal inference (2nd ed.). Boston: Houghton Mifflin.
Voorhees, E. M. (2005). TREC: Improving information access through evaluation. Bulletin of the American Society for Information Science and Technology, 32(1), 16–21.
Voorhees, E. M., & Harman, D. (2005). TREC: Experiment and evaluation in information retrieval. Digital Libraries and Electronic Publishing. Cambridge, MA: MIT Press.
Zuccon, G., Leelanupab, T., Whiting, S., Yilmaz, E., Jose, J., & Azzopardi, L. (2011a). Crowdsourcing interactions—Capturing query sessions through crowdsourcing. In B. Carterette, E. Kanoulas, P. Clough, & M. Sanderson (Eds.), Proceedings of the workshop on information retrieval over query sessions at the European conference on information retrieval (ECIR). Dublin, Ireland, April 2011.
Zuccon, G., Leelanupab, T., Whiting, S., Jose, J., & Azzopardi, L. (2011b). Crowdsourcing interactions—A proposal for capturing user interactions through crowdsourcing. In M. Lease, V. Carvalho, & E. Yilmaz (Eds.), Proceedings of the workshop on crowdsourcing for search and data mining (CSDM) at the 4th ACM international conference on web search and data mining (WSDM) (pp. 35–38). Hong Kong, China, February 2011.
Metadata
Title
Crowdsourcing interactions: using crowdsourcing for evaluating interactive information retrieval systems
Authors
Guido Zuccon
Teerapong Leelanupab
Stewart Whiting
Emine Yilmaz
Joemon M. Jose
Leif Azzopardi
Publication date
01.04.2013
Publisher
Springer Netherlands
Published in
Discover Computing / Issue 2/2013
Print ISSN: 2948-2984
Electronic ISSN: 2948-2992
DOI
https://doi.org/10.1007/s10791-012-9206-z
