ABSTRACT
The primary problem confronting any new kind of search task is how to bootstrap a reliable and repeatable evaluation campaign, and a crowdsourcing approach provides many advantages. However, can such crowdsourced evaluations be repeated reliably over long periods of time? To investigate, we create an evaluation campaign for the semantic search task of keyword-based ad-hoc object retrieval. In contrast to traditional web search, object search aims at retrieving factual assertions about real-world objects rather than web pages with textual descriptions. Using the first large-scale evaluation campaign that specifically targets ad-hoc Web object retrieval over a number of deployed systems, we demonstrate that crowdsourced evaluation campaigns can be repeated over time while maintaining reliable results. Furthermore, we show that these results are comparable to those of expert judges when ranking systems, and that they hold across different evaluation and relevance metrics. This work provides empirical support for scalable, reliable, and repeatable search system evaluation using crowdsourcing.
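The claim that crowd workers and expert judges produce comparable system rankings is typically quantified with a rank correlation coefficient such as Kendall's tau over the two orderings of systems. A minimal sketch (the system names and rank positions below are hypothetical, purely for illustration):

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau rank correlation between two orderings of the
    same systems. rank_a / rank_b map system name -> rank position."""
    systems = list(rank_a)
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        # Pair is concordant if both rankings order s and t the same way
        agreement = (rank_a[s] - rank_a[t]) * (rank_b[s] - rank_b[t])
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) / 2
    return (concordant - discordant) / n_pairs

# Hypothetical rankings of four retrieval systems
expert = {"sys_a": 1, "sys_b": 2, "sys_c": 3, "sys_d": 4}
crowd  = {"sys_a": 1, "sys_b": 3, "sys_c": 2, "sys_d": 4}
print(kendall_tau(expert, crowd))  # one swapped pair out of 6 -> 4/6
```

A tau near 1 indicates the crowd-based campaign orders systems almost identically to the expert-based one; this is the standard way TREC-style studies summarize judge exchangeability.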