DOI: 10.1145/2009916.2010039

Repeatable and reliable search system evaluation using crowdsourcing

Published: 24 July 2011

ABSTRACT

The primary problem confronting any new kind of search task is how to bootstrap a reliable and repeatable evaluation campaign, and a crowdsourcing approach offers many advantages. However, can such crowdsourced evaluations be repeated reliably over long periods of time? To answer this question, we investigate creating an evaluation campaign for the semantic search task of keyword-based ad-hoc object retrieval. In contrast to traditional search over web pages with textual descriptions, object search aims to retrieve information from factual assertions about real-world objects. Using the first large-scale evaluation campaign that specifically targets ad-hoc Web object retrieval over a number of deployed systems, we demonstrate that crowdsourced evaluation campaigns can be repeated over time while maintaining reliable results. Furthermore, we show that these results are comparable to those of expert judges when ranking systems, and that they hold across different evaluation and relevance metrics. This work provides empirical support for scalable, reliable, and repeatable search system evaluation using crowdsourcing.
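To make the system-ranking comparison concrete, the following is a minimal sketch (not the authors' code) of one standard way to check whether two judge pools agree at the system level: compute an effectiveness score per system under each set of relevance judgments and correlate the two induced rankings with Kendall's tau. All system names and score values below are hypothetical placeholders.

```python
# Minimal sketch: compare the system ranking induced by crowdsourced judgments
# with the ranking induced by expert judgments using Kendall's tau.
from scipy.stats import kendalltau

# Hypothetical per-system effectiveness scores (e.g., MAP or NDCG), computed
# once with crowdsourced relevance judgments and once with expert judgments.
crowd_scores = {"systemA": 0.41, "systemB": 0.35, "systemC": 0.29, "systemD": 0.22}
expert_scores = {"systemA": 0.44, "systemB": 0.31, "systemC": 0.30, "systemD": 0.20}

systems = sorted(crowd_scores)                 # fix a common system order
crowd = [crowd_scores[s] for s in systems]
expert = [expert_scores[s] for s in systems]

tau, p_value = kendalltau(crowd, expert)
print(f"Kendall's tau = {tau:.3f} (p = {p_value:.3f})")
# A tau close to 1 indicates the two judge pools order the systems consistently.
```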


Published in

SIGIR '11: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2011, 1374 pages
ISBN: 9781450307574
DOI: 10.1145/2009916
Copyright © 2011 ACM


Publisher: Association for Computing Machinery, New York, NY, United States
