Elsevier

Journal of Web Semantics

Volume 21, August 2013, Pages 14-29

Repeatable and reliable semantic search evaluation

https://doi.org/10.1016/j.websem.2013.05.005

Abstract

An increasing amount of structured data on the Web has attracted industry attention and renewed research interest in what is collectively referred to as semantic search. These solutions exploit the explicit semantics captured in structured data such as RDF for enhancing document representation and retrieval, or for finding answers by directly searching over the data. These data have been used for different tasks, and a wide range of corresponding semantic search solutions have been proposed in the past. However, it has been widely recognized that a standardized setting to evaluate and analyze the current state of the art in semantic search is needed to monitor and stimulate further progress in the field. In this paper, we present an evaluation framework for semantic search, analyze the framework with regard to repeatability and reliability, and report on our experience in applying it in the Semantic Search Challenges 2010 and 2011.

Introduction

There exists a wide range of semantic search solutions targeting different tasks, from using semantics captured in structured data for enhancing document representation (document retrieval [1], [2], [3], [4]) to processing keyword search queries and natural language questions directly over structured data (data retrieval [5], [6], [7]).

In general, the term ‘semantic search’ is highly contested, primarily because of the perpetual and endemic ambiguity around the term ‘semantics’. While ‘search’ is understood to be some form of information retrieval, ‘semantics’ typically refers to the interpretation of some syntactic structure into another structure, the ‘semantic’ structure, that more explicitly defines the meaning that is implicit in the surface syntax. Already in the early days of information retrieval (IR) research, thesauri capturing senses of words in the form of concepts and their relationships were used [8]. More recently, the large and increasing amount of structured data that is embedded in Web pages or available as publicly accessible datasets constitutes another popular type of semantic structure. The advantage here is that these data are commonly represented in RDF (Resource Description Framework), a standard knowledge representation formalism recommended by the W3C. RDF is a flexible graph-structured model that can capture the semantics embodied in information networks, social networks, as well as (semi-)structured data in databases. Data represented in RDF are composed of subject–predicate–object triples, where the subject is an identifier for a resource (e.g. a real-world object), the predicate is an identifier for a relationship, and the object is either an identifier of another resource or some information given as a concrete value (e.g. a string or a data-typed value). As opposed to the wide range of proprietary models that have been used to capture semantics in the past, RDF provides a standardized vehicle for representation, exchange and usage, resulting in a large and increasing amount of publicly and Web-accessible data that can be used for search (e.g. Linked Data).
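To make the triple model concrete, the following minimal Python sketch (not part of the original paper) represents a few statements as subject–predicate–object tuples and groups them into a simple graph view; all example.org URIs are hypothetical placeholders, and only the rdfs:label property is a real identifier.

    from collections import defaultdict

    # Each RDF statement is a (subject, predicate, object) tuple. Subjects and
    # predicates are URIs; objects are either URIs of other resources or literal
    # values. The example.org URIs below are hypothetical placeholders.
    triples = [
        ("http://example.org/resource/Tim_Berners-Lee",
         "http://example.org/property/birthPlace",
         "http://example.org/resource/London"),            # object: another resource
        ("http://example.org/resource/Tim_Berners-Lee",
         "http://www.w3.org/2000/01/rdf-schema#label",
         "Tim Berners-Lee"),                               # object: a literal value
    ]

    # Group the statements into a simple graph view: subject -> [(predicate, object)].
    graph = defaultdict(list)
    for s, p, o in triples:
        graph[s].append((p, o))

    print(graph["http://example.org/resource/Tim_Berners-Lee"])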

The explicit semantics captured by these structures have been used by semantic search systems for different tasks (e.g. document and data retrieval). More specifically, they can be used for enhancing the representation of the information needs (queries) and resources (documents, objects). While this helps in dealing with the core task of search, i.e., matching information needs against resources, it has been shown that semantics can be beneficial throughout the broader search process [9], from the specification of the needs in terms of queries, to matching queries against resources and ranking results, to refining the information needs, and up to the presentation and analysis of results.

While there is active research in the field of semantic search, it was concluded in plenary discussions at the Semantic Search 2009 workshop that the lack of standardized evaluation has become a serious bottleneck to further progress. One of the principal reasons for this lack of a standardized evaluation campaign is that the cost of creating a new and realistically sized “gold-standard” dataset and conducting annual evaluation campaigns was considered too high by the community.

In response to this conclusion, we elaborate on an approach for semantic search evaluation that is based on crowdsourcing. In this work we show that crowdsourcing-based evaluation is not only affordable but, in particular, satisfies the criteria of reliability and repeatability that are essential for a standardized evaluation framework. We organized public evaluation campaigns in the last two years at the SemSearch workshops and tested the proposed evaluation framework there. While the main ideas behind our crowdsourcing-based evaluation may be extended and generalized to other search tasks, the kind of semantic search we focused on in the last two campaigns was keyword search over structured data in RDF. We were motivated by the increasing need to locate particular information quickly, effectively, and in a way that is accessible to non-expert users. In particular, the semantic search task of interest is similar to the classic ad-hoc document retrieval (ADR) task, where the goal is to retrieve a ranked list of (text) documents from a fixed corpus in response to free-form keyword queries. In accordance with ADR, we define the semantic search task of ad-hoc object retrieval (AOR) [10], where the goal is to retrieve a ranked list of objects (also referred to as resources or entities) from a collection of RDF documents in response to free-form keyword queries. The unit of retrieval is thus the individual entity and not the RDF document, and so the task differs from classic textual information retrieval insofar as the primary unit is structured data rather than unstructured text. In particular, we focus on the tasks of entity search, which is about one specific named entity, and list search, which is about a set of entities.
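As a rough, purely illustrative picture of the AOR task (and not one of the systems evaluated in the challenge), the sketch below ranks entities from a triple collection against a free-form keyword query by counting query-term matches in the literals attached to each entity; the function name and data layout are assumptions made for this example.

    from collections import defaultdict

    def rank_entities(triples, query, k=10):
        """Naive ad-hoc object retrieval: rank entity URIs by the number of
        query keywords occurring in the literal values attached to them."""
        # Collect a textual "description" of each subject from its literal objects.
        text = defaultdict(list)
        for s, p, o in triples:
            if not o.startswith("http://"):   # crude literal test, enough for a sketch
                text[s].append(o.lower())
        terms = query.lower().split()
        scores = {
            entity: sum(" ".join(literals).count(t) for t in terms)
            for entity, literals in text.items()
        }
        ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
        # Return the top-k entities that matched at least one keyword.
        return [(e, s) for e, s in ranked if s > 0][:k]

    # Example: with the triples from the earlier sketch, the keyword query
    # "berners lee" retrieves the entity whose label literal contains those terms.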

This paper provides a comprehensive overview of the work on semantic search evaluation we did in the last three years and reports on recent progress on semantic search as observed in the evaluation campaigns in 2010 and 2011. It builds on the first work in this direction on AOR [10], which provided an evaluation protocol and tested a number of metrics for their stability and discriminating power. We instantiated this methodology by creating a standard set of queries and data (Section 3) and executed it using a crowdsourcing approach (Section 4). A thorough study on the reliability and repeatability of the framework has been presented in [11]. Lastly, we discuss the application of this framework and its concrete instantiation in the Semantic Search Challenges held in 2010 and 2011 (Section 5). Details on these campaigns can be found in [12], [13], respectively.
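Since crowdsourcing yields several relevance labels per query-result pair, these labels have to be collapsed into a single judgment before they can serve as relevance assessments; a simple and common choice is majority voting. The sketch below illustrates such an aggregation under that assumption, with a hypothetical data layout; the procedure actually used in the campaigns is described in Section 4.

    from collections import Counter

    def aggregate_judgments(worker_labels):
        """Collapse crowdsourced labels per (query, result) pair into one relevance
        judgment by majority vote; ties conservatively fall back to 'not relevant' (0).

        worker_labels maps (query_id, result_uri) -> list of labels, e.g. [1, 1, 0]
        for three workers, where 1 = relevant and 0 = not relevant."""
        qrels = {}
        for pair, labels in worker_labels.items():
            label, count = Counter(labels).most_common(1)[0]
            qrels[pair] = label if count > len(labels) / 2 else 0
        return qrels

    # Hypothetical example with three workers per pair:
    votes = {("q1", "http://example.org/resource/London"): [1, 1, 0]}
    print(aggregate_judgments(votes))   # -> the pair is judged relevant (1)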

Outline. This paper is organized as follows. In Section 2 we discuss different directions of related work. In Section 3, we present the evaluation framework and discuss its details and the underlying methodology. How the evaluation framework can be instantiated is detailed in Section 4, where we also examine its reliability and repeatability. In Section 5, we report on two evaluation campaigns, the Semantic Search Challenges held in 2010 and 2011, and show the applicability of our evaluation framework in the real world. Finally, we conclude in Section 6.

Section snippets

Related work

We discuss related work from the perspectives of crowdsourcing-based evaluation, semantic search evaluation and search evaluation campaigns.

Evaluation framework

In the Information Retrieval community, the Cranfield methodology [29], [30] is the de facto standard for the performance evaluation of IR systems. The standardized setting for retrieval experiments following this methodology consists of a document collection, a set of topics, and relevance assessments denoting which documents are (not) relevant for a given topic. We adapted this methodology to semantic search. In this section, we describe the data collection used in our evaluation framework and
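As a minimal illustration of how such relevance assessments translate into system scores in a Cranfield-style setup, the following sketch computes precision at rank k and average precision for a single topic; the data structures are simplified assumptions rather than the exact tooling used in the campaigns.

    def precision_at_k(ranked, relevant, k=10):
        """Fraction of the top-k returned results that are judged relevant."""
        return sum(1 for r in ranked[:k] if r in relevant) / k

    def average_precision(ranked, relevant):
        """Mean of the precision values at the ranks where relevant results occur."""
        hits, precisions = 0, []
        for rank, result in enumerate(ranked, start=1):
            if result in relevant:
                hits += 1
                precisions.append(hits / rank)
        return sum(precisions) / len(relevant) if relevant else 0.0

    # ranked: result identifiers returned by a system for one topic, best first;
    # relevant: the set of identifiers judged relevant for that topic.
    ranked = ["uri1", "uri2", "uri3"]
    relevant = {"uri1", "uri3"}
    print(precision_at_k(ranked, relevant, k=3), average_precision(ranked, relevant))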

Reliability and repeatability of the evaluation framework

Advances in information retrieval have long been driven by evaluation campaigns using standardized collections of datasets, query workloads, and, most importantly, result relevance judgments. TREC (Text REtrieval Conference) [33] is a forerunner in IR evaluation, but campaigns also take place in specialized forums like INEX (INitiative for the Evaluation of XML Retrieval) [22] and CLEF (Cross Language Evaluation Forum). The main premise of these campaigns is that a limited and controlled set

Semantic Search Challenge

We applied the evaluation framework in the Semantic Search Challenges 2010 and 2011, which were held as part of the Semantic Search Workshop at WWW2010 and WWW2011, respectively. The main difference between the challenges is that the 2011 challenge also comprised a List Search Track in addition to the Entity Search Track.

Conclusion

The topic of semantic search has attracted great interest from both industry and research, resulting in a variety of solutions that target different tasks. There is, however, no standardized evaluation framework that helps to monitor and stimulate progress in this field. We define the two standard tasks of entity search and entity list search, which are commonly supported by semantic search systems. Starting with these tasks, we run evaluation campaigns organized in the context of the series

Acknowledgments

We acknowledge Yahoo! Research for making available under license the ‘Yahoo! Search Query Log Tiny Sample, version 1.0’ dataset as part of the WebScope program. We also thank Evelyne Viegas and Microsoft Research for allowing a portion of the Microsoft Live Query Log to be used in the 2010 campaign. In particular, we would like to thank Amazon and the NAACL workshop on using the Mechanical Turk for providing the initial funding for the 2010 evaluation, and would like to thank the European

References (52)

  • T. Tran et al., Hermes: data web search on a pay-as-you-go integration infrastructure, J. Web Sem. (2009).
  • T. Tran et al., SemSearchPro—using semantics throughout the search process, J. Web Sem. (2011).
  • C. Bizer et al., The semantic web challenge, 2009, J. Web Sem. (2010).
  • J. Chu-Carroll, J.M. Prager, K. Czuba, D.A. Ferrucci, P.A. Duboué, Semantic search via XML fragments: a high-precision...
  • J. Chu-Carroll, J.M. Prager, An experimental study of the impact of information extraction accuracy on semantic search...
  • P. Castells et al., An adaptation of the vector-space model for ontology-based information retrieval, IEEE Trans. Knowl. Data Eng. (2007).
  • T. Tran, S. Bloehdorn, P. Cimiano, P. Haase, Expressive resource descriptions for ontology-based information retrieval,...
  • R. Guha et al., Semantic search.
  • E. Oren et al., Sindice.com: a document-oriented lookup index for open linked data, International Journal of Metadata, Semantics and Ontologies (IJMSO) (2008).
  • E.M. Voorhees, Query expansion using lexical–semantic relations, in: SIGIR, 1994, pp....
  • J. Pound, P. Mika, H. Zaragoza, Ad-hoc object ranking in the web of data, in: Proceedings of the WWW, Raleigh, United...
  • R. Blanco, H. Halpin, D.M. Herzig, P. Mika, J. Pound, H.S. Thompson, D.T. Tran, Repeatable and reliable search system...
  • H. Halpin, D.M. Herzig, P. Mika, R. Blanco, J. Pound, H.S. Thompson, D.T. Tran, Evaluating ad-hoc object retrieval, in:...
  • R. Blanco, H. Halpin, D.M. Herzig, P. Mika, J. Pound, H.S. Thompson, D.T. Tran, Entity search evaluation over...
  • K. Balog, A.P. de Vries, P. Serdyukov, P. Thomas, T. Westerveld, Overview of the TREC 2009 entity track, in: NIST...
  • O. Alonso et al., Crowdsourcing for relevance evaluation, SIGIR Forum (2008).
  • O. Alonso, R. Schenkel, M. Theobald, Crowdsourcing assessments for XML ranked retrieval, in: ECIR, 2010, pp....
  • C. Callison-Burch, Fast, cheap, and creative: evaluating translation quality using Amazon’s mechanical turk.
  • S. Nowak et al., How reliable are annotations via crowdsourcing: a study about inter-annotator agreement for multi-label image annotation.
  • J. Waitelonis et al., Whoknows? Evaluating linked data heuristics with a quiz that cleans up DBpedia, International Journal of Interactive Technology and Smart Education (ITSE) (2011).
  • L. Von Ahn et al., reCAPTCHA: human-based character recognition via web security measures, Science (2008).
  • D.M. Herzig, T. Tran, Heterogeneous web data search using relevance-based on the fly data integration, in: WWW, 2012,...
  • J. Kamps, S. Geva, A. Trotman, A. Woodley, M. Koolen, Overview of the INEX 2008 ad hoc track, in: Advances in Focused...
  • Y. Luo, W. Wang, X. Lin, SPARK: a keyword search engine on relational databases, in: ICDE, 2008, pp....
  • K. Balog, P. Serdyukov, A. de Vries, Overview of the TREC 2010 entity track, in: TREC 2010 Working Notes,...
  • G. Demartini et al., Overview of the INEX 2009 entity ranking track.