
2019 | Book

Information Retrieval Evaluation in a Changing World

Lessons Learned from 20 Years of CLEF


About this book

This volume celebrates the twentieth anniversary of CLEF (the Cross-Language Evaluation Forum for its first ten years, and the Conference and Labs of the Evaluation Forum since then) and traces its evolution over these first two decades. CLEF’s main mission is to promote research, innovation and development of information retrieval (IR) systems by anticipating trends in information management in order to stimulate advances in the field of IR system experimentation and evaluation.
The book is divided into six parts. Parts I and II provide background and context, with the first part explaining what is meant by experimental evaluation and the underlying theory, and describing how this has been interpreted in CLEF and in other internationally recognized evaluation initiatives. Part II presents research architectures and infrastructures that have been developed to manage experimental data and to provide evaluation services in CLEF and elsewhere. Parts III, IV and V represent the core of the book, presenting some of the most significant evaluation activities in CLEF, ranging from the early multilingual text processing exercises to the later, more sophisticated experiments on multimodal collections in diverse genres and media. In all cases, the focus is not only on describing “what has been achieved”, but above all on “what has been learnt”. The final part examines the impact CLEF has had on the research world and discusses current and future challenges, both academic and industrial, including the relevance of IR benchmarking in industrial settings.
Mainly intended for researchers in academia and industry, the book also offers useful insights and tips for practitioners working on the evaluation and performance of IR tools, and for graduate students specializing in information retrieval.

Table of Contents

Frontmatter

Experimental Evaluation and CLEF

Frontmatter
From Multilingual to Multimodal: The Evolution of CLEF over Two Decades
Abstract
This introductory chapter begins by explaining briefly what is intended by experimental evaluation in information retrieval, in order to provide the necessary background for the rest of this volume. The major international evaluation initiatives that have adopted and implemented this common framework in various ways are then presented, and their relationship to CLEF is indicated. The second part of the chapter details how the experimental evaluation paradigm has been implemented in CLEF by providing a brief overview of the main activities and results obtained over the last two decades. The aim has been to build a strong multidisciplinary research community and to create a sustainable technical framework that would not simply support but would also empower both research and development and evaluation activities, while meeting and at times anticipating the demands of a rapidly evolving information society.
Nicola Ferro, Carol Peters
The Evolution of Cranfield
Abstract
Evaluating search system effectiveness is a foundational hallmark of information retrieval research. Doing so requires infrastructure appropriate for the task at hand, which generally follows the Cranfield paradigm: test collections and associated evaluation measures. A primary purpose of Information Retrieval (IR) evaluation campaigns such as Text REtrieval Conference (TREC) and Conference and Labs of the Evaluation Forum (CLEF) is to build this infrastructure. The first TREC collections targeted the same task as the original Cranfield tests and used measures that were familiar to test collection users of the time. But as evaluation tasks have multiplied and diversified, test collection construction techniques and evaluation measure definitions have also been forced to evolve. This chapter examines how the Cranfield paradigm has been adapted to meet the changing requirements for search systems enabling it to continue to support a vibrant research community.
Ellen M. Voorhees
How to Run an Evaluation Task
With a Primary Focus on Ad Hoc Information Retrieval
Abstract
This chapter provides a general guideline for researchers who are planning to run a shared evaluation task for the first time, with a primary focus on simple ad hoc Information Retrieval (IR). That is, it is assumed that we have a static target document collection and a set of test topics (i.e., search requests), where participating systems are required to produce a ranked list of documents for each topic. The chapter provides a step-by-step description of what a task organiser team is expected to do. Section 1 discusses how to define the evaluation task; Sect. 2 how to publicise it and why it is important. Section 3 describes how to design and build test collections, as well as how inter-assessor agreement can be quantified. Section 4 explains how the results submitted by participants can be evaluated; examples of tools for computing evaluation measures and conducting statistical significance tests are provided (a minimal scoring-and-testing sketch follows this entry). Finally, Sect. 5 discusses how the fruits of running the task should be shared with the research community, how progress should be monitored, and how we may be able to improve the task design for the next round. N.B.: A prerequisite to running a successful task is that you have a good team of organisers who can collaborate effectively. Each team member should be well-motivated and committed to running the task. They should respond to emails in a timely manner and should be able to meet deadlines. Organisers should be well-organised!
Tetsuya Sakai
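
The evaluation step described in the abstract above (Sect. 4 of the chapter) boils down to computing a measure per topic for each submitted run and then testing whether the difference between two runs is statistically significant across topics. The following is a minimal, purely illustrative Python sketch, not the tooling used in CLEF or TREC; the topic identifiers, runs and qrels are hypothetical, and real campaigns typically rely on established tools such as trec_eval.

# Minimal sketch: per-topic Average Precision for two runs, followed by a
# paired t-test across topics. Hypothetical data; real campaigns typically
# use established tools such as trec_eval.
from scipy import stats

def average_precision(ranked_docs, relevant_docs):
    # ranked_docs: doc ids in rank order; relevant_docs: set of judged-relevant ids
    if not relevant_docs:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_docs, start=1):
        if doc_id in relevant_docs:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_docs)

def evaluate_run(run, qrels):
    # run: {topic_id: ranked list of doc ids}; qrels: {topic_id: set of relevant doc ids}
    return [average_precision(run.get(topic, []), rels) for topic, rels in sorted(qrels.items())]

# Hypothetical qrels and two hypothetical participant runs over the same topics.
qrels = {"T1": {"d1", "d3"}, "T2": {"d2"}}
run_a = {"T1": ["d1", "d2", "d3"], "T2": ["d2", "d4"]}
run_b = {"T1": ["d2", "d1", "d3"], "T2": ["d4", "d2"]}

ap_a, ap_b = evaluate_run(run_a, qrels), evaluate_run(run_b, qrels)
t_stat, p_value = stats.ttest_rel(ap_a, ap_b)  # paired comparison across topics
print(f"MAP A={sum(ap_a)/len(ap_a):.3f}  MAP B={sum(ap_b)/len(ap_b):.3f}  p={p_value:.3f}")

The point of the sketch is only the shape of the workflow: score each topic separately, then compare paired per-topic scores between systems rather than single aggregate numbers.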

Evaluation Infrastructures

Frontmatter
An Innovative Approach to Data Management and Curation of Experimental Data Generated Through IR Test Collections
Abstract
This paper describes the steps that led to the invention, design and development of the Distributed Information Retrieval Evaluation Campaign Tool (DIRECT) system for managing and accessing the data used and produced within experimental evaluation in Information Retrieval (IR). We present the context in which DIRECT was conceived, its conceptual model and its extension to make the data available on the Web as Linked Open Data (LOD) by enabling and enhancing their enrichment, discoverability and re-use. Finally, we discuss possible further evolutions of the system.
Maristella Agosti, Giorgio Maria Di Nunzio, Nicola Ferro, Gianmaria Silvello
TIRA Integrated Research Architecture
Abstract
Data and software are immaterial. Scientists in computer science hence have the unique chance to let other scientists easily reproduce their findings. Similarly, and with the same ease, the organization of shared tasks, i.e., the collaborative search for new algorithms given a predefined problem, is possible. Experience shows that the potential of reproducibility is hardly tapped in either case. Based on this observation, and driven by the ambitious goal to find the best solutions for certain problems in our research field, we have been developing the TIRA Integrated Research Architecture. Within TIRA, the reproducibility requirement got top priority right from the start. This chapter introduces the platform, its design requirements, its workflows from both the participants’ and the organizers’ perspectives, alongside a report on user experience and usage scenarios.
Martin Potthast, Tim Gollub, Matti Wiegmann, Benno Stein
EaaS: Evaluation-as-a-Service and Experiences from the VISCERAL Project
Abstract
The Cranfield paradigm has dominated information retrieval evaluation for almost 50 years. It has had a major impact on the entire domain of information retrieval since the 1960s and, compared with systematic evaluation in other domains, is very well developed and has helped greatly to advance the field. This chapter summarizes some of the shortcomings in information analysis evaluation and how recent techniques help to address these shortcomings. The term Evaluation-as-a-Service (EaaS) was defined at a workshop that combined several approaches that do not distribute the data but use source code submission, APIs or the cloud to run evaluation campaigns. The outcomes of a white paper and the experiences gained in the VISCERAL project on cloud-based evaluation for medical imaging are explained in this paper. In the conclusions, the next steps for research infrastructures are outlined, together with the impact that EaaS can have in this context in making research in data science more efficient and effective.
Henning Müller, Allan Hanbury

Multilingual and Multimedia Information Retrieval

Frontmatter
Lessons Learnt from Experiments on the Ad Hoc Multilingual Test Collections at CLEF
Abstract
This chapter describes the lessons learnt from the ad hoc track at CLEF in the years 2000 to 2009. This contribution focuses on Information Retrieval (IR) for languages other than English (monolingual IR), as well as bilingual IR (also termed “cross-lingual”; the request is written in one language and the searched collection in another), and multilingual IR (the information items are written in many different languages). During these years the ad hoc track mainly used newspaper test collections covering more than 15 languages. The authors themselves designed, implemented and evaluated IR tools for all these languages during those CLEF campaigns. Based on our own experience and the lessons reported by other participants over these years, we are able to describe the most important challenges when designing an IR system for a new language. When dealing with bilingual IR, our experiments indicate that the critical point is the translation process. However, current online translation systems tend to offer rather effective translation from one language to another, especially when one of these languages is English. To address the multilingual IR problem, different IR architectures are possible. For the simplest approach, based on query translation of individual language pairs, the crucial component is the merging of the intermediate bilingual results (a simple merging strategy is sketched after this entry). When considering both document and query translation, the complexity of the whole system clearly represents a main issue.
Jacques Savoy, Martin Braschler
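
The abstract above singles out the merging of intermediate bilingual result lists as the crucial component of the query-translation architecture for multilingual IR. As a hedged illustration only (not the chapter authors’ own method; the collection names, document ids and scores are hypothetical), the sketch below applies min-max score normalisation to each per-language result list before sorting the union into a single multilingual ranking, since raw retrieval scores are generally not comparable across collections and retrieval models.

# Illustrative sketch: merge per-language bilingual result lists into one
# multilingual ranking using min-max score normalisation. Hypothetical
# document ids and scores; other strategies (raw-score merging, round-robin
# interleaving, learned merging) are possible.

def min_max_normalise(results):
    # results: list of (doc_id, raw_score) pairs for one language collection
    scores = [score for _, score in results]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [(doc_id, 1.0) for doc_id, _ in results]
    return [(doc_id, (score - lo) / (hi - lo)) for doc_id, score in results]

def merge_bilingual_runs(runs_per_language, k=10):
    # runs_per_language: {language: [(doc_id, raw_score), ...]}
    merged = []
    for language, results in runs_per_language.items():
        merged.extend((doc_id, norm, language) for doc_id, norm in min_max_normalise(results))
    merged.sort(key=lambda item: item[1], reverse=True)
    return merged[:k]

# Hypothetical bilingual runs for one topic: raw scores come from different
# collections and ranking functions, so they are normalised before merging.
runs = {
    "de": [("de_012", 14.2), ("de_007", 11.9), ("de_444", 3.1)],
    "fr": [("fr_101", 0.92), ("fr_055", 0.61), ("fr_009", 0.12)],
}
for doc_id, score, language in merge_bilingual_runs(runs, k=5):
    print(f"{language}\t{doc_id}\t{score:.2f}")

Round-robin interleaving and raw-score merging are simpler alternatives; their relative effectiveness hinges on how comparable the per-language scores are.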
The Challenges of Language Variation in Information Access
Abstract
This chapter gives an overview of how human languages differ from each other and how those differences are relevant to the development of human language understanding technology for the purposes of information access. It formulates the requirements that information access technology poses (and might pose) for language technology. We also discuss a number of relevant approaches and current challenges in meeting those requirements.
Jussi Karlgren, Turid Hedlund, Kalervo Järvelin, Heikki Keskustalo, Kimmo Kettunen
Multi-Lingual Retrieval of Pictures in ImageCLEF
Abstract
CLEF first launched a multi-lingual visual information retrieval task in 2003 as part of the ImageCLEF track. Several such tasks subsequently followed, encompassing both the medical and non-medical domains. The main aim of such ad hoc image retrieval tasks was to investigate the effectiveness of retrieval approaches that exploit textual and visual evidence in the context of large and heterogeneous collections of images that are searched for by users with diverse information needs. This chapter presents an overview of the image retrieval activities within ImageCLEF from 2003 to 2011, focusing on the non-medical domain and, in particular, on the photographic retrieval and Wikipedia image retrieval tasks. We review the available test collections built in the context of these activities, present the main evaluation results, and summarise the contributions and lessons learned.
Paul Clough, Theodora Tsikrika
Experiences from the ImageCLEF Medical Retrieval and Annotation Tasks
Abstract
The medical tasks in ImageCLEF have been run every year from 2004 to 2018, and many different tasks and data sets have been used over these years. The created resources are being used by many researchers well beyond the actual evaluation campaigns and allow the performance of many techniques to be compared on the same grounds and in a reproducible way. Many of the larger data sets are from the medical literature, as such images are easier to obtain and to share than clinical data, which was used in a few smaller ImageCLEF challenges that are specifically marked with the disease type and anatomic region. This chapter describes the main results of the various tasks over the years, including data, participants, types of tasks evaluated and also the lessons learned in organizing such tasks for the scientific community.
Henning Müller, Jayashree Kalpathy-Cramer, Alba García Seco de Herrera
Automatic Image Annotation at ImageCLEF
Abstract
Automatic image annotation is the task of automatically assigning some form of semantic label to images, such as words, phrases or sentences describing the objects, attributes, actions, and scenes depicted in the image. In this chapter, we present an overview of the various automatic image annotation tasks that were organized in conjunction with the ImageCLEF track at CLEF between 2009 and 2016. Throughout these eight years, the image annotation tasks evolved from annotating Flickr photos by learning from clean data to annotating web images by learning from large-scale noisy web data. The tasks are divided into three distinct phases, and this chapter provides a discussion of each of these phases. We also compare and contrast other related benchmarking challenges, and provide some insights into the future of automatic image annotation.
Josiah Wang, Andrew Gilbert, Bart Thomee, Mauricio Villegas
Image Retrieval Evaluation in Specific Domains
Abstract
Image retrieval was, and still is, a hot topic in research. It comes with many challenges that have changed over the years with the emergence of more advanced methods for analysis and the enormous growth of images created, shared and consumed. This chapter gives an overview of domain-specific image retrieval evaluation approaches, which were part of the ImageCLEF evaluation campaign. Specifically, the robot vision, photo retrieval, scalable image annotation and lifelogging tasks are presented. The ImageCLEF medical activity is described in a separate chapter in this volume. Some of the presented tasks have been available for several years, whereas others are quite new (like lifelogging). This mix of new and old topics has been chosen to give the reader an idea about the development and trends within image retrieval. For each of the tasks, the datasets, participants, techniques used and lessons learned are presented and discussed, leading to a comprehensive summary.
Luca Piras, Barbara Caputo, Duc-Tien Dang-Nguyen, Michael Riegler, Pål Halvorsen
About Sound and Vision: CLEF Beyond Text Retrieval Tasks
Abstract
CLEF was initiated with the intention of providing a catalyst to research in Cross-Language Information Retrieval (CLIR) and Multilingual Information Retrieval (MIR). Focusing principally on European languages, it initially provided CLIR benchmark tasks to the research community within an annual cycle of task design, conduct and reporting. While the early focus was on textual data, the emergence of technologies to enable the collection, archiving and content processing of multimedia content led to several initiatives which sought to address search for spoken and visual content. Similar to the interest in multilingual search for text, interest arose in working multilingually with multimedia content. To support research in these areas, CLEF introduced a number of tasks in multilingual search for multimedia content. While investigation of image retrieval has formed the focus of the ImageCLEF task over many years, this chapter reviews the tasks examining speech and video retrieval carried out within CLEF during its first 10 years, and overviews related work reported at other information retrieval benchmarks.
Gareth J. F. Jones

Retrieval in New Domains

Frontmatter
The Scholarly Impact and Strategic Intent of CLEF eHealth Labs from 2012 to 2017
Abstract
Since 2012, the CLEF eHealth initiative has aimed to gather researchers working on health text analytics and to provide them with annual shared tasks. This chapter reports on the measurement of its scholarly impact in 2012–2017 and describes its future objectives. The large number of submissions and citations demonstrates the substantial community interest in the tasks and their resources. Consequently, the initiative continues to run in 2018 and 2019 with the goal of supporting patients, their families, clinical staff, health scientists, and healthcare policy makers in accessing and authoring health information in a multilingual setting.
Hanna Suominen, Liadh Kelly, Lorraine Goeuriot
Multilingual Patent Text Retrieval Evaluation: CLEF–IP
Abstract
The CLEF–IP evaluation lab ran between 2009 and 2013 with a twofold purpose: (a) to encourage research in the area of patent retrieval, with a focus on cross-language retrieval, and (b) to provide a large and clean data set of patent-related data, in the three main European languages, for experimentation. In its first year, CLEF–IP organized only one task, a text retrieval task that modelled the “Search for Prior Art” done by experts at patent offices. In the following years the types of CLEF–IP tasks broadened to include patent text classification, patent image retrieval and classification, and (formal) structure recognition. With each task, the test collection was extended to accommodate the additional tasks. In this chapter we give an overview of the evaluation tasks dealing with the textual content of the patents. The Intellectual Property (IP) domain is one where specific expertise is critical; implementing Information Retrieval (IR) approaches to support some of its tasks cannot be done without this domain know-how. Even when such know-how is at hand, retrieval results, in general, do not come close to the expectations of patent experts.
Florina Piroi, Allan Hanbury
Biodiversity Information Retrieval Through Large Scale Content-Based Identification: A Long-Term Evaluation
Abstract
Identifying and naming living plants or animals is usually impossible for the general public and often a difficult task for professionals and naturalists. Bridging this gap is a key challenge towards enabling effective biodiversity information retrieval systems. This taxonomic gap was actually already identified as one of the main ecological challenges to be solved during the Rio de Janeiro United Nations “Earth Summit” in 1992. Since 2011, the LifeCLEF challenges conducted in the context of the CLEF evaluation forum have been boosting and evaluating the advances in this domain. Data collections with an unprecedented volume and diversity have been shared with the scientific community to allow repeatable and long-term experiments. This paper describes the methodology of the evaluation campaigns conducted and provides a synthesis of the main results and lessons learned over the years.
Alexis Joly, Hervé Goëau, Hervé Glotin, Concetto Spampinato, Pierre Bonnet, Willem-Pier Vellinga, Jean-Christophe Lombardo, Robert Planqué, Simone Palazzo, Henning Müller
From XML Retrieval to Semantic Search and Beyond
The INEX, SBS, and MC2 Labs of CLEF 2012–2018
Abstract
INEX ran as an independent evaluation forum for 10 years before it teamed up with CLEF in 2012. Even before 2012 there was considerable collaboration between INEX and CLEF, and these collaborations increased in intensity when CLEF moved beyond its traditional cross-lingual focus in 2009/2010, shifting to include all of experimental IR. This led to the merger of CLEF and INEX, and effectively to the inclusion of INEX as a large track or lab within CLEF in 2012. This chapter details the efforts of the INEX lab in CLEF (2012–2014), as well as the ongoing activities as separate labs under the labels Social Book Search (2015–2016) and Microblog Contextualization (2016–2018).
Jaap Kamps, Marijn Koolen, Shlomo Geva, Ralf Schenkel, Eric SanJuan, Toine Bogers

Beyond Retrieval

Frontmatter
Results and Lessons of the Question Answering Track at CLEF
Abstract
The Question Answering track at CLEF ran for 13 years, from 2003 until 2015. Over these years, many different tasks, resources and evaluation methodologies were developed. We divide the CLEF Question Answering campaigns into four eras: (1) ungrouped, mainly factoid, questions asked against monolingual newspapers (2003–2006); (2) grouped questions asked against newspapers and Wikipedias (2007–2008); (3) ungrouped questions against multilingual parallel-aligned EU legislative documents (2009–2010); and (4) questions about a single document using a related document collection as background information (2011–2015). We provide the description and the main results for each of these eras, together with the pilot exercises and other Question Answering tasks that ran in CLEF. Finally, we conclude with some of the lessons learnt over these years.
Anselmo Peñas, Álvaro Rodrigo, Bernardo Magnini, Pamela Forner, Eduard Hovy, Richard Sutcliffe, Danilo Giampiccolo
Evolution of the PAN Lab on Digital Text Forensics
Abstract
PAN is a networking initiative for digital text forensics, where researchers and practitioners study technologies for text analysis with regard to originality, authorship, and trustworthiness. The practical importance of such technologies is obvious for law enforcement, cyber-security, and marketing, yet the general public also needs to be aware of their capabilities in order to make informed decisions about them. This is particularly true since almost all of these technologies are still in their infancy, and active research is required to push them forward. Hence, PAN focuses on the evaluation of selected tasks from digital text forensics in order to develop large-scale, standardized benchmarks and to assess the state of the art. In this chapter we present the evolution of three shared tasks: plagiarism detection, author identification, and author profiling.
Paolo Rosso, Martin Potthast, Benno Stein, Efstathios Stamatatos, Francisco Rangel, Walter Daelemans
RepLab: An Evaluation Campaign for Online Monitoring Systems
Abstract
Over a period of three years, RepLab was a CLEF initiative in which computer scientists and online reputation experts worked together to identify and formalize the computational challenges in the area of online reputation monitoring. Two main results emerged from RepLab: a community of researchers engaged in the problem, and an extensive Twitter test collection comprising more than half a million expert annotations, which cover many relevant tasks in the field of online reputation: named entity resolution, topic detection and tracking, reputational alert identification, reputational polarity, author profiling, opinion maker identification and reputational dimension classification. It has probably been one of the CLEF labs with the largest set of expert annotations provided to participants in a single year, and one of the labs where the target user community has been most actively engaged in the evaluation campaign. Here we summarize the design and results of the RepLab campaigns, and also report on research that has built on RepLab datasets after completion of the three-year competition cycle.
Jorge Carrillo-de-Albornoz, Julio Gonzalo, Enrique Amigó
Continuous Evaluation of Large-Scale Information Access Systems: A Case for Living Labs
Abstract
A/B testing is increasingly being adopted for the evaluation of commercial information access systems with a large user base, since it provides the advantage of observing the efficiency and effectiveness of information access systems under real conditions. Unfortunately, unless university-based researchers closely collaborate with industry or develop their own infrastructure or user base, they cannot validate their ideas in live settings with real users. Without online testing opportunities open to the research community, academic researchers are unable to employ online evaluation on a larger scale. This means that they do not get feedback for their ideas and cannot advance their research further. Businesses, on the other hand, miss the opportunity to achieve higher customer satisfaction through improved systems. In addition, users miss the chance to benefit from an improved information access system. In this chapter, we introduce two evaluation initiatives at CLEF, NewsREEL and Living Labs for IR (LL4IR), that aim to address this growing “evaluation gap” between academia and industry. We explain the challenges and discuss the experiences of organizing these living labs.
Frank Hopfgartner, Krisztian Balog, Andreas Lommatzsch, Liadh Kelly, Benjamin Kille, Anne Schuth, Martha Larson

Impact and Future Challenges

Frontmatter
The Scholarly Impact of CLEF 2010–2017
A Google Scholar Analysis of CLEF Proceedings and Working Notes
Abstract
This chapter assesses the scholarly impact of the CLEF evaluation campaign by performing a bibliometric analysis of the citations of the CLEF 2010–2017 papers collected through Google Scholar. The analysis extends an earlier 2013 study by Tsikrika et al. of the CLEF Proceedings for the period 2000–2009 and compares the impact of the first half of CLEF with the second. It also extends the analysis by including the CLEF Working Notes, a less formal but important part of the CLEF oeuvre. Results show that, despite the different nature of the peer-reviewed CLEF Proceedings papers and the less formal and much more numerous Working Notes papers, both types of publications have high citation impact. In particular, overview papers from the various labs and tasks in CLEF attract large numbers of citations in both the Proceedings and the Working Notes. A significant proportion of the total number of citations appears to come from outside CLEF: there are simply not enough CLEF papers every year to account for that many citations. In conclusion, the analysis of the productivity and citation impact of CLEF in the period 2010–2017 shows that CLEF is a very strong and vibrant initiative that managed a major change of format in 2009/2010 and continues to produce relevant research, datasets and tools.
Birger Larsen
Reproducibility and Validity in CLEF
Abstract
In this paper, we investigate CLEF’s contribution to the reproducibility of IR experiments. After discussing the concepts of reproducibility and validity, we show that CLEF has not only produced test collections that can be re-used by other researchers, but has also undertaken various efforts to enable reproducibility.
Norbert Fuhr
Visual Analytics and IR Experimental Evaluation
Abstract
We investigate the application of Visual Analytics (VA) techniques to the exploration and interpretation of Information Retrieval (IR) experimental data. We first briefly introduce the main concepts of VA and then present some relevant examples of VA prototypes developed for better investigating IR evaluation data. Finally, we conclude with a discussion of current trends and future challenges in this area.
Nicola Ferro, Giuseppe Santucci
Adopting Systematic Evaluation Benchmarks in Operational Settings
Abstract
Evaluation of information systems in commercial and industrial settings differs from academic evaluation of methodology in important ways. Those differences stem from differing organisational priorities between practice and research. Some of those priorities can be adjusted; others must be taken into account if evaluation is to be included in an operational development pipeline.
Jussi Karlgren
Backmatter
Metadata
Title
Information Retrieval Evaluation in a Changing World
Editors
Nicola Ferro
Carol Peters
Copyright Year
2019
Electronic ISBN
978-3-030-22948-1
Print ISBN
978-3-030-22947-4
DOI
https://doi.org/10.1007/978-3-030-22948-1