About this book

This book constitutes the refereed proceedings of the 11th International Conference of the CLEF Association, CLEF 2020, held in Thessaloniki, Greece, in September 2020.*

The conference has a clear focus on experimental information retrieval with special attention to the challenges of multimodality, multilinguality, and interactive search ranging from unstructured to semi-structured and structured data.

The 5 full papers and 2 short papers presented in this volume were carefully reviewed and selected from 9 submissions. This year, the contributions addressed the following challenges: a large-scale evaluation of translation effects in academic search, advancement of assessor-driven aggregation methods for efficient relevance assessments, and development of a new test dataset.

In addition to this, the volume presents 7 “best of the labs” papers which were reviewed as full paper submissions with the same review criteria. The 12 lab overview papers were accepted out of 15 submissions and represent scientific challenges based on new data sets and real world problems in multimodal and multilingual information access.

* The conference was held virtually due to the COVID-19 pandemic.

Table of Contents


Full Papers


SberQuAD – Russian Reading Comprehension Dataset: Description and Analysis

The paper presents SberQuAD – a large Russian reading comprehension (RC) dataset created similarly to English SQuAD. SberQuAD contains about 50K question-paragraph-answer triples and is seven times larger compared to the next competitor. We provide its description, thorough analysis, and baseline experimental results. We scrutinized various aspects of the dataset that can have impact on the task performance: question/paragraph similarity, misspellings in questions, answer structure, and question types. We applied five popular RC models to SberQuAD and analyzed their performance. We believe our work makes an important contribution to research in multilingual question answering.
Pavel Efimov, Andrey Chertok, Leonid Boytsov, Pavel Braslavski

s-AWARE: Supervised Measure-Based Methods for Crowd-Assessors Combination

Ground-truth creation is one of the most demanding activities in terms of time, effort, and resources needed for creating an experimental collection. For this reason, crowdsourcing has emerged as a viable option to reduce the costs and time invested in it.
An effective assessor merging methodology is crucial to guarantee a good ground-truth quality. The classical approach involves the aggregation of labels from multiple assessors using some voting and/or classification methods. Recently, Assessor-driven Weighted Averages for Retrieval Evaluation (AWARE) has been proposed as an unsupervised alternative, which optimizes the final evaluation measure, rather than the labels, computed from multiple judgments.
In this paper, we propose s-AWARE, a supervised version of AWARE. We tested s-AWARE against a range of state-of-the-art methods and the unsupervised AWARE on several TREC collections. We analysed how the performance of these methods changes by increasing assessors’ judgement sparsity, highlighting that s-AWARE is an effective approach in a real scenario.
Marco Ferrante, Nicola Ferro, Luca Piazzon
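
The idea of combining measures rather than labels can be illustrated with a minimal sketch. It assumes that each assessor supplies a set of documents judged relevant, that the evaluation measure is precision at k, and that the per-assessor weights are already given; the function names and the weights are hypothetical, and how AWARE or s-AWARE actually derive the weights is not modeled here.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k ranked documents that an assessor judged relevant."""
    top = ranking[:k]
    return sum(1 for doc in top if doc in relevant) / k


def weighted_measure(ranking, assessor_judgments, weights, k=5):
    """Weighted average of the measure computed separately per assessor.

    The weights are placeholders (an assumption of this sketch); the abstract
    does not state how they are estimated.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    scores = [precision_at_k(ranking, rel, k) for rel in assessor_judgments]
    return sum(w * s for w, s in zip(weights, scores)) / total


ranking = ["d1", "d2", "d3", "d4"]
judgments = [{"d1", "d2"}, {"d1"}]   # two crowd assessors
weights = [3.0, 1.0]                 # hypothetical assessor weights
score = weighted_measure(ranking, judgments, weights, k=2)
```

In this toy case the first assessor yields a precision of 1.0 and the second 0.5, so the more heavily weighted assessor dominates the combined score.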

Query or Document Translation for Academic Search – What’s the Real Difference?

We compare query and document translation from and to English, French, German and Spanish for multilingual retrieval in an academic search portal: PubPsych. Both translation approaches improve the retrieval performance of the system with document translation providing better results. Performance inversely correlates with the amount of available original language documents. The more documents already available in a language, the fewer improvements can be observed. Retrieval performance with English as a source language does not improve with translation as most documents already contained English-language content in our text collection. The large-scale evaluation study is based on a corpus of more than 1M metadata documents and 50 real queries taken from the query log files of the portal.
Vivien Petras, Andreas Lüschow, Roland Ramthun, Juliane Stiller, Cristina España-Bonet, Sophie Henning

Question Answering When Knowledge Bases are Incomplete

While systems for question answering over knowledge bases (KB) continue to progress, real world usage requires systems that are robust to incomplete KBs. Dependence on the closed world assumption is highly problematic, as in many practical cases the information is constantly evolving and KBs cannot keep up. In this paper we formalize a typology of missing information in knowledge bases, and present a dataset based on the Spider KB question answering dataset, where we deliberately remove information from several knowledge bases, in this case implemented as relational databases (the dataset and the code to reproduce experiments are available at https://github.com/camillepradel/IDK). Our dataset, called IDK (Incomplete Data in Knowledge base question answering), makes it possible to study how to detect and recover from such cases. The analysis shows that simple baselines fail to detect most of the unanswerable questions.
Camille Pradel, Damien Sileo, Álvaro Rodrigo, Anselmo Peñas, Eneko Agirre

2AIRTC: The Amharic Adhoc Information Retrieval Test Collection

Evaluation is highly important for designing, developing, and maintaining information retrieval (IR) systems. The IR community has developed shared tasks where evaluation frameworks, evaluation measures and test collections have been developed for different languages. Although Amharic is the official language of Ethiopia, which currently has an estimated population of over 110 million, it is one of the under-resourced languages and there is no Amharic adhoc IR test collection to date. In this paper, we promote the monolingual Amharic IR test collection that we built for the IR community. Following the framework of the Cranfield project and TREC, the collection that we named 2AIRTC consists of 12,583 documents, 240 topics and the corresponding relevance judgments.
Tilahun Yeshambel, Josiane Mothe, Yaregal Assabie

Short Papers


The Curious Case of Session Identification

Dividing interaction logs into meaningful segments has been a core problem in supporting users in search tasks for over 20 years. Research has brought up many different definitions: from simplistic mechanical sessions to complex search missions spanning multiple days. Having meaningful segments is essential for many tasks depending on context, yet many research projects over the last years still rely on early proposals. This position paper gives a quick overview of session identification development and questions the widespread use of the industry standard.
Florian Dietz
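
A "mechanical" session of the kind mentioned above can be sketched as a split on inactivity gaps. This is only an illustration: the abstract does not name the industry standard it questions, so the 30-minute timeout and the function name below are assumptions of this sketch.

```python
from datetime import datetime, timedelta


def split_sessions(timestamps, timeout=timedelta(minutes=30)):
    """Group interaction timestamps into sessions separated by inactivity.

    A new session starts whenever the gap to the previous interaction
    exceeds `timeout` (30 minutes here, an assumed value).
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= timeout:
            sessions[-1].append(ts)   # still within the current session
        else:
            sessions.append([ts])     # gap too large: open a new session
    return sessions


start = datetime(2020, 9, 22, 9, 0)
log = [start + timedelta(minutes=m) for m in (0, 10, 50, 55)]
sessions = split_sessions(log)
```

The 40-minute gap between the second and third interaction splits this toy log into two sessions, regardless of whether the user was still working on the same task, which is the kind of limitation that more complex definitions such as search missions try to address.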

Argument Retrieval from Web

We are well beyond the days of expecting search engines to help us find documents containing the answer to a question or information about a query. We expect a search engine to help us in the decision-making process. The argument retrieval task in the Touché track at CLEF 2020 has been defined to address this problem. The user is looking for information about several alternatives to make a choice between them. The search engine should retrieve opinionated documents containing comparisons between the alternatives rather than documents about one option or documents including personal opinions or no suggestion at all. In this paper, we discuss argument retrieval from web documents. In order to retrieve argumentative documents from the web, we use three features (PageRank scores, domains, argumentative classifier) and try to strike a balance between them. We evaluate the method based on three dimensions: relevance, argumentativeness, and trustworthiness. Since the labeled data and final results for the Touché track have not been released yet, the evaluation has been done by manually labeling documents for 5 queries.
Mahsa S. Shahshahani, Jaap Kamps
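
One simple way to balance three such features is a weighted linear combination used for re-ranking. The sketch below is not the authors' method: it assumes each feature has already been normalized to the range 0 to 1, and the field names, weights, and example values are all hypothetical.

```python
def combined_score(doc, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Linear mix of PageRank, domain, and argumentativeness features.

    All three inputs are assumed to be normalized to [0, 1]; the weights
    control the balance between them and are illustrative only.
    """
    w_pagerank, w_domain, w_argument = weights
    return (w_pagerank * doc["pagerank"]
            + w_domain * doc["domain"]
            + w_argument * doc["argumentative"])


def rerank(docs, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Sort documents by the combined score, best first."""
    return sorted(docs, key=lambda d: combined_score(d, weights), reverse=True)


docs = [
    {"id": "a", "pagerank": 0.9, "domain": 0.2, "argumentative": 0.1},
    {"id": "b", "pagerank": 0.3, "domain": 0.8, "argumentative": 0.9},
]
balanced = [d["id"] for d in rerank(docs)]
pagerank_only = [d["id"] for d in rerank(docs, weights=(1.0, 0.0, 0.0))]
```

With equal weights the argumentative document "b" ranks first, while relying on PageRank alone puts "a" first, which shows why the balance between the features matters.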

Best of CLEF 2019 Labs


File Forgery Detection Using a Weighted Rule-Based System

Society is becoming increasingly dependent on digital data sources. However, our trust in the sources and their contents is only ensured if we can also rely on robust methods that prevent fraudulent forgery. As digital forensic experts are continually dealing with the detection of forged data, new fraudulent approaches are emerging, making it difficult to use automated systems. This security breach is also a good challenge that motivates researchers to explore computational solutions to efficiently address the problem. This paper describes a weighted rule-based system for file forgery detection. The system was developed and validated in several tasks of the ImageCLEFsecurity 2019 track challenge, where promising results were obtained.
João Rafael Almeida, Olga Fajarda, José Luís Oliveira

Protest Event Detection: When Task-Specific Models Outperform an Event-Driven Method

2019 has been characterized by worldwide waves of protests. Each country’s protests are different but there appear to be common factors. In this paper we present two approaches for identifying protest events in news in English. Our goal is to provide political science and discourse analysis scholars with tools that may facilitate the understanding of this on-going phenomenon. We test our approaches against the ProtestNews Lab 2019 benchmark that challenges systems to perform unsupervised domain adaptation on protest events on three sub-tasks: document classification, sentence classification, and event extraction. Results indicate that developing dedicated architectures and models for each task outperforms simpler solutions based on the propagation of labels from lexical items to documents. Furthermore, we complete the description of our systems with a detailed data analysis to shed light on the limits of the methods.
Angelo Basile, Tommaso Caselli

A Study on a Stopping Strategy for Systematic Reviews Based on a Distributed Effort Approach

Systematic reviews are scientific investigations that use strategies to include a comprehensive search of all potentially relevant articles and the use of explicit, reproducible criteria in the selection of articles for review. As time and resources are limited for compiling a systematic review, limits to the search are needed. In this paper, we describe the stopping strategy that we have designed and refined over three years of participation in the CLEF eHealth Technology Assisted Review Task. In particular, we present a comparison of a Continuous Active Learning approach that uses either a fixed amount or a variable amount of resources according to the size of the pool. The results show that our approach performs on average much better than any other participant in the CLEF 2019 eHealth TAR task. Nevertheless, a failure analysis allows us to understand the weak points of this approach and possible future directions.
Giorgio Maria Di Nunzio

Fact Check-Worthiness Detection with Contrastive Ranking

Check-worthiness detection aims at predicting which sentences should be prioritized for fact-checking. A typical use is to rank sentences in political debates and speeches according to their degree of check-worthiness. We present the first direct optimization of sentence ranking for check-worthiness; in contrast, all previous work has solely used standard classification based loss functions. We present a recurrent neural network model that learns a sentence encoding, from which a check-worthiness score is predicted. The model is trained by jointly optimizing a binary cross entropy loss, as well as a ranking based pairwise hinge loss. We obtain sentence pairs for training through contrastive sampling, where for each sentence we find the top most semantically similar sentences with opposite label. Through a comparison to existing state-of-the-art check-worthiness methods, we find that our approach improves the MAP score by 11%.
Casper Hansen, Christian Hansen, Jakob Grue Simonsen, Christina Lioma
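
The joint objective described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the authors' implementation: the model's outputs are taken to be probabilities, the hinge loss is applied to those probabilities, and the margin and the weighting factor `alpha` are hypothetical parameters.

```python
import math


def binary_cross_entropy(p, y):
    """Cross entropy between predicted probability p and binary label y."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)   # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def pairwise_hinge(score_pos, score_neg, margin=1.0):
    """Penalize pairs where the check-worthy sentence is not ranked
    at least `margin` above its contrastive counterpart."""
    return max(0.0, margin - (score_pos - score_neg))


def joint_loss(pairs, alpha=1.0, margin=1.0):
    """Average of classification loss plus alpha times ranking loss.

    `pairs` holds (p_pos, p_neg): predictions for a check-worthy sentence
    and for a similar sentence with the opposite label.
    """
    total = 0.0
    for p_pos, p_neg in pairs:
        bce = binary_cross_entropy(p_pos, 1) + binary_cross_entropy(p_neg, 0)
        total += bce + alpha * pairwise_hinge(p_pos, p_neg, margin)
    return total / len(pairs)


well_ordered = joint_loss([(0.9, 0.1)], margin=0.5)
mis_ordered = joint_loss([(0.1, 0.9)], margin=0.5)
```

A correctly ordered pair with a sufficient gap incurs only the classification term, whereas a mis-ordered pair is penalized by both terms.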

Tuberculosis CT Image Analysis Using Image Features Extracted by 3D Autoencoder

This paper presents an approach for the automated analysis of 3D Computed Tomography (CT) images based on the utilization of descriptors extracted using 3D deep convolutional autoencoder (AEC [8]) networks. Both the common flow of AEC model application and a set of techniques for overcoming the lack of training samples are presented in this work. The described approach was used for accomplishing the two subtasks of the ImageCLEF 2019: Tuberculosis competition [2, 5] and achieved the 2nd best performance in the TB Severity Scoring subtask and the 6th best performance in the TB CT Report subtask.
Siarhei Kazlouski

Twitter User Profiling: Bot and Gender Identification

Notebook for PAN at CLEF 2019
Social bots are automated programs that generate a significant amount of social media content. This content can be harmful, as it may target a certain audience to influence opinions, often politically motivated, or to promote individuals to appear more popular than they really are. We proposed a set of feature extraction and transformation methods in conjunction with ensemble classifiers for the PAN 2019 Author Profiling task. For the bot identification subtask we used user behaviour fingerprint and statistical diversity measures, while for the gender identification subtask we used a set of text statistics, as well as syntactic information and raw words.
Dijana Kosmajac, Vlado Keselj

Medical Image Tagging by Deep Learning and Retrieval

Radiologists and other qualified physicians need to examine and interpret large numbers of medical images daily. Systems that would help them spot and report abnormalities in medical images could speed up diagnostic workflows. Systems that would help exploit past diagnoses made by highly skilled physicians could also benefit their more junior colleagues. A task that systems can perform towards this end is medical image classification, which assigns medical concepts to images. This task, called Concept Detection, was part of the ImageCLEF 2019 competition. We describe the methods we implemented and submitted to the Concept Detection 2019 task, where we achieved the best performance with a deep learning method we call ConceptCXN. We also show that retrieval-based methods can perform very well in this task, when combined with deep learning image encoders. Finally, we report additional post-competition experiments we performed to shed more light on the performance of our best systems. Our systems can be installed through PyPI as part of the BioCaption package.
Vasiliki Kougia, John Pavlopoulos, Ion Androutsopoulos

CLEF 2020 Lab Overviews


Overview of ARQMath 2020: CLEF Lab on Answer Retrieval for Questions on Math

The ARQMath Lab at CLEF considers finding answers to new mathematical questions among posted answers on a community question answering site (Math Stack Exchange). Queries are question posts held out from the searched collection, each containing both text and at least one formula. This is a challenging task, as both math and text may be needed to find relevant answer posts. ARQMath also includes a formula retrieval sub-task: individual formulas from question posts are used to locate formulae in earlier question and answer posts, with relevance determined considering the context of the post from which a query formula is taken, and the posts in which retrieved formulae appear.
Richard Zanibbi, Douglas W. Oard, Anurag Agarwal, Behrooz Mansouri

Overview of BioASQ 2020: The Eighth BioASQ Challenge on Large-Scale Biomedical Semantic Indexing and Question Answering

In this paper, we present an overview of the eighth edition of the BioASQ challenge, which ran as a lab in the Conference and Labs of the Evaluation Forum (CLEF) 2020. BioASQ is a series of challenges aiming at the promotion of systems and methodologies for large-scale biomedical semantic indexing and question answering. To this end, shared tasks have been organized yearly since 2012, where different teams develop systems that compete on the same demanding benchmark datasets that represent the real information needs of experts in the biomedical domain. This year, the challenge has been extended with the introduction of a new task on medical semantic indexing in Spanish. In total, 34 teams with more than 100 systems participated in the three tasks of the challenge. As in previous years, the results of the evaluation reveal that the top-performing systems managed to outperform the strong baselines, which suggests that state-of-the-art systems keep pushing the frontier of research through continuous improvements.
Anastasios Nentidis, Anastasia Krithara, Konstantinos Bougiatiotis, Martin Krallinger, Carlos Rodriguez-Penagos, Marta Villegas, Georgios Paliouras

Overview of CheckThat! 2020: Automatic Identification and Verification of Claims in Social Media

We present an overview of the third edition of the CheckThat! Lab at CLEF 2020. The lab featured five tasks in two different languages: English and Arabic. The first four tasks compose the full pipeline of claim verification in social media: Task 1 on check-worthiness estimation, Task 2 on retrieving previously fact-checked claims, Task 3 on evidence retrieval, and Task 4 on claim verification. The lab is completed with Task 5 on check-worthiness estimation in political debates and speeches. A total of 67 teams registered to participate in the lab (up from 47 at CLEF 2019), and 23 of them actually submitted runs (compared to 14 at CLEF 2019). Most teams used deep neural networks based on BERT, LSTMs, or CNNs, and achieved sizable improvements over the baselines on all tasks. Here we describe the task setup, the evaluation results, and a summary of the approaches used by the participants, and we discuss some lessons learned. Last but not least, we release to the research community all datasets from the lab as well as the evaluation scripts, which should enable further research in the important tasks of check-worthiness estimation and automatic claim verification.
Alberto Barrón-Cedeño, Tamer Elsayed, Preslav Nakov, Giovanni Da San Martino, Maram Hasanain, Reem Suwaileh, Fatima Haouari, Nikolay Babulkov, Bayan Hamdan, Alex Nikolov, Shaden Shaar, Zien Sheikh Ali

Overview of ChEMU 2020: Named Entity Recognition and Event Extraction of Chemical Reactions from Patents

In this paper, we provide an overview of the Cheminformatics Elsevier Melbourne University (ChEMU) evaluation lab 2020, part of the Conference and Labs of the Evaluation Forum 2020 (CLEF2020). The ChEMU evaluation lab focuses on information extraction over chemical reactions from patent texts. Using the ChEMU corpus of 1500 “snippets” (text segments) sampled from 170 patent documents and annotated by chemical experts, we defined two key information extraction tasks. Task 1 addresses chemical named entity recognition, the identification of chemical compounds and their specific roles in chemical reactions. Task 2 focuses on event extraction, the identification of reaction steps, relating the chemical compounds involved in a chemical reaction. Herein, we describe the resources created for these tasks and the evaluation methodology adopted. We also provide a brief summary of the participants of this lab and the results obtained across 46 runs from 11 teams, finding that several submissions achieve substantially better results than our baseline methods.
Jiayuan He, Dat Quoc Nguyen, Saber A. Akhondi, Christian Druckenbrodt, Camilo Thorne, Ralph Hoessel, Zubair Afzal, Zenan Zhai, Biaoyan Fang, Hiyori Yoshikawa, Ameer Albahem, Lawrence Cavedon, Trevor Cohn, Timothy Baldwin, Karin Verspoor

Overview of the CLEF eHealth Evaluation Lab 2020

In this paper, we provide an overview of the eighth annual edition of the Conference and Labs of the Evaluation Forum (CLEF) eHealth evaluation lab. CLEF eHealth 2020 continues our development of evaluation tasks and resources since 2012 to address laypeople’s difficulties in retrieving and digesting valid and relevant information in their preferred language to make health-centred decisions. This year’s lab advertised two tasks. Task 1 on Information Extraction (IE) was new and focused on automatic clinical coding with diagnosis and procedure codes from the tenth revision of the International Statistical Classification of Diseases and Related Health Problems (ICD10), as well as on finding the corresponding evidence text snippets, for clinical case documents in Spanish. Task 2 on Information Retrieval (IR) was a novel extension of the most popular and established task in CLEF eHealth on Consumer Health Search (CHS). In total 55 submissions were made to these tasks. Herein, we describe the resources created for the two tasks and the evaluation methodology adopted. We also summarize lab submissions and results. As in previous years, the organizers have made data and tools associated with the lab tasks available for future research and development. The ongoing substantial community interest in the tasks and their resources has led to CLEF eHealth maturing as a primary venue for all interdisciplinary actors of the ecosystem for producing, processing, and consuming electronic health information.
Lorraine Goeuriot, Hanna Suominen, Liadh Kelly, Antonio Miranda-Escalada, Martin Krallinger, Zhengyang Liu, Gabriella Pasi, Gabriela Gonzalez Saez, Marco Viviani, Chenchen Xu

Overview of eRisk 2020: Early Risk Prediction on the Internet

This paper provides an overview of eRisk 2020, the fourth edition of this lab under the CLEF conference. The main purpose of eRisk is to explore issues of evaluation methodology, effectiveness metrics and other processes related to early risk detection. Early detection technologies can be employed in different areas, particularly those related to health and safety. This edition of eRisk had two tasks. The first task focused on the early detection of signs of self-harm. The second task challenged the participants to automatically fill in a depression questionnaire based on user interactions in social media.
David E. Losada, Fabio Crestani, Javier Parapar

Overview of CLEF HIPE 2020: Named Entity Recognition and Linking on Historical Newspapers

This paper presents an overview of the first edition of HIPE (Identifying Historical People, Places and other Entities), a pioneering shared task dedicated to the evaluation of named entity processing on historical newspapers in French, German and English. Since its introduction some twenty years ago, named entity (NE) processing has become an essential component of virtually any text mining application and has undergone major changes. Recently, two main trends characterise its developments: the adoption of deep learning architectures and the consideration of textual material originating from historical and cultural heritage collections. While the former opens up new opportunities, the latter introduces new challenges with heterogeneous, historical and noisy inputs. In this context, the objective of HIPE, run as part of the CLEF 2020 conference, is threefold: strengthening the robustness of existing approaches on non-standard inputs, enabling performance comparison of NE processing on historical texts, and, in the long run, fostering efficient semantic indexing of historical documents. Tasks, corpora, and results of 13 participating teams are presented.
Maud Ehrmann, Matteo Romanello, Alex Flückiger, Simon Clematide

Overview of the ImageCLEF 2020: Multimedia Retrieval in Medical, Lifelogging, Nature, and Internet Applications

This paper presents an overview of the ImageCLEF 2020 lab that was organized as part of the Conference and Labs of the Evaluation Forum - CLEF Labs 2020. ImageCLEF is an ongoing evaluation initiative (first run in 2003) that promotes the evaluation of technologies for annotation, indexing and retrieval of visual data with the aim of providing information access to large collections of images in various usage scenarios and domains. In 2020, the 18th edition of ImageCLEF runs four main tasks: (i) a medical task that groups three previous tasks, i.e., caption analysis, tuberculosis prediction, and medical visual question answering and question generation, (ii) a lifelog task (videos, images and other sources) about daily activity understanding, retrieval and summarization, (iii) a coral task about segmenting and labeling collections of coral reef images, and (iv) a new Internet task addressing the problems of identifying hand-drawn user interface components. Despite the current pandemic situation, the benchmark campaign received strong participation, with over 40 groups submitting more than 295 runs.
Bogdan Ionescu, Henning Müller, Renaud Péteri, Asma Ben Abacha, Vivek Datla, Sadid A. Hasan, Dina Demner-Fushman, Serge Kozlovski, Vitali Liauchuk, Yashin Dicente Cid, Vassili Kovalev, Obioma Pelka, Christoph M. Friedrich, Alba García Seco de Herrera, Van-Tu Ninh, Tu-Khiem Le, Liting Zhou, Luca Piras, Michael Riegler, Pål Halvorsen, Minh-Triet Tran, Mathias Lux, Cathal Gurrin, Duc-Tien Dang-Nguyen, Jon Chamberlain, Adrian Clark, Antonio Campello, Dimitri Fichou, Raul Berari, Paul Brie, Mihai Dogariu, Liviu Daniel Ştefan, Mihai Gabriel Constantin

Overview of LifeCLEF 2020: A System-Oriented Evaluation of Automated Species Identification and Species Distribution Prediction

Building accurate knowledge of the identity, the geographic distribution and the evolution of species is essential for the sustainable development of humanity, as well as for biodiversity conservation. However, the difficulty of identifying plants and animals in the field is hindering the aggregation of new data and knowledge. Identifying and naming living plants or animals is almost impossible for the general public and is often difficult even for professionals and naturalists. Bridging this gap is a key step towards enabling effective biodiversity monitoring systems. The LifeCLEF campaign, presented in this paper, has been promoting and evaluating advances in this domain since 2011. The 2020 edition proposes four data-oriented challenges related to the identification and prediction of biodiversity: (i) PlantCLEF: cross-domain plant identification based on herbarium sheets, (ii) BirdCLEF: bird species recognition in audio soundscapes, (iii) GeoLifeCLEF: location-based prediction of species based on environmental and occurrence data, and (iv) SnakeCLEF: snake identification based on image and geographic location.
Alexis Joly, Hervé Goëau, Stefan Kahl, Benjamin Deneu, Maximillien Servajean, Elijah Cole, Lukáš Picek, Rafael Ruiz de Castañeda, Isabelle Bolon, Andrew Durso, Titouan Lorieul, Christophe Botella, Hervé Glotin, Julien Champ, Ivan Eggel, Willem-Pier Vellinga, Pierre Bonnet, Henning Müller

Overview of LiLAS 2020 – Living Labs for Academic Search

Academic Search is a timeless challenge that the field of Information Retrieval has been dealing with for many years. Even today, the search for academic material is a broad field of research that recently started working on problems like the COVID-19 pandemic. However, test collections and specialized data sets like CORD-19 only allow for system-oriented experiments, while the evaluation of algorithms in real-world environments is only available to researchers from industry. In LiLAS, we open up two academic search platforms to allow participating researchers to evaluate their systems in a Docker-based research environment. This overview paper describes the motivation, the infrastructure, and the two systems, LIVIVO and GESIS Search, that are part of this CLEF lab.
Philipp Schaer, Johann Schaible, Leyla Jael Garcia Castro

Overview of PAN 2020: Authorship Verification, Celebrity Profiling, Profiling Fake News Spreaders on Twitter, and Style Change Detection

We briefly report on the four shared tasks organized as part of the PAN 2020 evaluation lab on digital text forensics and authorship analysis. Each task is introduced and motivated, and the results obtained are presented. Altogether, the four tasks attracted 230 registrations, yielding 83 successful submissions. This, and the fact that we continue to invite the submission of software rather than its run output using the TIRA experimentation platform, makes for a good start into the second decade of PAN evaluation labs.
Janek Bevendorff, Bilal Ghanem, Anastasia Giachanou, Mike Kestemont, Enrique Manjavacas, Ilia Markov, Maximilian Mayerl, Martin Potthast, Francisco Rangel, Paolo Rosso, Günther Specht, Efstathios Stamatatos, Benno Stein, Matti Wiegmann, Eva Zangerle

Overview of Touché 2020: Argument Retrieval

Extended Abstract
This paper is a condensed report on Touché: the first shared task on argument retrieval that was held at CLEF 2020. With the goal of creating a collaborative platform for research in argument retrieval, we ran two tasks: (1) supporting individuals in finding arguments on socially important topics and (2) supporting individuals with arguments on everyday personal decisions.
Alexander Bondarenko, Maik Fröbe, Meriem Beloucif, Lukas Gienapp, Yamen Ajjour, Alexander Panchenko, Chris Biemann, Benno Stein, Henning Wachsmuth, Martin Potthast, Matthias Hagen

