2021 | Book

Advances in Information Retrieval

43rd European Conference on IR Research, ECIR 2021, Virtual Event, March 28 – April 1, 2021, Proceedings, Part II

Editors: Djoerd Hiemstra, Prof. Marie-Francine Moens, Josiane Mothe, Raffaele Perego, Martin Potthast, Fabrizio Sebastiani

Publisher: Springer International Publishing

Book Series: Lecture Notes in Computer Science


About this book

This two-volume set LNCS 12656 and 12657 constitutes the refereed proceedings of the 43rd European Conference on IR Research, ECIR 2021, held virtually in March/April 2021, due to the COVID-19 pandemic.

The 50 full papers presented together with 11 reproducibility papers, 39 short papers, 15 demonstration papers, 12 CLEF lab descriptions papers, 5 doctoral consortium papers, 5 workshop abstracts, and 8 tutorials abstracts were carefully reviewed and selected from 436 submissions.

The accepted contributions cover the state of the art in IR: deep learning-based information retrieval techniques, use of entities and knowledge graphs, recommender systems, retrieval methods, information extraction, question answering, topic and prediction models, multimedia retrieval, and much more.

Table of Contents


Reproducibility Track Papers

Cross-Domain Retrieval in the Legal and Patent Domains: A Reproducibility Study

Domain-specific search has always been a challenging information retrieval task due to several factors, such as the domain-specific language, the unique task setting, and the lack of accessible queries and corresponding relevance judgements. In recent years, pretrained language models such as BERT have revolutionized web and news search. Naturally, the community aims to adapt these advancements to cross-domain transfer of retrieval models for domain-specific search. In the context of legal document retrieval, Shao et al. propose the BERT-PLI framework by modeling the Paragraph-Level Interactions with the language model BERT. In this paper we reproduce the original experiments, clarify pre-processing steps, and add missing scripts for framework steps; however, we are not able to reproduce the evaluation results. Contrary to the original paper, we demonstrate that the domain-specific paragraph-level modelling does not appear to help the performance of the BERT-PLI model compared to paragraph-level modelling with the original BERT. In addition to our legal search reproducibility study, we investigate BERT-PLI for document retrieval in the patent domain. We find that the BERT-PLI model does not yet achieve performance improvements for patent document retrieval compared to the BM25 baseline. Furthermore, we evaluate the BERT-PLI model for cross-domain retrieval between the legal and patent domains on individual components, at both the paragraph and the document level. We find that the transfer of the BERT-PLI model at the paragraph level leads to comparable results between both domains, as well as first promising results for the cross-domain transfer at the document level. For reproducibility and transparency, as well as to benefit the community, we make our source code and the trained models publicly available.

Sophia Althammer, Sebastian Hofstätter, Allan Hanbury
A Critical Assessment of State-of-the-Art in Entity Alignment

In this work, we perform an extensive investigation of two state-of-the-art (SotA) methods for the task of Entity Alignment in Knowledge Graphs. We first carefully examine the benchmarking process and identify several shortcomings, which make the results reported in the original works not always comparable. Furthermore, we suspect that it is a common practice in the community to perform hyperparameter optimization directly on the test set, reducing the informative value of reported performance. Thus, we select a representative sample of benchmarking datasets and describe their properties. We also examine different initializations for entity representations since they are a decisive factor for model performance. Furthermore, we use a shared train/validation/test split for an appropriate evaluation setting to evaluate all methods on all datasets. In our evaluation, we make several interesting findings. While we observe that most of the time SotA approaches perform better than baselines, they have difficulties when the dataset contains noise, which is the case in most real-life applications. Moreover, in our ablation study, we find that the features of the SotA methods that are crucial for good performance are often different from those previously assumed. The code is available.

Max Berrendorf, Ludwig Wacker, Evgeniy Faerman
System Effect Estimation by Sharding: A Comparison Between ANOVA Approaches to Detect Significant Differences

The ultimate goal of the evaluation is to understand when two IR systems are (significantly) different. To this end, many comparison procedures have been developed over time. However, to date, most reproducibility efforts focused just on reproducing systems and algorithms, almost fully neglecting to investigate the reproducibility of the methods we use to compare our systems. In this paper, we focus on methods based on ANalysis Of VAriance (ANOVA), which explicitly model the data in terms of different contributing effects, allowing us to obtain a more accurate estimate of significant differences. In this context, recent studies have shown how sharding the corpus can further improve the estimation of the system effect. We replicate and compare methods based on “traditional” ANOVA (tANOVA) to those based on a bootstrapped version of ANOVA (bANOVA) and those performing multiple comparisons relying on a more conservative Family-wise Error Rate (FWER) controlling approach to those relying on a more lenient False Discovery Rate (FDR) controlling approach. We found that bANOVA shows overall a good degree of reproducibility, with some limitations for what concerns the confidence intervals. Besides, compared to the tANOVA approaches, bANOVA presents greater statistical power, at the cost of lower stability. Overall, with this work, we aim at shifting the focus of reproducibility from systems alone to the methods we use to compare and analyze their performance.

Guglielmo Faggioli, Nicola Ferro
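
The contrast between FWER and FDR control mentioned in the abstract above can be illustrated with a minimal Python sketch. It assumes Bonferroni as the FWER-controlling correction and Benjamini-Hochberg as the FDR-controlling one; the abstract does not name the procedures the authors used, and the p-values below are invented for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """FWER control: reject hypothesis i when p_i <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """FDR control (step-up): find the largest k with p_(k) <= k * alpha / m
    and reject the k hypotheses with the smallest p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for position, idx in enumerate(order, start=1):
        if p_values[idx] <= position * alpha / m:
            k_max = position
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values from five pairwise system comparisons.
p_values = [0.001, 0.012, 0.020, 0.035, 0.300]
fwer_decisions = bonferroni(p_values)
fdr_decisions = benjamini_hochberg(p_values)
print(sum(fwer_decisions), "rejections under Bonferroni")
print(sum(fdr_decisions), "rejections under Benjamini-Hochberg")
```

On this toy input the more lenient FDR procedure declares more differences significant than the conservative FWER procedure, which is the trade-off the abstract refers to.
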
Reliability Prediction for Health-Related Content: A Replicability Study

Determining the reliability of online data is a challenge that has recently received increasing attention. In particular, unreliable health-related content has become pervasive during the COVID-19 pandemic. Previous research [37] has approached this problem with standard classification technology using a set of features that have included linguistic and external variables, among others. In this work, we aim to replicate parts of the study conducted by Sondhi and his colleagues using our own code, and make it available for the research community. The performance obtained in this study is as strong as the one reported by the original authors. Moreover, their conclusions are also confirmed by our replicability study. We report on the challenges involved in replication, including that it was impossible to replicate the computation of some features (since some tools or services originally used are now outdated or unavailable). Finally, we also report on a generalisation effort made to evaluate our predictive technology over new datasets [20, 35].

Marcos Fernández-Pichel, David E. Losada, Juan C. Pichel, David Elsweiler
An Empirical Comparison of Web Page Segmentation Algorithms

Over the past two decades, several algorithms have been developed to segment a web page into semantically coherent units, a task with several applications in web content analysis. However, these algorithms have hardly been compared empirically and it thus remains unclear which of them—or rather, which of their underlying paradigms—performs best. To contribute to closing this gap, we report on the reproduction and comparative evaluation of five segmentation algorithms on a large, standardized benchmark dataset for web page segmentation: Three of the algorithms have been specifically developed for web pages and have been selected to represent paradigmatically different approaches to the task, whereas the other two approaches originate from the segmentation of photos and print documents, respectively. For a fair comparison, we tuned each algorithm’s parameters, if applicable, to the dataset. Altogether, the classic rule-based VIPS algorithm achieved the highest performance, closely followed by the purely visual approach of Cormier et al. For reproducibility, we provide our reimplementations of the algorithms along with detailed instructions.

Johannes Kiesel, Lars Meyer, Florian Kneist, Benno Stein, Martin Potthast
Re-assessing the “Classify and Count” Quantification Method

Learning to quantify (a.k.a. quantification) is a task concerned with training unbiased estimators of class prevalence via supervised learning. This task originated with the observation that “Classify and Count” (CC), the trivial method of obtaining class prevalence estimates, is often a biased estimator, and thus delivers suboptimal quantification accuracy. Following this observation, several methods for learning to quantify have been proposed and have been shown to outperform CC. In this work we contend that previous works have failed to use properly optimised versions of CC. We thus reassess the real merits of CC and its variants, and argue that, while still inferior to some cutting-edge methods, they deliver near-state-of-the-art accuracy once (a) hyperparameter optimisation is performed, and (b) this optimisation is performed by using a truly quantification-oriented evaluation protocol. Experiments on three publicly available binary sentiment classification datasets support these conclusions.

Alejandro Moreo, Fabrizio Sebastiani
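
The "Classify and Count" estimator described in the abstract above, and the bias it can suffer from, can be sketched in a few lines of Python. The labels are invented, and the adjusted variant shown here (correcting the estimate with the classifier's true and false positive rates) is one commonly cited CC variant, included as an assumption; the abstract does not list which variants the authors evaluate.

```python
def classify_and_count(predicted_labels):
    """CC: the prevalence estimate is the fraction of items predicted positive."""
    return sum(predicted_labels) / len(predicted_labels)


def adjusted_classify_and_count(predicted_labels, tpr, fpr):
    """Adjusted CC: correct the CC estimate using the classifier's error rates."""
    cc = classify_and_count(predicted_labels)
    if tpr == fpr:
        return cc  # correction undefined; fall back to plain CC
    adjusted = (cc - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, adjusted))  # clip to a valid prevalence


# Toy data (invented): 4 of 10 items are truly positive.
true_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# A conservative classifier that finds only half of the positives.
predicted_labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

true_prevalence = sum(true_labels) / len(true_labels)
cc_estimate = classify_and_count(predicted_labels)
# tpr and fpr are assumed to have been estimated on held-out validation data.
acc_estimate = adjusted_classify_and_count(predicted_labels, tpr=0.5, fpr=0.0)
print(true_prevalence, cc_estimate, acc_estimate)
```

Here plain CC underestimates the prevalence because the classifier misses half of the positives, while the adjusted estimate recovers it under the assumed error rates.
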
Reproducibility, Replicability and Beyond: Assessing Production Readiness of Aspect Based Sentiment Analysis in the Wild

With the exponential growth of online marketplaces and user-generated content therein, aspect-based sentiment analysis has become more important than ever. In this work, we critically review a representative sample of the models published during the past six years through the lens of a practitioner, with an eye towards deployment in production. First, our rigorous empirical evaluation reveals poor reproducibility: an average 4–5% drop in test accuracy across the sample. Second, to further bolster our confidence in empirical evaluation, we report experiments on two challenging data slices, and observe a consistent 12–55% drop in accuracy. Third, we study the possibility of transfer across domains and observe that as little as 10–25% of the domain-specific training dataset, when used in conjunction with datasets from other domains within the same locale, largely closes the gap between complete cross-domain and complete in-domain predictive performance. Lastly, we open-source two large-scale annotated review corpora from a large e-commerce portal in India in order to aid the study of replicability and transfer, with the hope that it will fuel further growth of the field.

Rajdeep Mukherjee, Shreyas Shetty, Subrata Chattopadhyay, Subhadeep Maji, Samik Datta, Pawan Goyal
Robustness of Meta Matrix Factorization Against Strict Privacy Constraints

In this paper, we explore the reproducibility of MetaMF, a meta matrix factorization framework introduced by Lin et al. MetaMF employs meta learning for federated rating prediction to preserve users’ privacy. We reproduce the experiments of Lin et al. on five datasets, i.e., Douban, Hetrec-MovieLens, MovieLens 1M, Ciao, and Jester. Also, we study the impact of meta learning on the accuracy of MetaMF’s recommendations. Furthermore, in our work, we acknowledge that users may have different tolerances for revealing information about themselves. Hence, in a second strand of experiments, we investigate the robustness of MetaMF against strict privacy constraints. Our study illustrates that we can reproduce most of Lin et al.’s results. Plus, we provide strong evidence that meta learning is essential for MetaMF’s robustness against strict privacy constraints.

Peter Muellner, Dominik Kowald, Elisabeth Lex
Textual Characteristics of News Title and Body to Detect Fake News: A Reproducibility Study

Fake news, i.e., news deliberately designed to mislead others, is becoming a big societal threat with its fast dissemination over the Web and social media and its power to shape public opinion. Many researchers have been working to understand the underlying features that help identify such fake news on the Web. Recently, Horne and Adali found, on a small amount of data, that stylistic and linguistic features of the news title are better than the same type of features extracted from the news body in predicting fake news. In this paper, we present our attempt to reproduce the same results to validate their findings. We show which of their findings can be generalized to larger political and gossip news datasets.

Anu Shrestha, Francesca Spezzano
Federated Online Learning to Rank with Evolution Strategies: A Reproducibility Study

Online Learning to Rank (OLTR) optimizes ranking models using implicit user feedback, such as clicks, directly manipulating search engine results in production. This process requires OLTR methods to collect user queries and clicks; current methods are not suited to situations in which users want to maintain their privacy, i.e. not sharing data, queries and clicks. Recently, the federated OLTR with evolution strategies (FOLtR-ES) method has been proposed to provide a solution that can meet a number of users’ privacy requirements. Specifically, this method exploits the federated learning framework and ε-local differential privacy. However, the original research study that introduced this method only evaluated it on a small Learning to Rank (LTR) dataset and without conforming to current OLTR evaluation practice. It further did not explore specific parameters of the method, such as the number of clients involved in the federated learning process, and did not compare FOLtR-ES with the current state-of-the-art OLTR method. This paper aims to remedy this gap. Our findings question whether FOLtR-ES is a mature method that can be considered in practice: its effectiveness largely varies across datasets, click types, ranker types and settings. Its performance is also far from that of current state-of-the-art OLTR, raising the question of whether the privacy guaranteed by FOLtR-ES is achieved at the cost of seriously undermining search effectiveness and user experience.

Shuyi Wang, Shengyao Zhuang, Guido Zuccon
Comparing Score Aggregation Approaches for Document Retrieval with Pretrained Transformers

While BERT has been shown to be effective for passage retrieval, its maximum input length limitation poses a challenge when applying the model to document retrieval. In this work, we reproduce three passage score aggregation approaches proposed by Dai and Callan [5] for overcoming this limitation. After reproducing their results, we generalize their findings through experiments with a new dataset and experiment with other pretrained transformers that share similarities with BERT. We find that these BERT variants are not more effective for document retrieval in isolation, but can lead to increased effectiveness when combined with “pre–fine-tuning” on the MS MARCO passage dataset. Finally, we investigate whether there is a difference between fine-tuning models on “deep” judgments (i.e., fewer queries with many judgments each) vs. fine-tuning on “shallow” judgments (i.e., many queries with fewer judgments each). Based on available data from two different datasets, we find that the two approaches perform similarly.

Xinyu Zhang, Andrew Yates, Jimmy Lin
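
Passage score aggregation, as discussed in the abstract above, turns per-passage relevance scores into a single document score. The sketch below assumes the three approaches are the first-passage, maximum, and sum aggregations usually associated with Dai and Callan's work (often called FirstP, MaxP, and SumP); the abstract does not name them, and the scores are invented stand-ins for model outputs.

```python
def aggregate_passage_scores(passage_scores, method="max"):
    """Combine the scores of a document's passages into one document score."""
    if not passage_scores:
        raise ValueError("a document needs at least one scored passage")
    if method == "first":  # use only the first passage
        return passage_scores[0]
    if method == "max":    # use the best-scoring passage
        return max(passage_scores)
    if method == "sum":    # accumulate evidence over all passages
        return sum(passage_scores)
    raise ValueError(f"unknown aggregation method: {method}")


# Hypothetical scores for the passages of one document, in document order.
scores = [1.0, 3.5, 2.0]
first_p = aggregate_passage_scores(scores, "first")
max_p = aggregate_passage_scores(scores, "max")
sum_p = aggregate_passage_scores(scores, "sum")
print(first_p, max_p, sum_p)
```

Documents are then ranked by the aggregated score, so the choice of aggregation decides whether one strong passage or many moderately relevant passages matter more.
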

Short Papers

Transformer-Based Approach Towards Music Emotion Recognition from Lyrics

The task of identifying emotions from a given music track has been an active pursuit in the Music Information Retrieval (MIR) community for years. Music emotion recognition has typically relied on acoustic features, social tags, and other metadata to identify and classify music emotions. The role of lyrics in music emotion recognition remains under-appreciated in spite of several studies reporting superior performance of music emotion classifiers based on features extracted from lyrics. In this study, we use a transformer-based approach with XLNet as the base architecture, which, to date, has not been used to identify emotional connotations of music based on lyrics. Our proposed approach outperforms existing methods on multiple datasets. We used a robust methodology to enhance web crawlers’ accuracy for extracting lyrics. This study has important implications for improving applications involved in emotion-based playlist generation, in addition to improving music recommendation systems.

Yudhik Agrawal, Ramaguru Guru Ravi Shanker, Vinoo Alluri
BiGBERT: Classifying Educational Web Resources for Kindergarten-12 Grades

In this paper, we present BiGBERT, a deep learning model that simultaneously examines URLs and snippets from web resources to determine their alignment with children’s educational standards. Preliminary results, inferred from ablation studies and comparison with baselines and state-of-the-art counterparts, reveal that leveraging domain knowledge to learn domain-aligned contextual nuances from limited input data leads to improved identification of educational web resources.

Garrett Allen, Brody Downs, Aprajita Shukla, Casey Kennington, Jerry Alan Fails, Katherine Landau Wright, Maria Soledad Pera
How Do Users Revise Zero-Hit Product Search Queries?

A product search on an e-commerce site can return zero hits for several reasons. One major reason is that a user’s query may not be appropriately expressed for locating existing products. To enable successful product purchase, an ideal e-commerce site should automatically revise the user query to avoid zero hits. We investigate what kinds of query revision strategies turn a zero-hit query into a successful query, by analyzing data from a major Japanese e-commerce site. Our analysis shows that about 99% of zero-hit queries can be turned into successful queries that lead to product purchase by term dropping (27%), term replacement (29%), rephrasing (17%), and typo correction (26%). The results suggest that an automatic rewriter for avoiding zero-hit product queries may be able to achieve satisfactory coverage and accuracy by focusing on the above four revision strategies.

Yuki Amemiya, Tomohiro Manabe, Sumio Fujita, Tetsuya Sakai
Query Performance Prediction Through Retrieval Coherency

Post-retrieval Query Performance Prediction (QPP) methods benefit from the characteristics of the retrieved set of documents to determine query difficulty. While existing works have investigated the relation between query and retrieved document spaces, as well as retrieved document scores, the association between the retrieved documents themselves, referred to as coherency, has not been extensively investigated for QPP. We propose that the coherence of the retrieved documents can be formalized as a function of the characteristics of a network that represents the associations between these documents. Based on experiments on three corpora, namely Robust04, Gov2 and ClueWeb09 and their TREC topics, we show that our coherence measures outperform existing metrics in the literature and are able to significantly improve the performance of state of the art QPP methods.

Negar Arabzadeh, Amin Bigdeli, Morteza Zihayat, Ebrahim Bagheri
From the Beatles to Billie Eilish: Connecting Provider Representativeness and Exposure in Session-Based Recommender Systems

Session-based recommender systems consider the evolution of user preferences in browsing sessions. Existing studies suggest as next item the one that keeps the user engaged as long as possible. This point of view does not account for the providers’ perspective. In this paper, we highlight side effects on the providers caused by state-of-the-art models. We focus on the music domain and study how artists’ exposure in the recommendation lists is affected by the input data structure, where different session lengths are explored. We consider four session-based systems on three types of datasets, with long, short, and mixed playlist length. We provide measures to characterize disparate treatment between the artists, through a systematic analysis by comparing (i) the exposure received by an artist in the recommendations and (ii) their input representation in the data. Results show that artists for whom we can observe a lot of interactions, but who offer fewer items, are mistreated in terms of exposure. Moreover, we show how input data structure may impact the algorithms’ effectiveness, possibly due to preference-shift phenomena.

Alejandro Ariza, Francesco Fabbri, Ludovico Boratto, Maria Salamó
Bayesian System Inference on Shallow Pools

IR test collections make use of human annotated judgments. However, new systems that surface unjudged documents high in their result lists might undermine the reliability of statistical comparisons of system effectiveness, eroding the collection’s value. Here we explore a Bayesian inference-based analysis in a “high uncertainty” evaluation scenario, using data from the first round of the TREC COVID 2020 Track. Our approach constrains statistical modeling and generates credible replicates derived from the judged runs’ scores, comparing the relative discriminatory capacity of RBP scores by their system parameters modeled hierarchically over different response distributions. The resultant models directly compute risk measures as a posterior predictive distribution summary statistic; and also offer enhanced sensitivity.

Rodger Benham, Alistair Moffat, J. Shane Culpepper
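
The RBP scores mentioned in the abstract above, and the uncertainty that unjudged documents introduce, can be sketched as follows. This is a minimal illustration of rank-biased precision with its residual, where the weight of rank i is (1 - p) * p^(i - 1); the persistence value and the judgments are invented, and the Bayesian modeling itself is not reproduced here.

```python
def rbp_with_residual(judgments, p=0.8):
    """Return (base, residual) for rank-biased precision.

    judgments: relevance per rank (1 or 0), or None for an unjudged document.
    base: score contributed by judged documents.
    residual: weight held by unjudged documents and by all ranks beyond
    the end of the list, i.e. the maximum amount the score could still rise.
    """
    base = 0.0
    residual = 0.0
    for rank, relevance in enumerate(judgments, start=1):
        weight = (1 - p) * p ** (rank - 1)
        if relevance is None:
            residual += weight
        else:
            base += relevance * weight
    residual += p ** len(judgments)  # total weight of the unseen tail
    return base, residual


# Hypothetical run: relevant, unjudged, non-relevant, relevant.
base, residual = rbp_with_residual([1, None, 0, 1], p=0.5)
print(base, residual)
```

An unjudged document near the top of the list adds a large share to the residual, which is why shallow pools make system comparisons less certain.
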
Exploring Gender Biases in Information Retrieval Relevance Judgement Datasets

Recent studies in information retrieval have shown that gender biases have found their way into representational and algorithmic aspects of computational models. In this paper, we focus specifically on gender biases in information retrieval gold standard datasets, often referred to as relevance judgements. While not explored in the past, we submit that it is important to understand and measure the extent to which gender biases may be present in information retrieval relevance judgements, primarily because relevance judgements are not only the primary source for evaluating IR techniques but are also widely used for training end-to-end neural ranking methods. As such, the presence of bias in relevance judgements would immediately find its way into how retrieval methods operate in practice. Based on a fine-tuned BERT model, we show how queries can be labelled for gender at scale, and we use this approach to label MS MARCO queries. We then show how different psychological characteristics are exhibited within documents associated with gendered queries within the relevance judgement datasets. Our observations show that stereotypical biases are prevalent in relevance judgement documents.

Amin Bigdeli, Negar Arabzadeh, Morteza Zihayat, Ebrahim Bagheri
Assessing the Benefits of Model Ensembles in Neural Re-ranking for Passage Retrieval

Our work aimed at experimentally assessing the benefits of model ensembling within the context of neural methods for passage re-ranking. Starting from relatively standard neural models, we use a previously proposed technique named Fast Geometric Ensembling to generate multiple model instances from particular training schedules, and then focus our attention on different types of approaches for combining the results from the multiple model instances (e.g., averaging the ranking scores, using fusion methods from the IR literature, or using supervised learning-to-rank). Tests with the MS-MARCO dataset show that model ensembling can indeed benefit the ranking quality, particularly with supervised learning-to-rank, although also with unsupervised rank aggregation.

Luís Borges, Bruno Martins, Jamie Callan
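
Two of the unsupervised combination strategies named in the abstract above can be sketched in Python: averaging the ranking scores, and a rank-based fusion method. Reciprocal rank fusion is used here as one well-known fusion method from the IR literature; it is an assumption that it resembles what the authors tested, and the passage ids and scores are invented.

```python
from collections import defaultdict


def average_scores(runs):
    """Average each passage's score over all model instances.
    Assumes every run scores the same candidate passages."""
    return {pid: sum(run[pid] for run in runs) / len(runs) for pid in runs[0]}


def reciprocal_rank_fusion(runs, k=60):
    """Sum 1 / (k + rank) over the rankings induced by each run."""
    fused = defaultdict(float)
    for run in runs:
        ranking = sorted(run, key=run.get, reverse=True)
        for position, pid in enumerate(ranking, start=1):
            fused[pid] += 1.0 / (k + position)
    return dict(fused)


def ranking_of(scores):
    """Passage ids ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical scores from three model instances for the same candidates.
runs = [
    {"p1": 0.9, "p2": 0.5, "p3": 0.1},
    {"p1": 0.2, "p2": 0.8, "p3": 0.3},
    {"p1": 0.7, "p2": 0.6, "p3": 0.4},
]
by_average = ranking_of(average_scores(runs))
by_rrf = ranking_of(reciprocal_rank_fusion(runs))
print(by_average, by_rrf)
```

The two strategies can disagree: score averaging is sensitive to score magnitudes, whereas rank-based fusion only uses each model's ordering.
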
Event Detection with Entity Markers

Event detection involves the identification of instances of specified types of events in text and their classification into event types. In this paper, we approach the event detection task as a relation extraction task. In this context, we assume that the clues brought by the entities participating in an event are important and could improve the performance of event detection. Therefore, we propose to exploit entity information explicitly for detecting the event triggers by marking them at different levels while fine-tuning a pre-trained language model. The experimental results prove that our approach obtains state-of-the-art results on the ACE 2005 dataset.

Emanuela Boros, Jose G. Moreno, Antoine Doucet
Simplified TinyBERT: Knowledge Distillation for Document Retrieval

Despite the effectiveness of utilizing the BERT model for document ranking, the high computational cost of such approaches limits their use. To this end, this paper first empirically investigates the effectiveness of two knowledge distillation models on the document ranking task. In addition, on top of the recently proposed TinyBERT model, two simplifications are proposed. Evaluations on two different and widely-used benchmarks demonstrate that Simplified TinyBERT with the proposed simplifications not only boosts TinyBERT, but also significantly outperforms BERT-Base while providing a 15× speedup.

Xuanang Chen, Ben He, Kai Hui, Le Sun, Yingfei Sun
Improving Cold-Start Recommendation via Multi-prior Meta-learning

Optimization-based meta-learning has been applied in cold-start recommendations, where a good initialization of meta learner is obtained from past experiences and then reused for fast adaptation to new tasks. However, when dealing with various users with diverse preferences, meta-learning with a single prior might fail in cold-start recommendations due to its insufficient capability for adaptation. To address this problem, a multi-prior meta-learning (MPML) approach is proposed in this paper and applied in cold-start recommendations. More concretely, we integrate a novel accuracy-based task clustering scheme with double gradient to learn multiple priors. Experiments demonstrate the effectiveness of MPML.

Zhengyu Chen, Donglin Wang, Shiqian Yin
A White Box Analysis of ColBERT

Transformer-based models are nowadays state-of-the-art in ad hoc Information Retrieval, but their behavior is far from being understood. Recent work has claimed that BERT does not satisfy the classical IR axioms. However, we propose to dissect the matching process of ColBERT, through the analysis of term importance and exact/soft matching patterns. Even if the traditional axioms are not formally verified, our analysis reveals that ColBERT (i) is able to capture a notion of term importance; (ii) relies on exact matches for important terms.

Thibault Formal, Benjamin Piwowarski, Stéphane Clinchant
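
The matching process analysed in the abstract above rests on ColBERT's late-interaction scoring: each query token is matched to its most similar document token, and the maxima are summed. The sketch below illustrates that scoring rule with invented two-dimensional vectors; in the actual model the vectors are contextual token embeddings produced by BERT, which this example does not compute.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Sum, over query tokens, of the maximum similarity to any document token.
    Also returns the index of the matched document token per query token."""
    total = 0.0
    matches = []
    for q in query_vecs:
        sims = [dot(q, d) for d in doc_vecs]
        best = max(range(len(sims)), key=sims.__getitem__)
        matches.append(best)
        total += sims[best]
    return total, matches


# Hypothetical unit-length embeddings for 2 query tokens and 3 document tokens.
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
doc_vecs = [[1.0, 0.0], [0.6, 0.8], [0.0, -1.0]]
score, matches = maxsim_score(query_vecs, doc_vecs)
print(score, matches)
```

Inspecting which document token each query token selects, and how much it contributes to the sum, is the kind of per-term evidence that an analysis of exact versus soft matching can build on.
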
Diversity Aware Relevance Learning for Argument Search

In this work, we focus on retrieving relevant arguments for a query claim covering diverse aspects. State-of-the-art methods rely on explicit mappings between claims and premises and thus cannot utilize extensive available collections of premises without laborious and costly manual annotation. Their diversity approach relies on removing duplicates via clustering, which does not directly ensure that the selected premises cover all aspects. This work introduces a new multi-step approach for the argument retrieval problem. Rather than relying on ground-truth assignments, our approach employs a machine learning model to capture semantic relationships between arguments. Beyond that, it aims to cover diverse facets of the query instead of explicitly identifying duplicates. Our empirical evaluation demonstrates that our approach leads to a significant improvement in the argument retrieval task, even though it requires less data than prior methods. Our code is available.

Michael Fromm, Max Berrendorf, Sandra Obermeier, Thomas Seidl, Evgeniy Faerman
SQE-GAN: A Supervised Query Expansion Scheme via GAN

Existing Supervised Query Expansion (SQE) spends much time in term feature extraction but generates sub-optimal expanded terms. In this paper, we introduce Generative Adversarial Nets (GANs) and propose a GAN-based SQE method (SQE-GAN) to get helpful query expansion terms. We unify two types of models in query expansion: the generative model and the discriminative one. The generative (resp., discriminative) model focuses on predicting relevant terms (resp., relevancy) given a query (resp., a query-term pair). We iteratively optimize both models with a game between them. Besides, a BiLSTM layer is adopted to encode the utility of a term with respect to the query. As a result, the costly feature calculation in SQE schemes is avoided, so that efficiency can be significantly improved. Moreover, by introducing GAN into expansion, the expanded terms can be more effective with respect to the eventual needs of the user. Our experimental results demonstrate that SQE-GAN can be 37.3% faster than state-of-the-art SQE solutions while outperforming some recently proposed neural models in retrieval quality.

Tianle Fu, Qi Tian, Hui Li
Rethink Training of BERT Rerankers in Multi-stage Retrieval Pipeline

Pre-trained deep language models (LMs) have advanced the state-of-the-art of text retrieval. Rerankers fine-tuned from deep LMs estimate candidate relevance based on rich contextualized matching signals. Meanwhile, deep LMs can also be leveraged to improve the search index, building retrievers with better recall. One would expect a straightforward combination of both in a pipeline to have additive performance gain. In this paper, we discover otherwise: popular rerankers cannot fully exploit the improved retrieval results. We, therefore, propose Localized Contrastive Estimation (LCE) for training rerankers and demonstrate that it significantly improves deep two-stage models. Our code is open-sourced.

Luyu Gao, Zhuyun Dai, Jamie Callan
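
The contrastive training idea named in the abstract above can be illustrated with a minimal sketch of a softmax-style contrastive loss over one query's candidate group: the positive passage competes against negatives that are assumed to be sampled from the top of the retriever's own ranking. This is a simplified reading of LCE rather than the authors' implementation, and the scores are invented stand-ins for reranker outputs.

```python
import math


def contrastive_group_loss(positive_score, negative_scores):
    """Negative log-softmax of the positive within its candidate group:
    loss = log(sum over the group of exp(score)) - positive_score."""
    scores = [positive_score] + list(negative_scores)
    max_score = max(scores)  # subtract the maximum for numerical stability
    log_sum_exp = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    return log_sum_exp - positive_score


# Hypothetical reranker scores for one query.
loss_easy_negatives = contrastive_group_loss(2.0, [1.0, 0.0])
loss_hard_negatives = contrastive_group_loss(2.0, [1.9, 1.8])
print(loss_easy_negatives, loss_hard_negatives)
```

Negatives that score close to the positive yield a larger loss, so drawing them from the retriever's top results concentrates training on the confusions the reranker actually faces in the pipeline.
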
Should I Visit This Place? Inclusion and Exclusion Phrase Mining from Reviews

Although several automatic itinerary generation services have made travel planning easy, travellers often find themselves in unique situations where they cannot make the best out of their trip. Visitors differ in terms of many factors such as suffering from a disability, being of a particular dietary preference, travelling with a toddler, etc. While most tourist spots are universal, others may not be inclusive for all. In this paper, we focus on the problem of mining inclusion and exclusion phrases associated with 11 such factors, from reviews related to a tourist spot. While existing work on tourism data mining mainly focuses on structured extraction of trip related information, personalized sentiment analysis, and automatic itinerary generation, to the best of our knowledge this is the first work on inclusion/exclusion phrase mining from tourism reviews. Using a dataset of 2000 reviews related to 1000 tourist spots, our broad level classifier provides a binary overlap F1 of ∼80 and ∼82 to classify a phrase as inclusion or exclusion respectively. Further, our inclusion/exclusion classifier provides an F1 of ∼98 and ∼97 for 11-class inclusion and exclusion classification respectively. We believe that our work can significantly improve the quality of an automatic itinerary generation service.

Omkar Gurjar, Manish Gupta
Dynamic Cross-Sentential Context Representation for Event Detection

In this paper, which focuses on the supervised detection of event mentions in texts, we propose a method to exploit a large context through the representation of distant sentences selected based on coreference relations between entities. We show the benefits of extending a neural sentence-level model with this representation through evaluation carried out on the TAC Event 2015 reference corpus.

Dorian Kodelja, Romaric Besançon, Olivier Ferret
Transfer Learning and Augmentation for Word Sense Disambiguation

Many downstream NLP tasks have shown significant improvement through continual pre-training, transfer learning and multi-task learning. State-of-the-art approaches in Word Sense Disambiguation (WSD) today benefit from some of these approaches in conjunction with information sources such as semantic relationships and gloss definitions contained within WordNet. Our work builds upon these systems and uses data augmentation along with extensive pre-training on various NLP tasks and datasets. Our transfer learning and augmentation pipeline achieves state-of-the-art single-model performance in WSD and is on par with the best ensemble results.

Harsh Kohli
Cross-modal Memory Fusion Network for Multimodal Sequential Learning with Missing Values

Information in many real-world applications is inherently multi-modal, sequential and characterized by a variety of missing values. Existing imputation methods mainly focus on the recurrent dynamics in one modality while ignoring the complementary property from other modalities. In this paper, we propose a novel method called cross-modal memory fusion network (CMFN) that explicitly learns both modal-specific and cross-modal dynamics for imputing the missing values in multi-modal sequential learning tasks. Experiments on two datasets demonstrate that our method outperforms state-of-the-art methods and show its potential to better impute missing values in complex multi-modal datasets.

Chen Lin, Joyce C. Ho, Eugene Agichtein
Social Media Popularity Prediction of Planned Events Using Deep Learning

Early prediction of popularity is crucial for recommendation of planned events such as concerts, conferences, sports events, performing arts, etc. Estimation of the volume of social media discussions related to the event can be useful for this purpose. Most of the existing methods for social media popularity prediction focus on estimating tweet popularity, i.e., predicting the number of retweets for a given tweet. There is less focus on predicting event popularity using social media. We focus on predicting the popularity of an event well before its start date. This type of early prediction can be helpful in event recommendation systems, in assisting event organizers with planning, in dynamic ticket pricing, etc. We propose a deep learning based model to predict the social media popularity of an event. We also incorporate an extra feature indicating how many days are left until the event start date to improve the performance. Experimental results show that our proposed deep learning based approach outperforms the baseline methods.

Sreekanth Madisetty, Maunendra Sankar Desarkar
Right for the Right Reasons: Making Image Classification Intuitively Explainable

The effectiveness of Convolutional Neural Networks (CNNs) in classifying image data has been thoroughly demonstrated. In order to explain the classification to humans, methods for visualizing classification evidence have been developed in recent years. These explanations reveal that sometimes images are classified correctly, but for the wrong reasons, i.e., based on incidental evidence. Of course, it is desirable that images are classified correctly for the right reasons, i.e., based on the actual evidence. To this end, we propose a new explanation quality metric to measure object aligned explanation in image classification which we refer to as the ObAlEx metric. Using object detection approaches, explanation approaches, and ObAlEx, we quantify the focus of CNNs on the actual evidence. Moreover, we show that additional training of the CNNs can improve the focus of CNNs without decreasing their accuracy.

Anna Nguyen, Adrian Oberföll, Michael Färber
Weakly Supervised Label Smoothing

We study Label Smoothing (LS), a widely used regularization technique, in the context of neural learning to rank (L2R) models. LS combines the ground-truth labels with a uniform distribution, encouraging the model to be less confident in its predictions. We analyze the relationship between the non-relevant documents (specifically, how they are sampled) and the effectiveness of LS, discussing how LS can capture “hidden similarity knowledge” between the relevant and non-relevant document classes. We further analyze LS by testing whether a curriculum-learning approach, i.e., starting with LS and after a number of iterations using only ground-truth labels, is beneficial. Inspired by our investigation of LS in the context of neural L2R models, we propose a novel technique called Weakly Supervised Label Smoothing (WSLS) that takes advantage of the retrieval scores of the negative sampled documents as a weak supervision signal in the process of modifying the ground-truth labels. WSLS is simple to implement, requiring no modification to the neural ranker architecture. Our experiments across three retrieval tasks (passage retrieval, similar question retrieval and conversation response ranking) show that WSLS for pointwise BERT-based rankers leads to consistent effectiveness gains. The source code is publicly available.
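
The description of standard LS above (ground-truth labels mixed with a uniform distribution) can be illustrated with a minimal sketch. The function name and the smoothing weight `epsilon` are illustrative choices, not taken from the paper, and the sketch covers plain LS only, not the WSLS variant, whose exact label modification is not given in the abstract.

```python
from typing import List


def smooth_labels(one_hot: List[float], epsilon: float) -> List[float]:
    """Mix a one-hot label vector with the uniform distribution over classes."""
    num_classes = len(one_hot)
    # Each class keeps (1 - epsilon) of its original mass and receives an
    # equal share of the epsilon mass that is spread uniformly.
    return [(1.0 - epsilon) * y + epsilon / num_classes for y in one_hot]


# Pointwise relevance label for a relevant document: [relevant, non-relevant].
smoothed = smooth_labels([1.0, 0.0], epsilon=0.2)
```

The smoothed vector still sums to one, so it can be used directly as a soft target for a cross-entropy loss.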

Gustavo Penha, Claudia Hauff

Open Access

Neural Feature Selection for Learning to Rank

LEarning TO Rank (LETOR) is a research area in the field of Information Retrieval (IR) where machine learning models are employed to rank a set of items. In the past few years, neural LETOR approaches have become a competitive alternative to traditional ones like LambdaMART. However, the performance of neural architectures has grown in proportion to their complexity and size. This can be an obstacle to their adoption in large-scale search systems, where model size impacts latency and update time. For this reason, we propose an architecture-agnostic approach based on a neural LETOR model to reduce the size of its input by up to 60% without affecting the system performance. This approach also makes it possible to reduce the complexity of a LETOR model and, therefore, its training and inference time by up to 50%.

Alberto Purpura, Karolina Buchner, Gianmaria Silvello, Gian Antonio Susto
Exploring the Incorporation of Opinion Polarity for Abstractive Multi-document Summarisation

Abstractive multi-document summarisation (MDS) remains a challenging task. Part of the problem is the question as to how to preserve a document’s polarity in the summary. We propose an opinion polarity attention model for MDS, which incorporates a polarity estimator based on a BERT-GRU sentiment analysis network. It captures the impact of opinions expressed in the source documents and integrates it in the attention mechanism. Experimental results using a state-of-the-art MDS approach and a common benchmark test collection demonstrate that this model has a measurable positive effect using a range of metrics.

Dominik Ramsauer, Udo Kruschwitz
Multilingual Evidence Retrieval and Fact Verification to Combat Global Disinformation: The Power of Polyglotism

This article investigates multilingual evidence retrieval and fact verification as a step to combat global disinformation, a first effort of this kind, to the best of our knowledge. The goal is building multilingual systems that retrieve in evidence-rich languages to verify claims in evidence-poor languages that are more commonly targeted by disinformation. To this end, our EnmBERT fact verification system shows evidence of transfer learning ability and a 400-example mixed English-Romanian dataset is made available for cross-lingual transfer learning evaluation.

Denisa A. Olteanu Roberts
How Do Active Reading Strategies Affect Learning Outcomes in Web Search?

Prior work in education research has shown that various active reading strategies, notably highlighting and note-taking, benefit learning outcomes. Most of these findings are based on observational studies where learners learn from a single document. In a Search as Learning (SAL) context where learners have to iteratively scan and explore a large number of documents to address their learning objective, the effect of these active reading strategies is largely unexplored. To address this research gap, we carried out a crowd-sourced user study, and explored the effects of different highlighting and note-taking strategies on learning during a complex, learning-oriented search task. Out of five hypotheses derived from the education literature we could confirm three in the SAL context. Our findings have important design implications on aiding learning through search. Learners can benefit from search interfaces equipped with active reading tools—but some learning strategies employing these tools are more effective than others. (This research has been supported by DDS (Delft Data Science) and NWO projects SearchX (639.022.722) and Aspasia (015.013.027).)

Nirmal Roy, Manuel Valle Torre, Ujwal Gadiraju, David Maxwell, Claudia Hauff
Fine-Tuning BERT for COVID-19 Domain Ad-Hoc IR by Using Pseudo-qrels

This work analyzes the feasibility of training a neural retrieval system for a collection of scientific papers about COVID-19 using pseudo-qrels extracted from the collection. We propose a method for generating pseudo-qrels that exploits two characteristics present in scientific articles: a) the relationship between title and abstract, and b) the relationship between articles through sentences containing citations. Through these signals we generate pseudo-queries and their respective pseudo-positive (relevant documents) and pseudo-negative (non-relevant documents) examples. The article retrieval process combines a ranking model based on term-matching techniques and a neural one based on pretrained BERT models. BERT models are fine-tuned to the task using the pseudo-qrels generated. We compare different BERT models, both open domain and biomedical domain, and also compare the generated pseudo-qrels with the open domain MS-Marco dataset for fine-tuning the models. The results obtained on the TREC-COVID collection show that pseudo-qrels provide a significant improvement to neural models, both against classic IR baselines based on term-matching and neural systems trained on MS-Marco.

Xabier Saralegi, Iñaki San Vicente
Windowing Models for Abstractive Summarization of Long Texts

Neural summarization models have a fixed-size input limitation: if text length surpasses the model’s maximal input length, some document content (possibly summary-relevant) gets truncated. Independently summarizing windows of maximal input size prevents information flow between windows and leads to incoherent summaries. We propose windowing models for neural abstractive summarization of (arbitrarily) long texts. We extend the sequence-to-sequence model augmented with a pointer-generator network by (1) allowing the encoder to slide over different windows of the input document and (2) sharing the decoder and retaining its state across different input windows. We explore two windowing variants: Static Windowing precomputes the number of tokens for the decoder to generate from each window (based on training corpus statistics); in Dynamic Windowing the decoder learns to emit a token signaling the shift to the next input window. Empirical results show our models to be effective in the intended use case: summarizing long texts with relevant content not bound to the beginning of the document.

Leon Schüller, Florian Wilhelm, Nico Kreiling, Goran Glavaš
Towards Dark Jargon Interpretation in Underground Forums

Dark jargons are benign-looking words that have hidden, sinister meanings and are used by participants of underground forums for illicit behavior. For example, the dark term “rat” is often used in lieu of “Remote Access Trojan”. In this work we present a novel method towards automatically identifying and interpreting dark jargons. We formalize the problem as a mapping from dark words to “clean” words with no hidden meaning. Our method makes use of interpretable representations of dark and clean words in the form of probability distributions over a shared vocabulary. In our experiments we show our method to be effective in terms of dark jargon identification, as it outperforms another baseline on simulated data. Using manual evaluation, we show that our method is able to detect dark jargons in a real-world underground forum dataset.

Dominic Seyler, Wei Liu, XiaoFeng Wang, ChengXiang Zhai
Multi-span Extractive Reading Comprehension Without Multi-span Supervision

This study focuses on multi-span reading comprehension (RC), which requires answering questions with multiple text spans. Existing approaches for extracting multiple answers require an elaborate dataset that contains questions requiring multiple answers. We propose a method for rewriting single-span answers extracted using several different models to detect single/multiple answer(s). With this approach, only a simple dataset and models for single-span RC are required. We consider multi-span RC with zero-shot learning. Experimental results using the DROP and QUOREF datasets demonstrate that the proposed method improves the exact match (EM) and F1 scores by a large margin on multi-span RC, compared to the baseline models. We further analyzed the effectiveness of combining different models and a strategy for such combinations when applied to multi-span RC.

Takumi Takahashi, Motoki Taniguchi, Tomoki Taniguchi, Tomoko Ohkuma
Textual Complexity as an Indicator of Document Relevance

We study the textual complexity of documents as an aspect of the Information Retrieval process that influences retrieval effectiveness. Our experiments show that in many cases user queries allow determining which linguistic competency level best suits an underlying information need. The paper investigates promising first approaches on how to do so automatically and compares them to an idealistic baseline. By filtering out documents of unexpected textual complexity, we find improved search results mainly when using precision-oriented effectiveness measures.

Anastasia Taranova, Martin Braschler
A Comparison of Question Rewriting Methods for Conversational Passage Retrieval

Conversational passage retrieval relies on question rewriting to modify the original question so that it no longer depends on the conversation history. Several methods for question rewriting have recently been proposed, but they were compared under different retrieval pipelines. We bridge this gap by thoroughly evaluating those question rewriting methods on the TREC CAsT 2019 and 2020 datasets under the same retrieval pipeline. We analyze the effect of different types of question rewriting methods on retrieval performance and show that by combining question rewriting methods of different types we can achieve state-of-the-art performance on both datasets. (Resources are publicly available.)

Svitlana Vakulenko, Nikos Voskarides, Zhucheng Tu, Shayne Longpre
Predicting Question Responses to Improve the Performance of Retrieval-Based Chatbot

Chatbot models are built to mimic a conversation between humans and fulfill different tasks. Retrieval-based chatbot models are designed to select the most appropriate response from a pool of candidates given a past conversation and current input. During the conversation, chatbots are expected to (1) provide direct assistance when the user request is clear or (2) ask clarification questions to gather more information to better understand the user’s need. Despite its importance, few studies have looked at when to ask questions and how to retrieve relevant questions accordingly. As a result, existing retrieval-based chatbot models perform poorly when the correct response is a question. To overcome this limitation, we propose an adaptive response retrieval model. Specifically, we first predict whether the best response should be a question, and then apply different models to retrieve the responses accordingly. A novel question response retrieval model is proposed to better capture the matching patterns between question responses and the conversations. Experiments on two public data sets show that the proposed adaptive model can significantly and consistently improve the retrieval performance, in particular for the question responses.

Disen Wang, Hui Fang
Multi-head Self-attention with Role-Guided Masks

The state of the art in learning meaningful semantic representations of words is the Transformer model and its attention mechanisms. Simply put, the attention mechanisms learn to attend to specific parts of the input, dispensing with recurrence and convolutions. While some of the learned attention heads have been found to play linguistically interpretable roles, they can be redundant or prone to errors. We propose a method to guide the attention heads towards roles identified in prior work as important. We do this by defining role-specific masks to constrain the heads to attend to specific parts of the input, such that different heads are designed to play different roles. Experiments on text classification and machine translation using 7 different datasets show that our method outperforms competitive attention-based, CNN, and RNN baselines.

Dongsheng Wang, Casper Hansen, Lucas Chaves Lima, Christian Hansen, Maria Maistro, Jakob Grue Simonsen, Christina Lioma
PGT: Pseudo Relevance Feedback Using a Graph-Based Transformer

Most research on pseudo relevance feedback (PRF) has been done in vector space and probabilistic retrieval models. This paper shows that Transformer-based rerankers can also benefit from the extra context that PRF provides. It presents PGT, a graph-based Transformer that sparsifies attention between graph nodes to enable PRF while avoiding the high computational complexity of most Transformer architectures. Experiments show that PGT improves upon a non-PRF Transformer reranker, and that it is at least as accurate as Transformer PRF models that use full attention, but with lower computational costs.

HongChien Yu, Zhuyun Dai, Jamie Callan
Clustering-Augmented Multi-instance Learning for Neural Relation Extraction

Despite its efficiency in generating training data, distant supervision for sentential relation extraction assigns labels to instances in a context-agnostic manner, a process that may introduce false labels and confuse sentential model learning. In this paper, we propose to integrate instance clustering with distant training, and develop a novel clustering-augmented multi-instance training framework. Specifically, for sentences labeled with the same relation type, we jointly perform clustering based on their semantic representations, and treat each cluster as a training unit for multi-instance training. Compared to existing bag-level attention models, our proposed method does not restrict the training unit to sentences with the same entity pair, as such a restriction may cause the selective attention to focus on instances with simple sentence context, and thus fail to provide informative supervision. Experiments on two popular datasets demonstrate the effectiveness of augmenting multi-instance learning with clustering.

Qi Zhang, Siliang Tang, Jinquan Sun, Yu Wang, Lei Zhang
Detecting and Forecasting Misinformation via Temporal and Geometric Propagation Patterns

Misinformation takes the form of a false claim under the guise of fact. It is necessary to protect social media against misinformation by means of effective misinformation detection and analysis. To this end, we formulate misinformation propagation as a dynamic graph, then extract the temporal evolution patterns and geometric features of the propagation graph based on Temporal Point Processes (TPPs). TPPs provide the appropriate modelling framework for a list of stochastic, discrete events; in this context, that list is a sequence of social user engagements. Furthermore, we forecast the cumulative number of engaged users based on a power law. Such forecasting capabilities can be useful in assessing the threat level of misinformation pieces. By jointly considering the geometric and temporal propagation patterns, our model achieves performance comparable to state-of-the-art baselines on two well-known datasets.

Qiang Zhang, Jonathan Cook, Emine Yilmaz
Deep Query Likelihood Model for Information Retrieval

The query likelihood model (QLM) for information retrieval has been thoroughly investigated and utilised. At the basis of this method is the representation of queries and documents as language models; retrieval then corresponds to evaluating the likelihood that the query could be generated by the document. Several approaches have arisen to compute such probability, including maximum likelihood estimation, smoothing, and considering translation probabilities from related terms. In this paper, we consider estimating this likelihood using modern pre-trained deep language models, and in particular the text-to-text transfer transformer (T5), giving rise to the QLM-T5. This approach is evaluated on the passage ranking task of the MS MARCO dataset; empirical results show that QLM-T5 significantly outperforms traditional QLM methods, as well as a recent ad-hoc method that exploits T5 for this task.
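
To make the classical baseline concrete, here is a minimal sketch of query likelihood scoring with Dirichlet smoothing, one common smoothing choice picked here for illustration. It is not the QLM-T5 method, and the function name, whitespace-free token lists and the value of `mu` are assumptions made for the example.

```python
import math
from collections import Counter
from typing import List


def query_log_likelihood(query: List[str], doc: List[str],
                         collection: List[List[str]], mu: float = 2000.0) -> float:
    """Log-probability of generating the query from a smoothed document model."""
    doc_tf = Counter(doc)
    coll_tf = Counter(term for d in collection for term in d)
    coll_len = sum(coll_tf.values())
    score = 0.0
    for term in query:
        # Background probability of the term in the whole collection.
        p_coll = coll_tf[term] / coll_len
        # Dirichlet-smoothed estimate of P(term | document).
        p_term = (doc_tf[term] + mu * p_coll) / (len(doc) + mu)
        if p_term == 0.0:
            # The term occurs nowhere in the collection.
            return float("-inf")
        score += math.log(p_term)
    return score


docs = [["deep", "language", "model"], ["cooking", "recipes", "pasta"]]
query = ["language", "model"]
scores = [query_log_likelihood(query, d, docs, mu=10.0) for d in docs]
```

Documents are ranked by descending score; in this toy collection the first document contains both query terms and therefore receives the higher log-likelihood.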

Shengyao Zhuang, Hang Li, Guido Zuccon
Tweet Length Matters: A Comparative Analysis on Topic Detection in Microblogs

Microblogs are characterized by short and informal text, which is therefore sparse and noisy. To understand the topic semantics of short text, supervised and unsupervised methods have been investigated, including traditional bag-of-words and deep learning-based models. However, the effectiveness of such methods has not been investigated together in short-text topic detection. In this study, we provide a comparative analysis of topic detection in microblogs. We construct a tweet dataset based on recent and important events worldwide, including the COVID-19 pandemic and the BlackLivesMatter movement. We also analyze the effect of varying tweet length in both evaluation and training. Our results show that tweet length matters in terms of the effectiveness of a topic-detection method.

Furkan Şahinuç, Cagri Toraman

Demo Papers

repro_eval: A Python Interface to Reproducibility Measures of System-Oriented IR Experiments

In this work we introduce repro_eval - a tool for reactive reproducibility studies of system-oriented Information Retrieval (IR) experiments. The corresponding Python package provides IR researchers with measures for different levels of reproduction when evaluating their systems’ outputs. By offering an easily extensible interface, we hope to stimulate common practices when conducting a reproducibility study of system-oriented IR experiments.

Timo Breuer, Nicola Ferro, Maria Maistro, Philipp Schaer
Signal Briefings: Monitoring News Beyond the Brand

Public relations (PR) professionals are responsible for managing an organisation’s reputation through monitoring entities of interest and wider industry news. Monitoring and tracking wide news spaces such as industry news can place a significant workload on PR professionals. We present Signal Briefings, a system which uses a combination of clustering and ranking to produce a small set of impactful articles distributed as a periodic email in a scalable and efficient manner.

James Brill, Dyaa Albakour, José Esquivel, Udo Kruschwitz, Miguel Martinez, Jon Chamberlain
Time-Matters: Temporal Unfolding of Texts

Over the past few years, the amount of information generated, consumed and stored on the Web has grown exponentially, making it impossible for users to keep up to date. Temporal data representation can help in this process by giving documents a sense of organization. Timelines are a natural way to showcase this data, giving users the chance to get familiar with a topic in a shorter amount of time. Despite their importance, little is known about their use in the context of single documents. In this paper, we present Time-Matters, a novel system to automatically explore arbitrary texts through temporal narratives in an interactive fashion that allows users to get insights into the relevant temporal happenings of a story through multiple components, including temporal annotation, storylines or temporal clustering. In contrast to classical timeline multi-document summarization tasks, we focus on performing text summaries of single documents with a temporal lens. This approach may be of interest to a number of providers such as media outlets, for which automatically building a condensed overview of a text is an important issue.

Ricardo Campos, Jorge Duque, Tiago Cândido, Jorge Mendes, Gaël Dias, Alípio Jorge, Célia Nunes
An Extensible Toolkit of Query Refinement Methods and Gold Standard Dataset Generation

We present an open-source, extensible, Python-based toolkit that provides access to (1) a range of built-in unsupervised query expansion methods, and (2) a pipeline for generating gold standard datasets for building and evaluating supervised query refinement methods. While the information literature offers abundant work on query expansion techniques, there is yet to be a tool that provides unified access to a comprehensive set of query expansion techniques. The advantage of our proposed toolkit, known as ReQue (refining queries), is that it offers one-stop-shop access to query expansion techniques to be used in external information retrieval applications. More importantly, we show how ReQue can be used for building gold standard datasets that can be used for training supervised deep learning-based query refinement techniques. These techniques require sizeable gold query refinement datasets, which are not available in the literature. ReQue provides the means to systematically build such datasets.

Hossein Fani, Mahtab Tamannaee, Fattane Zarrinkalam, Jamil Samouh, Samad Paydar, Ebrahim Bagheri
CoralExp: An Explainable System to Support Coral Taxonomy Research

Thanks to the availability of large digital collections of coral images and because of the difficulty for experts to manually process all of them, it is possible and valuable to apply automatic methods to identify similar and relevant coral specimens in a coral specimen collection. Given the digital nature of these collections, it makes sense to leverage computer vision and information retrieval methods to support marine biology experts with their research. In this paper we introduce CoralExp, a data exploration system that uses explainable computer vision and machine learning techniques to help domain experts in marine biology better understand the reasoning behind automated classification decisions, thus providing insights into which coral properties should be considered when designing future coral taxonomies.

Jaiden Harding, Tom Bridge, Gianluca Demartini
AWESSOME: An Unsupervised Sentiment Intensity Scoring Framework Using Neural Word Embeddings

Sentiment analysis (SA) is the key element for a variety of opinion and attitude mining tasks. While various unsupervised SA tools already exist, a central problem is that they are lexicon-based and the lexicons used are limited, leading to a vocabulary mismatch. In this paper, we present an unsupervised word embedding-based sentiment scoring framework for sentiment intensity scoring (SIS). The framework generalizes and combines past works so that pre-existing lexicons (e.g. VADER, LabMT) and word embeddings (e.g. BERT, RoBERTa) can be used to address this problem, with no training required, while providing fine-grained SIS of words and phrases. The framework is scalable and extensible, so that custom lexicons or word embeddings can be used with the core methods, and even to create new corpus-specific lexicons without the need for extensive supervised learning and retraining. The Python 3 toolkit is open source, freely available from GitHub, and can be directly installed via pip install awessome.

Amal Htait, Leif Azzopardi
HSEarch: Semantic Search System for Workplace Accident Reports

Semantic search engines, which integrate the output of text mining (TM) methods, can significantly increase the ease and efficiency of finding relevant documents and locating important information within them. We present a novel search engine for the construction industry, HSEarch, which uses TM methods to provide semantically-enhanced, faceted search over a repository of workplace accident reports. Compared to previous TM-driven search engines for the construction industry, HSEarch provides a more interactive means for users to explore the contents of the repository, to review documents more systematically and to locate relevant knowledge within them.

Emrah Inan, Paul Thompson, Tim Yates, Sophia Ananiadou
Multi-view Conversational Search Interface Using a Dialogue-Based Agent

We present a demonstration application for dialogue-based search. In this system, a conversational agent engages with the user of an online search tool to support their search activities. Agent-supported conversational search of this type represents a fundamental advance beyond current standard search engines, such as web search tools. Analogous to the role of a human librarian, the agent can direct the user to potentially interesting retrieved information and provide suggestions to help progress the searcher’s activities.

Abhishek Kaushik, Nicolas Loir, Gareth J. F. Jones
LogUI: Contemporary Logging Infrastructure for Web-Based Experiments

Logging user interactions is fundamental to capturing and subsequently analysing user behaviours in the context of web-based Interactive Information Retrieval (IIR). However, logging is often implemented within experimental apparatus in a piecemeal fashion, leading to incomplete or noisy data. To address these issues, we present the LogUI logging framework. We use (now ubiquitous) contemporary web technologies to provide an easy-to-use yet powerful framework that can capture virtually any user interaction on a webpage. LogUI removes many of the complexities that must be considered for effective interaction logging.

David Maxwell, Claudia Hauff
LEMONS: Listenable Explanations for Music recOmmeNder Systems

Although current music recommender systems suggest new tracks to their users, they do not provide listenable explanations of why a user should listen to them. LEMONS (a demonstration video is available) is a new system that addresses this gap by (1) adopting a deep learning approach to generate audio content-based recommendations from the audio tracks and (2) providing listenable explanations based on the time-source segmentation of the recommended tracks using the recently proposed audioLIME.

Alessandro B. Melchiorre, Verena Haunschmid, Markus Schedl, Gerhard Widmer
Aspect-Based Passage Retrieval with Contextualized Discourse Vectors

Passage retrieval is the task of retrieving only the portions of a document that are relevant to a particular information need. One challenge medical doctors and researchers face is reading a large amount of novel literature. For example, since the outbreak of Coronavirus disease 2019 (COVID-19), tens of thousands of papers have been published each month about the disease. We demonstrate how we can support healthcare professionals in this exploratory research task with our neural passage retrieval system based on Contextualized Discourse Vectors (CDV). CDV captures the discourse of long documents at the sentence level and allows querying a large corpus with medical entities and aspects. Our demonstration covers over 27,000 diseases and 14,000 clinical aspects including symptoms, diagnostics, treatments and medications. It returns passages and highlights sentences to effectively answer clinical queries with up to 65% Recall@1. We showcase our system on the COVID-19 Open Research Dataset (CORD-19), Orphanet and Wikipedia diseases corpora.

Jens-Michalis Papaioannou, Manuel Mayrdorfer, Sebastian Arnold, Felix A. Gers, Klemens Budde, Alexander Löser
News Monitor: A Framework for Querying News in Real Time

News articles generated by online media are a major source of information. In this work, we present News Monitor, a framework that automatically collects news articles from a variety of web pages and performs various analysis tasks. The framework initially identifies fresh news and clusters articles about the same incidents. For every story, it extracts a Knowledge Base (KB) using open information extraction techniques and utilizes this KB in order to build a summary for the user. News Monitor allows users to query the articles in natural language using the state-of-the-art framework BERT. In addition, it allows users to query the KB in order to identify relevant articles. Finally, News Monitor crawls Twitter using a dynamic set of keywords in order to retrieve relevant messages. The framework is distributed, online and performs analysis in real time.

Antonia Saravanou, Nikolaos Panagiotou, Dimitrios Gunopulos
Chattack: A Gamified Crowd-Sourcing Platform for Tagging Deceptive & Abusive Behaviour

With the explosion of social networks, the web has been transformed into an arena of inappropriate interactions and content, such as fake news and misinformation, deception, hate speech, inauthentic online behaviour, proselytism, slander, and mobbing. In this demo we present Chattack, a first step towards our aim of providing publicly available datasets for accelerating research in the area of safer online conversations. Chattack is a crowd-sourcing web platform that allows the creation of textual dialogues containing inappropriate interactions or language. To make the platform sustainable and collect as many high-quality dialogues as possible, we build upon a gamified approach that can engage users and provide incentives for the completion of various tasks. We provide the details of our approach, present the functionality of the platform, stress its novel features, and discuss some preliminary results and the lessons learned. The platform is publicly available and we invite the participation of the community for its growth.

Emmanouil Smyrnakis, Katerina Papantoniou, Panagiotis Papadakos, Yannis Tzitzikas
PreFace++: Faceted Retrieval of Prerequisites and Technical Data

While learning new technical material, a user has difficulty when encountering new concepts for which she does not have the necessary prerequisite knowledge. Determining the right set of prerequisites is challenging because it involves multiple searches on the web. Although a number of techniques have been proposed to retrieve prerequisites, none of them consider grouping prerequisites into interesting facets. To address this issue, we have developed a system called PreFace++, which assists a user in learning new topics. PreFace++ is an extension of our previous system PreFace. It takes a query as input and returns (i) a prerequisite graph, where the nodes represent prerequisites for the query and edges indicate prerequisite relationships, (ii) a set of interesting facets towards understanding the query, (iii) prerequisites for the query and the facet, and (iv) a set of research papers and posts relevant for the query and the facet to explore the relationship between the query and the facet. The backbone of PreFace++ is TeKnowbase, which is a knowledge base in Computer Science.

Prajna Upadhyay, Maya Ramanath
Brief Description of COVID-SEE: The Scientific Evidence Explorer for COVID-19 Related Research

We present COVID-SEE, a system for medical literature discovery based on the concept of information exploration, which builds on several distinct text analysis and natural language processing methods to structure and organise information in publications, and augments search through a visual overview of a collection enabling exploration to identify key articles of interest. We developed this system over COVID-19 literature to help medical professionals and researchers explore the literature evidence, and improve findability of relevant information. COVID-SEE is available at .

Karin Verspoor, Simon Šuster, Yulia Otmakhova, Shevon Mendis, Zenan Zhai, Biaoyan Fang, Jey Han Lau, Timothy Baldwin, Antonio Jimeno Yepes, David Martinez

CLEF 2021 Lab Descriptions

Overview of PAN 2021: Authorship Verification, Profiling Hate Speech Spreaders on Twitter, and Style Change Detection
Extended Abstract

The paper gives a brief overview of the three shared tasks to be organized at the PAN 2021 lab on digital text forensics and stylometry hosted at the CLEF conference. The tasks include authorship verification across domains, author profiling for hate speech spreaders, and style change detection for multi-author documents. In part the tasks are new and in part they continue and advance past shared tasks, with the overall goal of advancing the state of the art, providing for an objective evaluation on newly developed benchmark datasets.

Janek Bevendorff, Berta Chulvi, Gretel Liz De La Peña Sarracén, Mike Kestemont, Enrique Manjavacas, Ilia Markov, Maximilian Mayerl, Martin Potthast, Francisco Rangel, Paolo Rosso, Efstathios Stamatatos, Benno Stein, Matti Wiegmann, Magdalena Wolska, Eva Zangerle
Overview of Touché 2021: Argument Retrieval
Extended Abstract

Technologies for argument mining and argumentation analysis are maturing rapidly, so that the retrieval of arguments in search scenarios becomes a feasible objective. For the second time, we organize the Touché lab on argument retrieval with two shared tasks: (1) argument retrieval for controversial questions, where arguments are to be retrieved from a focused debate portal-based collection, and (2) argument retrieval for comparative questions, where argumentative documents are to be retrieved from a generic web crawl. In this paper, we briefly summarize the results of Touché 2020, the first edition of the lab, and describe the planned setup for the second edition at CLEF 2021.

Alexander Bondarenko, Lukas Gienapp, Maik Fröbe, Meriem Beloucif, Yamen Ajjour, Alexander Panchenko, Chris Biemann, Benno Stein, Henning Wachsmuth, Martin Potthast, Matthias Hagen
Text Simplification for Scientific Information Access
CLEF 2021 SimpleText Workshop

Modern information access systems hold the promise to give users direct access to key information from authoritative primary sources such as scientific literature, but non-experts tend to avoid these sources due to their complex language, internal vernacular, or a lack of prior background knowledge. Text simplification approaches can remove some of these barriers, thereby keeping users from relying on shallow information in sources that prioritize commercial or political incentives over correctness and informational value. The CLEF 2021 SimpleText track will address the opportunities and challenges of text simplification approaches to improve scientific information access head-on. We aim to provide appropriate data and benchmarks, starting with pilot tasks in 2021, and create a community of NLP and IR researchers working together to resolve one of the greatest challenges of today.

Liana Ermakova, Patrice Bellot, Pavel Braslavski, Jaap Kamps, Josiane Mothe, Diana Nurbakova, Irina Ovchinnikova, Eric San-Juan
CLEF eHealth Evaluation Lab 2021

Motivated by the ever-increasing difficulties faced by laypeople in retrieving and digesting valid and relevant information to make health-centred decisions, the CLEF eHealth lab series has offered shared tasks to the community in the fields of Information Extraction (IE), management, and Information Retrieval (IR) since 2013. These tasks have attracted large participation and led to statistically significant improvements in processing quality. In 2021, CLEF eHealth is calling for participants to contribute to the following two tasks: Task 1 on IE focuses on IE from noisy text. Participants will identify and classify Named Entities in written ultrasonography reports, containing misspellings and inconsistencies, from a major public hospital in Argentina. Identified entities will then have to be classified, which can be very challenging as it requires handling lexical variations. Task 2 is a novel extension of the most popular and established task on consumer health search (CHS), aiming at retrieving relevant, understandable, and credible information for patients and their next of kin. In this paper we describe recent advances in the fields of IE and IR, and the subsequent offerings of this year's CLEF eHealth lab challenges.

Lorraine Goeuriot, Hanna Suominen, Liadh Kelly, Laura Alonso Alemany, Nicola Brew-Sam, Viviana Cotik, Darío Filippo, Gabriela Gonzalez Saez, Franco Luque, Philippe Mulhem, Gabriella Pasi, Roland Roller, Sandaru Seneviratne, Jorge Vivaldi, Marco Viviani, Chenchen Xu
LifeCLEF 2021 Teaser: Biodiversity Identification and Prediction Challenges

Building accurate knowledge of the identity, the geographic distribution and the evolution of species is essential for the sustainable development of humanity, as well as for biodiversity conservation. However, the difficulty of identifying plants and animals in the field is hindering the aggregation of new data and knowledge. Identifying and naming living plants or animals is almost impossible for the general public and is often difficult even for professionals and naturalists. Bridging this gap is a key step towards enabling effective biodiversity monitoring systems. The LifeCLEF campaign, presented in this paper, has been promoting and evaluating advances in this domain since 2011. The 2021 edition proposes four data-oriented challenges related to the identification and prediction of biodiversity: (i) PlantCLEF: cross-domain plant identification based on herbarium sheets, (ii) BirdCLEF: bird species recognition in audio soundscapes, (iii) GeoLifeCLEF: location-based prediction of species based on environmental and occurrence data and (iv) SnakeCLEF: image-based snake identification.

Alexis Joly, Hervé Goëau, Elijah Cole, Stefan Kahl, Lukáš Picek, Hervé Glotin, Benjamin Deneu, Maximilien Servajean, Titouan Lorieul, Willem-Pier Vellinga, Pierre Bonnet, Andrew M. Durso, Rafael Ruiz de Castañeda, Ivan Eggel, Henning Müller
ChEMU 2021: Reaction Reference Resolution and Anaphora Resolution in Chemical Patents

Chemical patents serve as an indispensable source of information about new discoveries of chemical compounds. The ChEMU (Cheminformatics Elsevier Melbourne University) lab addresses information extraction over chemical patents, and aims to advance the state of the art on this topic. ChEMU lab 2021, as part of the 12th Conference and Labs of the Evaluation Forum (CLEF-2021), will be the second ChEMU lab. ChEMU 2021 will provide two distinct tasks related to reference resolution in chemical patents. Task 1—Chemical Reaction Reference Resolution—focuses on paragraph-level references and aims to identify the chemical reactions or general conditions specified in one reaction description referred to by another. Task 2—Anaphora Resolution—focuses on expression-level references and aims to identify the reference relationships between expressions in chemical reaction descriptions. In this paper, we introduce ChEMU 2021, including its motivation, goals, tasks, resources, and evaluation framework.

Jiayuan He, Biaoyan Fang, Hiyori Yoshikawa, Yuan Li, Saber A. Akhondi, Christian Druckenbrodt, Camilo Thorne, Zubair Afzal, Zenan Zhai, Lawrence Cavedon, Trevor Cohn, Timothy Baldwin, Karin Verspoor
The 2021 ImageCLEF Benchmark: Multimedia Retrieval in Medical, Nature, Internet and Social Media Applications

This paper presents the ideas for the 2021 ImageCLEF lab that will be organized as part of the Conference and Labs of the Evaluation Forum (CLEF Labs 2021) in Bucharest, Romania. ImageCLEF is an ongoing evaluation initiative (active since 2003) that promotes the evaluation of technologies for annotation, indexing and retrieval of visual data with the aim of providing information access to large collections of images in various usage scenarios and domains. In 2021, the 19th edition of ImageCLEF will organize four main tasks: (i) a Medical task addressing visual question answering, a concept annotation and a tuberculosis classification task, (ii) a Coral task addressing the annotation and localisation of substrates in coral reef images, (iii) a DrawnUI task addressing the creation of websites from either a drawing or a screenshot by detecting the different elements present on the design, and (iv) a new Aware task addressing the prediction of real-life consequences of online photo sharing. The strong participation in 2020, despite the COVID pandemic, with over 115 research groups registering and 40 submitting over 295 runs for the tasks, shows considerable interest in this benchmarking campaign. We expect the new tasks to attract at least as many researchers for 2021.

Bogdan Ionescu, Henning Müller, Renaud Péteri, Asma Ben Abacha, Dina Demner-Fushman, Sadid A. Hasan, Mourad Sarrouti, Obioma Pelka, Christoph M. Friedrich, Alba G. Seco de Herrera, Janadhip Jacutprakart, Vassili Kovalev, Serge Kozlovski, Vitali Liauchuk, Yashin Dicente Cid, Jon Chamberlain, Adrian Clark, Antonio Campello, Hassan Moustahfid, Thomas Oliver, Abigail Schulz, Paul Brie, Raul Berari, Dimitri Fichou, Andrei Tauteanu, Mihai Dogariu, Liviu Daniel Stefan, Mihai Gabriel Constantin, Jérôme Deshayes, Adrian Popescu
BioASQ at CLEF2021: Large-Scale Biomedical Semantic Indexing and Question Answering

This paper describes the ninth edition of the BioASQ Challenge, which will run as an evaluation Lab in the context of CLEF2021. The aim of BioASQ is the promotion of systems and methods for highly precise biomedical information access. This is done through the organization of a series of challenges (shared tasks) on large-scale biomedical semantic indexing and question answering, where different teams develop systems that compete on the same demanding benchmark datasets that represent the real information needs of biomedical experts. In order to facilitate this information finding process, the BioASQ challenge introduced two complementary tasks: (a) the automated indexing of large volumes of unlabelled data, primarily scientific articles, with biomedical concepts, and (b) the processing of biomedical questions and the generation of comprehensible answers. By rewarding the most competitive systems that outperform the state of the art, BioASQ manages to push the research frontier towards ensuring that biomedical experts will have direct access to valuable knowledge.

Anastasia Krithara, Anastasios Nentidis, Georgios Paliouras, Martin Krallinger, Antonio Miranda
Advancing Math-Aware Search: The ARQMath-2 Lab at CLEF 2021

ARQMath-2 is a continuation of the ARQMath Lab at CLEF 2020, with two main tasks: (1) finding answers to mathematical questions among posted answers on a community question answering site (Math Stack Exchange), and (2) formula retrieval, where formulae in question posts serve as queries for formulae in earlier question and answer posts; the relevance of retrieved formulae considers the context of the posts in which query and retrieved formulae appear. The 2020 Lab created a large new test collection and established strong baselines for both tasks. Plans for ARQMath-2 include extending the same test collection with additional topics, provision of standard components for optional use by teams new to the task, and post-hoc evaluation scripts to support tuning of new systems that did not contribute to the 2020 judgment pools.

Behrooz Mansouri, Anurag Agarwal, Douglas W. Oard, Richard Zanibbi
The CLEF-2021 CheckThat! Lab on Detecting Check-Worthy Claims, Previously Fact-Checked Claims, and Fake News

We describe the fourth edition of the CheckThat! Lab, part of the 2021 Cross-Language Evaluation Forum (CLEF). The lab evaluates technology supporting various tasks related to factuality, and it is offered in Arabic, Bulgarian, English, and Spanish. Task 1 asks to predict which tweets in a Twitter stream are worth fact-checking (focusing on COVID-19). Task 2 asks to determine whether a claim in a tweet can be verified using a set of previously fact-checked claims. Task 3 asks to predict the veracity of a target news article and its topical domain. The evaluation is carried out using mean average precision or precision at rank k for the ranking tasks, and $F_1$ for the classification tasks.
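
The three measures named in this abstract can be illustrated with a minimal Python sketch. It is not the lab's official scorer: the function names are hypothetical, relevance labels are assumed to be binary (1 = relevant or positive, 0 = not), average precision is normalized by the number of relevant items that appear in the ranking, and the F1 shown is the plain binary variant for a single positive class, whereas the lab may use a different averaging scheme.

```python
from typing import Sequence


def precision_at_k(ranked_labels: Sequence[int], k: int) -> float:
    """Fraction of the top-k ranked items that are relevant (label 1)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(ranked_labels[:k]) / k


def average_precision(ranked_labels: Sequence[int]) -> float:
    """Mean of the precision values at each rank holding a relevant item.

    Assumption: every relevant item appears somewhere in the ranking, so
    the number of hits found equals the total number of relevant items.
    """
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings: Sequence[Sequence[int]]) -> float:
    """Average of the per-query average precision values."""
    if not rankings:
        raise ValueError("at least one ranking is required")
    return sum(average_precision(r) for r in rankings) / len(rankings)


def f1_score(gold: Sequence[int], predicted: Sequence[int]) -> float:
    """Binary F1: harmonic mean of precision and recall for class 1."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Two toy queries; each list gives the relevance of items in rank order.
    rankings = [[1, 0, 1, 0], [0, 1]]
    print(precision_at_k(rankings[0], 2))
    print(mean_average_precision(rankings))
    print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))
```

For the ranking tasks, each inner list would hold the gold labels of one query's results in system rank order; for the classification tasks, `gold` and `predicted` are aligned label lists.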

Preslav Nakov, Giovanni Da San Martino, Tamer Elsayed, Alberto Barrón-Cedeño, Rubén Míguez, Shaden Shaar, Firoj Alam, Fatima Haouari, Maram Hasanain, Nikolay Babulkov, Alex Nikolov, Gautam Kishore Shahi, Julia Maria Struß, Thomas Mandl
eRisk 2021: Pathological Gambling, Self-harm and Depression Challenges

eRisk, a CLEF lab oriented to early risk prediction on the Internet, started in 2017 as a forum to foster experimentation on early risk detection. After four editions (2017, 2018, 2019 and 2020), the lab has created many reference collections in the field and organized multiple early risk detection challenges using those datasets. Each challenge focused on a specific early risk detection problem (e.g., depression, anorexia or self-harm). This paper describes the work done so far, discusses the main lessons learned over the past editions and the plans for the eRisk 2021 edition, where we introduced pathological gambling as a new early risk detection challenge.

Javier Parapar, Patricia Martín-Rodilla, David E. Losada, Fabio Crestani
Living Lab Evaluation for Life and Social Sciences Search Platforms - LiLAS at CLEF 2021

Meta-evaluation studies of system performances in controlled offline evaluation campaigns, like TREC and CLEF, show a need for innovation in evaluating IR-systems. The field of academic search is no exception to this. This might be related to the fact that relevance in academic search is multi-layered and therefore the aspect of user-centric evaluation is becoming more and more important. The Living Labs for Academic Search (LiLAS) lab aims to strengthen the concept of user-centric living labs for the domain of academic search by allowing participants to evaluate their retrieval approaches in two real-world academic search systems from the life sciences and the social sciences. To this end, we provide participants with metadata on the systems’ content as well as candidate lists with the task to rank the most relevant candidate to the top. Using the STELLA-infrastructure, we allow participants to easily integrate their approaches into the real-world systems and provide the possibility to compare different approaches at the same time.

Philipp Schaer, Johann Schaible, Leyla Jael Castro

Doctoral Consortium Papers

Automated Multi-document Text Summarization from Heterogeneous Data Sources

We are currently witnessing an exponential increase of data which emanate from varied sources such as different types of records in companies, online social networks, videos, unstructured text in web pages, and others. It is very challenging to process this sparse, noisy and domain-specific data. For instance, BT, a technology company in the UK and the PhD project sponsor, has a significant workforce of field engineers, desk-based agents, and customer support services who generate, collect and manage large volumes of temporally organized unstructured and semi-structured information every day. The problem that they face is effectively and efficiently answering client questions regarding order status live. The main challenge in my PhD is to propose computational models that can effectively and efficiently distil, from various technical order record documents that are very noisy and follow no structural pattern, the relevant information a user needs to answer the client’s questions. To this end, it would be useful to automatically summarize and derive meaningful information in the form of short answers to queries from this vast source of distributed data. The solution that my thesis proposes is based on a computational model that jointly learns extractive and abstractive summarization techniques with a temporal structure. In addition, the model interacts with a question-answering framework to help answer questions.

Mahsa Abazari Kia
Background Linking of News Articles

Nowadays, it is very rare to find a single news article that solely contains all the information about a certain subject or event. Very recently, a number of methods were proposed to find background articles that can be linked to a query article to help readers understand its context, whenever they are reading it. These methods, however, are still far from reaching an optimal performance. In my thesis, I propose techniques that aim to improve the background linking process for online news articles. For example, I propose to exploit different techniques to construct representative search queries from the query article that can be effectively employed to retrieve the required background links in an ad-hoc setting. Moreover, I aim to study how to train neural models that can learn the background relevance between pairs of articles. Through the proposed techniques, I aim to experiment with the possible criteria that may distinguish useful background articles from non-relevant ones, such as their semantic and lexical similarities, and the granularity of the topics discussed in each. Defining these criteria will enable understanding the notion of background relevance, and accordingly allow for effective background links retrieval.

Marwa Essam
Multidimensional Relevance in Task-Specific Retrieval

Several criteria of relevance have been proposed in the literature. However, relevance criteria are strongly related to the search task. Thus, it is important to employ the criteria that are useful for the considered search task. This research explores the concept of multidimensional relevance in a specific search task. Firstly, we want to investigate search tasks and the related relevance dimensions. Then, we intend to explore the approaches that can be used to combine more than one relevance dimension. The goal of this study is to improve the retrieval system in a specific task.

Divi Galih Prasetyo Putri
Deep Semantic Entity Linking

Named entity linking systems are an essential component in text mining pipelines, mapping entity mentions in the text to the appropriate knowledge base identifiers. However, the current systems have several limitations affecting their performance: the lack of context of the entity mentions, the incomplete disambiguation graphs and the lack of approaches to deal with unlinkable entity mentions. The PhD project will focus on solving the aforementioned challenges in order to develop a NEL model which outperforms state-of-the-art performance in Biomedical and Life Sciences domains.

Pedro Ruas
Deep Learning System for Biomedical Relation Extraction Combining External Sources of Knowledge

Successful biomedical relation extraction can provide evidence to researchers about possible unknown associations between entities, advancing our current knowledge about those entities and their inherent processes. Multiple relation extraction approaches have been proposed to identify relations between concepts in literature, namely using neural network algorithms. However, the incorporation of semantics is still scarce. This project proposes that using external semantic sources of knowledge along with the latest state-of-the-art language representations can improve the current performance of biomedical relation extraction both in English and non-English languages. The goal is to build a relation extraction system using state-of-the-art language representations, such as BERT and ELMo, with semantics retrieved from external sources of knowledge, such as domain-specific ontologies, graph attention mechanisms, and semantic similarity measures.

Diana Sousa


Second International Workshop on Algorithmic Bias in Search and Recommendation (BIAS@ECIR2021)

Providing efficient and effective search and recommendation algorithms has been traditionally the main objective for the industrial and academic research communities. However, recent studies have shown that optimizing models through these algorithms may reinforce the existing societal biases, especially under certain circumstances (e.g., when historical users’ behavioral data is used for training). Identifying and mitigating data and algorithmic biases thus becomes a crucial aspect, ensuring that these models have a positive impact on the stakeholders involved in the search and recommendation processes. The BIAS 2021 workshop aims to collect novel contributions in this emerging field, providing a common ground for researchers and practitioners.

Ludovico Boratto, Stefano Faralli, Mirko Marras, Giovanni Stilo
The 4th International Workshop on Narrative Extraction from Texts: Text2Story 2021

Narrative extraction, understanding and visualization is currently a popular topic and an important tool for humans interested in achieving a deeper understanding of text. Information Retrieval (IR), Natural Language Processing (NLP) and Machine Learning (ML) already offer many instruments that aid the exploration of narrative elements in text and within unstructured data. Despite evident advances in the last couple of years, the problem of automatically representing narratives in a structured form, beyond the conventional identification of common events, entities and their relationships, is yet to be solved. This workshop, held virtually on April 1st, 2021 and co-located with the 43rd European Conference on Information Retrieval (ECIR’21), aims at presenting and discussing current and future directions for IR, NLP, ML and other computational fields capable of improving the automatic understanding of narratives. It includes a session devoted to regular, short and demo papers, keynote talks and space for an informal discussion of the methods, of the challenges and of the future of the area.

Ricardo Campos, Alípio Jorge, Adam Jatowt, Sumit Bhatia, Mark Finlayson
Bibliometric-Enhanced Information Retrieval: 11th International BIR Workshop

The Bibliometric-enhanced Information Retrieval (BIR) workshop series at ECIR tackles issues related to academic search, at the intersection of Information Retrieval, Natural Language Processing and Bibliometrics. BIR is a hot topic investigated by both academia and the industry. In this overview paper, we summarize the 11th iteration of the workshop and present the workshop topics for 2021.

Ingo Frommholz, Philipp Mayr, Guillaume Cabanac, Suzan Verberne
MICROS: Mixed-Initiative ConveRsatiOnal Systems Workshop

The 1st edition of the workshop on Mixed-Initiative ConveRsatiOnal Systems (MICROS@ECIR2021) aims at investigating and collecting novel ideas and contributions in the field of conversational systems. Oftentimes, the users fulfill their information need using smartphones and home assistants. This has revolutionized the way users access online information, thus posing new challenges compared to traditional search and recommendation. The first edition of MICROS will have a particular focus on mixed-initiative conversational systems. Indeed, conversational systems need to be proactive, proposing not only answers but also possible interpretations for ambiguous or vague requests.

Ida Mele, Cristina Ioana Muntean, Mohammad Aliannejadi, Nikos Voskarides
ROMCIR 2021: Reducing Online Misinformation through Credible Information Retrieval

The Reducing Online Misinformation through Credible Information Retrieval (ROMCIR) 2021 Workshop, as part of the satellite events of the 43rd European Conference on Information Retrieval (ECIR), is concerned with providing users with access to genuine information, to mitigate the information disorder phenomenon characterizing the current online digital ecosystem. This problem is very broad, as it concerns different information objects (e.g., Web pages, online accounts, social media posts, etc.) on different platforms, and different domains and purposes (e.g., detecting fake news, retrieving credible health-related information, reducing propaganda and hate-speech, etc.). In this context, all those approaches that can serve, from different perspectives, to tackle the credible information access problem, find their place.

Fabio Saracco, Marco Viviani