2022 | Book

Advances in Information Retrieval

44th European Conference on IR Research, ECIR 2022, Stavanger, Norway, April 10–14, 2022, Proceedings, Part II

Editors: Matthias Hagen, Suzan Verberne, Craig Macdonald, Christin Seifert, Krisztian Balog, Kjetil Nørvåg, Vinay Setty

Publisher: Springer International Publishing

Book Series: Lecture Notes in Computer Science


About this book

This two-volume set LNCS 13185 and 13186 constitutes the refereed proceedings of the 44th European Conference on IR Research, ECIR 2022, held in April 2022, due to the COVID-19 pandemic.

The 35 full papers presented together with 11 reproducibility papers, 13 CLEF lab description papers, 12 doctoral consortium papers, 5 workshop abstracts, and 4 tutorial abstracts were carefully reviewed and selected from 395 submissions.

Chapter “Leveraging Customer Reviews for E-commerce Query Generation” of this book is available open access under a CC BY 4.0 license.

Table of Contents


Short Papers

Improving BERT-based Query-by-Document Retrieval with Multi-task Optimization

Query-by-document (QBD) retrieval is an Information Retrieval task in which a seed document acts as the query and the goal is to retrieve related documents; it is particularly common in professional search tasks. In this work we improve the retrieval effectiveness of the BERT re-ranker, proposing an extension to its fine-tuning step to better exploit the context of queries. To this end, we use an additional document-level representation learning objective besides the ranking objective when fine-tuning the BERT re-ranker. Our experiments on two QBD retrieval benchmarks show that the proposed multi-task optimization significantly improves the ranking effectiveness without changing the BERT re-ranker or using additional training samples. In future work, the generalizability of our approach to other retrieval tasks should be further investigated.

Amin Abolghasemi, Suzan Verberne, Leif Azzopardi
Passage Retrieval on Structured Documents Using Graph Attention Networks

Passage Retrieval systems aim at retrieving and ranking small text units according to their estimated relevance to a query. A common practice is to consider the context a passage appears in (its containing document, neighbouring passages, etc.) to improve its relevance estimation. In this work, we study the use of Graph Attention Networks (GATs), a graph node embedding method, to perform passage contextualization. More precisely, we first propose a document graph representation based on several inter- and intra-document relations. Then, we investigate two ways of applying GATs to this representation in order to incorporate contextual information for passage retrieval. We evaluate our approach on a Passage Retrieval task for structured documents: CLEF-IP2013. Our results show that our document graph representation coupled with the expressive power of GATs allows for a better context representation, leading to improved performance.

Lucas Albarede, Philippe Mulhem, Lorraine Goeuriot, Claude Le Pape-Gardeux, Sylvain Marie, Trinidad Chardin-Segui
Expert Finding in Legal Community Question Answering

Expert finding has been well studied in community question answering (QA) systems in various domains. However, none of these studies addresses expert finding in the legal domain, where the goal is for citizens to find lawyers based on their expertise. In the legal domain, there is a large knowledge gap between the experts and the searchers, and the content on legal QA websites consists of a combination of formal and informal communication. In this paper, we propose methods for generating query-dependent textual profiles for lawyers covering several aspects including sentiment, comments, and recency. We combine query-dependent profiles with existing expert finding methods. Our experiments are conducted on a novel dataset gathered from an online legal QA service. We discovered that taking into account different lawyer profile aspects improves the best baseline model. We make our dataset publicly available for future work.

Arian Askari, Suzan Verberne, Gabriella Pasi
Towards Building Economic Models of Conversational Search

Various conceptual and descriptive models of conversational search have been proposed in the literature. While useful, they do not provide insights into how interaction between the agent and user would change in response to the costs and benefits of the different interactions. In this paper, we develop two economic models of conversational search based on patterns previously observed during conversational search sessions, which we refer to as Feedback First, where the agent asks clarifying questions and then presents results, and Feedback After, where the agent presents results and then asks follow-up questions. Our models show that the amount of feedback given/requested depends on its efficiency at improving the initial or subsequent query and the relative cost of providing said feedback. This theoretical framework for conversational search provides a number of insights that can be used to guide and inform the development of conversational search agents. However, empirical work is needed to estimate the parameters in order to make predictions specific to a given conversational search setting.

Leif Azzopardi, Mohammad Aliannejadi, Evangelos Kanoulas
Evaluating the Use of Synthetic Queries for Pre-training a Semantic Query Tagger

Semantic Query Labeling is the task of locating the constituent parts of a query and assigning domain-specific semantic labels to each of them. It allows unfolding the relations between the query terms and the documents’ structure while leaving unaltered the keyword-based query formulation. In this paper, we investigate the pre-training of a semantic query-tagger with synthetic data generated by leveraging the documents’ structure. By simulating a dynamic environment, we also evaluate the consistency of performance improvements brought by pre-training as real-world training data becomes available. The results of our experiments suggest both the utility of pre-training with synthetic data and its improvements’ consistency over time.

Elias Bassani, Gabriella Pasi
A Light-Weight Strategy for Restraining Gender Biases in Neural Rankers

In light of recent studies that show neural retrieval methods may intensify gender biases during retrieval, the objective of this paper is to propose a simple yet effective sampling strategy for training neural rankers that would allow the rankers to maintain their retrieval effectiveness while reducing gender biases. Our work proposes to consider the degrees of gender bias when sampling documents to be used for training neural rankers. We report our findings on the MS MARCO collection and based on different query datasets released for this purpose in the literature. Our results show that the proposed light-weight strategy can show competitive (or even better) performance compared to the state-of-the-art neural architectures specifically designed to reduce gender biases.

Amin Bigdeli, Negar Arabzadeh, Shirin Seyedsalehi, Morteza Zihayat, Ebrahim Bagheri
Recommender Systems: When Memory Matters

In this paper, we study the effect of non-stationarities and memory on the learnability of a sequential recommender system that exploits users' implicit feedback. We propose an algorithm in which model parameters are updated user by user by minimizing a ranking loss over blocks of items, each constituted by a sequence of unclicked items followed by a clicked one. We illustrate through empirical evaluations on four large-scale benchmarks that removing non-stationarities in users' interaction behaviour, through an empirical estimation of the memory properties, yields performance gains with respect to MAP and NDCG.

Aleksandra Burashnikova, Marianne Clausel, Massih-Reza Amini, Yury Maximov, Nicolas Dante
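The block structure described in the abstract above (a run of unclicked items followed by a clicked one) can be illustrated with a minimal Python sketch. The function name `split_into_blocks`, the `(item_id, clicked)` input format, and the handling of edge cases (clicks with no preceding unclicked items, trailing unclicked items) are assumptions made for illustration; they are not taken from the paper.

```python
from typing import Iterable, List, Tuple

Interaction = Tuple[str, bool]   # (item_id, clicked) -- hypothetical input format
Block = Tuple[List[str], str]    # (unclicked item ids, clicked item id)


def split_into_blocks(interactions: Iterable[Interaction]) -> List[Block]:
    """Group one user's chronologically ordered interactions into blocks.

    A block is a run of unclicked items followed by one clicked item.
    Assumption of this sketch: a click that is not preceded by any
    unclicked item, and unclicked items left over at the end of the
    sequence, do not form a block and are dropped.
    """
    blocks: List[Block] = []
    pending: List[str] = []
    for item_id, clicked in interactions:
        if not clicked:
            pending.append(item_id)
        elif pending:
            blocks.append((pending, item_id))
            pending = []
    return blocks


history = [("a", False), ("b", False), ("c", True), ("d", True),
           ("e", False), ("f", True), ("g", False)]
blocks = split_into_blocks(history)
print(blocks)
```

A ranking loss could then be evaluated per block, comparing the clicked item against each unclicked item in the same block; the choice of loss is left open here.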
Groupwise Query Performance Prediction with BERT

While large-scale pre-trained language models like BERT have advanced the state-of-the-art in IR, their application in query performance prediction (QPP) is so far based on pointwise modeling of individual queries. Meanwhile, recent studies suggest that the cross-attention modeling of a group of documents can effectively boost performance for both learning-to-rank algorithms and BERT-based re-ranking. To this end, a BERT-based groupwise QPP model is proposed, in which the ranking contexts of a list of queries are jointly modeled to predict the relative performance of individual queries. Extensive experiments on three standard TREC collections demonstrate the effectiveness of our approach. Our code is available online.

Xiaoyang Chen, Ben He, Le Sun
How Can Graph Neural Networks Help Document Retrieval: A Case Study on CORD19 with Concept Map Generation

Graph neural networks (GNNs), as a group of powerful tools for representation learning on irregular data, have manifested superiority in various downstream tasks. With unstructured texts represented as concept maps, GNNs can be exploited for tasks like document retrieval. Intrigued by how GNNs can help document retrieval, we conduct an empirical study on a large-scale multi-discipline dataset, CORD-19. Results show that instead of the complex structure-oriented GNNs such as GINs and GATs, our proposed semantics-oriented graph functions achieve better and more stable performance based on the BM25 retrieved candidates. Our insights in this case study can serve as a guideline for future work to develop effective GNNs with appropriate semantics-oriented inductive biases for textual reasoning tasks like document retrieval and classification. All code for this case study is available online.

Hejie Cui, Jiaying Lu, Yao Ge, Carl Yang
Leveraging Content-Style Item Representation for Visual Recommendation

When customers’ choices may depend on the visual appearance of products (e.g., fashion), visually-aware recommender systems (VRSs) have been shown to provide more accurate preference predictions than pure collaborative models. To refine recommendations, recent VRSs have tried to recognize the influence of each item’s visual characteristics on users’ preferences, for example, through attention mechanisms. Such visual characteristics may come in the form of content-level item metadata (e.g., image tags) and reviews, which are not always easily accessible, or image regions-of-interest (e.g., the collar of a shirt), which miss items’ style. To address these limitations, we propose a pipeline for visual recommendation, built upon the adoption of those features that can be easily extracted from item images and represent the item content on a stylistic level (i.e., color, shape, and category of a fashion product). Then, we inject such features into a VRS that exploits attention mechanisms to uncover users’ personalized importance for each content-style item feature and a neural architecture to model non-linear patterns within user-item interactions. We show that our solution can reach a competitive accuracy and beyond-accuracy trade-off compared with other baselines on two fashion datasets. Code and datasets are available online.

Yashar Deldjoo, Tommaso Di Noia, Daniele Malitesta, Felice Antonio Merra
Does Structure Matter? Leveraging Data-to-Text Generation for Answering Complex Information Needs

In this work, our aim is to provide a structured answer in natural language to a complex information need. Particularly, we envision using generative models from the perspective of data-to-text generation. We propose the use of a content selection and planning pipeline which aims at structuring the answer by generating intermediate plans. The experimental evaluation is performed using the TREC Complex Answer Retrieval (CAR) dataset. We evaluate both the generated answer and its corresponding structure and show the effectiveness of planning-based models in comparison to a text-to-text model.

Hanane Djeddal, Thomas Gerald, Laure Soulier, Karen Pinel-Sauvagnat, Lynda Tamine
Temporal Event Reasoning Using Multi-source Auxiliary Learning Objectives

Temporal event reasoning is vital in modern information-driven applications operating on news articles, social media, financial reports, etc. Recent works train deep neural nets to infer temporal events and relations from text. We improve upon the state-of-the-art by proposing an approach that injects additional temporal knowledge into the pre-trained model from two sources: (i) part-of-speech tagging and (ii) question constraints. Auxiliary learning objectives allow us to incorporate this temporal information into the training process. Our experiments show that these types of multi-source auxiliary learning objectives lead to better temporal reasoning. Our model improves over the state-of-the-art model on the TORQUE question answering benchmark by 1.1% and on the MATRES relation extraction benchmark by 2.8% in F1 score.

Xin Dong, Tanay Kumar Saha, Ke Zhang, Joel Tetreault, Alejandro Jaimes, Gerard de Melo
Enhanced Sentence Meta-Embeddings for Textual Understanding

Sentence embeddings provide vector representations for sentences and short texts, enabling the capture of contextual and semantic meaning for different applications. However, the diversity of sentence embedding techniques poses a challenge in terms of choosing the model best suited for the downstream task. As such, meta-embeddings study different techniques for combining embeddings from multiple sources. In this paper, we propose CINCE, a principled meta-embedding framework for aggregating various semantic information, captured by different embedding techniques, via multiple component analysis strategies. Experiments on the SentEval benchmark exhibit improved performance for semantic understanding and text classification, compared to existing approaches.

Sourav Dutta, Haytham Assem
Match Your Words! A Study of Lexical Matching in Neural Information Retrieval

Neural Information Retrieval models hold the promise to replace lexical matching models, e.g. BM25, in modern search engines. While their capabilities have fully shone on in-domain datasets like MS MARCO, they have recently been challenged on out-of-domain zero-shot settings (BEIR benchmark), questioning their actual generalization capabilities compared to bag-of-words approaches. Particularly, we wonder if these shortcomings could (partly) be the consequence of the inability of neural IR models to perform lexical matching off-the-shelf. In this work, we propose a measure of discrepancy between the lexical matching performed by any (neural) model and an “ideal” one. Based on this, we study the behavior of different state-of-the-art neural IR models, focusing on whether they are able to perform lexical matching when it’s actually useful, i.e. for important terms. Overall, we show that neural IR models fail to properly generalize term importance on out-of-domain collections or terms almost unseen during training.

Thibault Formal, Benjamin Piwowarski, Stéphane Clinchant
CARES: CAuse Recognition for Emotion in Suicide Notes

Inspired by recent advances in emotion-cause extraction in texts and its potential in research on computational studies in suicide motives and tendencies and mental health, we address the problem of cause identification and cause extraction for emotion in suicide notes. We introduce an emotion-cause annotated suicide corpus of 5769 sentences by labeling the benchmark CEASE-v2.0 dataset (4932 sentences) with causal spans for existing annotated emotions. Furthermore, we expand the utility of the existing dataset by adding emotion and emotion cause annotations for an additional 837 sentences collected from 67 non-English suicide notes (Hindi, Bangla, Telugu). Our proposed approaches to emotion-cause identification and extraction are based on pre-trained transformer-based models that attain performance figures of 83.20% accuracy and 0.76 Ratcliff-Obershelp similarity, respectively. The findings suggest that existing computational methods can be adapted to address these challenging tasks, opening up new research areas.

Soumitra Ghosh, Swarup Roy, Asif Ekbal, Pushpak Bhattacharyya
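The abstract above reports cause-extraction quality as a Ratcliff-Obershelp similarity between spans. As a rough illustration of how such a score behaves, Python's standard `difflib.SequenceMatcher` can be used: its documentation describes the underlying algorithm as closely related to (and somewhat fancier than) Ratcliff and Obershelp's "gestalt pattern matching", and `ratio()` returns 2·M/T, where M is the number of matched characters and T the total length of both strings. Whether the authors used this implementation is not stated in the abstract; the helper name `span_similarity` is hypothetical.

```python
from difflib import SequenceMatcher


def span_similarity(predicted: str, gold: str) -> float:
    """Similarity in [0, 1] between a predicted and a gold cause span.

    Uses difflib's Ratcliff/Obershelp-style matcher on characters.
    autojunk=False disables difflib's popularity heuristic so that long
    spans are compared without any characters being treated as junk.
    """
    return SequenceMatcher(None, predicted, gold, autojunk=False).ratio()


# "bcd" is the longest common block: M = 3, T = 4 + 4, so 2 * 3 / 8.
score = span_similarity("abcd", "bcde")
print(score)
```

Identical spans score 1.0 and spans with no characters in common score 0.0, so partial overlaps between predicted and annotated spans receive partial credit.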
Identifying Suitable Tasks for Inductive Transfer Through the Analysis of Feature Attributions

Transfer learning approaches have been shown to significantly improve performance on downstream tasks. However, it is common for prior works to only report where transfer learning was beneficial, ignoring the significant trial-and-error required to find effective settings for transfer. Indeed, not all task combinations lead to performance benefits, and brute-force searching rapidly becomes computationally infeasible. Hence the question arises: can we predict whether transfer between two tasks will be beneficial without actually performing the experiment? In this paper, we leverage explainability techniques to effectively predict whether task pairs will be complementary, through comparison of neural network activation between single-task models. In this way, we can avoid grid-searches over all task and hyperparameter combinations, dramatically reducing the time needed to find effective task pairs. Our results show that, through this approach, it is possible to reduce training time by up to 83.5% at a cost of only 0.034 reduction in positive-class F1 on the TREC-IS 2020-A dataset.

Alexander J. Hepburn, Richard McCreadie
Establishing Strong Baselines For TripClick Health Retrieval

We present strong Transformer-based re-ranking and dense retrieval baselines for the recently released TripClick health ad-hoc retrieval collection. We improve the (originally too noisy) training data with a simple negative sampling policy. We achieve large gains over BM25 in the re-ranking task of TripClick, which were not achieved with the original baselines. Furthermore, we study the impact of different domain-specific pre-trained models on TripClick. Finally, we show that dense retrieval outperforms BM25 by considerable margins, even with simple training procedures.

Sebastian Hofstätter, Sophia Althammer, Mete Sertkan, Allan Hanbury
Less is Less: When are Snippets Insufficient for Human vs Machine Relevance Estimation?

Traditional information retrieval (IR) ranking models process the full text of documents. Newer models based on Transformers, however, would incur a high computational cost when processing long texts, so they typically use only snippets from the document instead. The model’s input based on a document’s URL, title, and snippet (UTS) is akin to the summaries that appear on a search engine results page (SERP) to help searchers decide which result to click. This raises questions about when such summaries are sufficient for relevance estimation by the ranking model or the human assessor, and whether humans and machines benefit from the document’s full text in similar ways. To answer these questions, we study human and neural model based relevance assessments on 12k query-document pairs sampled from Bing’s search logs. We compare changes in the relevance assessments between the case where only the document summaries are exposed to assessors and the case where the full text is also exposed, studying a range of query and document properties, e.g., query type, snippet length. Our findings show that the full text is beneficial for humans and a BERT model for similar query and document types, e.g., tail, long queries. A closer look, however, reveals that humans and machines respond to the additional input in very different ways. Adding the full text can also hurt the ranker’s performance, e.g., for navigational queries.

Gabriella Kazai, Bhaskar Mitra, Anlei Dong, Nick Craswell, Linjun Yang
Leveraging Transformer Self Attention Encoder for Crisis Event Detection in Short Texts

Analyzing content generated on social media has proven to be a powerful tool for early detection of crisis-related events. Such an analysis may allow for timely action, mitigating or even preventing altogether the effects of a crisis. However, the high noise levels in short texts present in microblogging platforms, combined with the limited publicly available datasets, have rendered the task difficult. Here, we propose deep learning models based on a transformer self-attention encoder, which is capable of detecting event-related parts in a text, while also minimizing potential noise levels. Our models’ efficacy is shown by experimenting with CrisisLexT26, achieving up to 81.6% F1-score and 92.7% AUC.

Pantelis Kyriakidis, Despoina Chatzakou, Theodora Tsikrika, Stefanos Vrochidis, Ioannis Kompatsiaris
What Drives Readership? An Online Study on User Interface Types and Popularity Bias Mitigation in News Article Recommendations

Personalized news recommender systems support readers in finding the right and relevant articles in online news platforms. In this paper, we discuss the introduction of personalized, content-based news recommendations on DiePresse, a popular Austrian online news platform, focusing on two specific aspects: (i) user interface type, and (ii) popularity bias mitigation. To this end, we conducted a two-week online study that started in October 2020, in which we analyzed the impact of recommendations on two user groups, i.e., anonymous and subscribed users, and three user interface types, i.e., on a desktop, mobile and tablet device. With respect to user interface types, we find that the probability of a recommendation being seen is the highest for desktop devices, while the probability of interacting with recommendations is the highest for mobile devices. With respect to popularity bias mitigation, we find that personalized, content-based news recommendations can lead to a more balanced distribution of news articles’ readership popularity in the case of anonymous users. Apart from that, we find that significant events (e.g., the COVID-19 lockdown announcement in Austria and the Vienna terror attack) influence the general consumption behavior of popular articles for both anonymous and subscribed users.

Emanuel Lacic, Leon Fadljevic, Franz Weissenboeck, Stefanie Lindstaedt, Dominik Kowald
GameOfThronesQA: Answer-Aware Question-Answer Pairs for TV Series

In this paper, we offer a corpus of question-answer pairs related to a TV series, generated from paragraph contexts. The data set, called GameofThronesQA V1.0, contains 5237 unique question-answer pairs from the Game Of Thrones TV series across the eight seasons. In particular, we provide a pipeline approach for answer-aware question generation, where the answers are extracted based on the named entities from the TV series. This is different from the traditional methods, which generate questions first and find the relevant answers later. Furthermore, we provide a comparative analysis of the generated corpus with benchmark datasets such as SQuAD, TriviaQA, WikiQA and TweetQA. A snapshot of the dataset is provided as an appendix for review purposes and will be released to the public later.

Aritra Kumar Lahiri, Qinmin Vivian Hu

Open Access

Leveraging Customer Reviews for E-commerce Query Generation

Customer reviews are an effective source of information about what people deem important in products (e.g. “strong zipper” for tents). These crowd-created descriptors not only highlight key product attributes, but can also complement seller-provided product descriptions. Motivated by this, we propose to leverage customer reviews to generate queries pertinent to target products in an e-commerce setting. While there has been work on automatic query generation, it often relied on proprietary user search data to generate query-document training pairs for learning supervised models. We take a different view and focus on leveraging reviews without training on search logs, making reproduction more viable by the public. Our method adopts an ensemble of the statistical properties of review terms and a zero-shot neural model trained on an adapted external corpus to synthesize queries. Compared to competitive baselines, we show that the generated queries based on our method both better align with actual customer queries and can benefit retrieval effectiveness.

Yen-Chieh Lien, Rongting Zhang, F. Maxwell Harper, Vanessa Murdock, Chia-Jung Lee
Question Rewriting? Assessing Its Importance for Conversational Question Answering

In conversational question answering, systems must correctly interpret the interconnected interactions and generate knowledgeable answers, which may require the retrieval of relevant information from a background repository. Recent approaches to this problem leverage neural language models, although different alternatives can be considered in terms of modules for (a) representing user questions in context, (b) retrieving the relevant background information, and (c) generating the answer. This work presents a conversational question answering system designed specifically for the Search-Oriented Conversational AI (SCAI) shared task, and reports on a detailed analysis of its question rewriting module. In particular, we considered different variations of the question rewriting module to evaluate the influence on the subsequent components, and performed a careful analysis of the results obtained with the best system configuration. Our system achieved the best performance in the shared task and our analysis emphasizes the importance of the conversation context representation for the overall system performance.

Gonçalo Raposo, Rui Ribeiro, Bruno Martins, Luísa Coheur
How Different are Pre-trained Transformers for Text Ranking?

In recent years, large pre-trained transformers have led to substantial gains in performance over traditional retrieval models and feedback approaches. However, these results are primarily based on the MS MARCO/TREC Deep Learning Track setup, which has very particular characteristics, and our understanding of why and how these models work better is fragmented at best. We analyze effective BERT-based cross-encoders versus traditional BM25 ranking for the passage retrieval task where the largest gains have been observed, and investigate two main questions. On the one hand, what is similar? To what extent does the neural ranker already encompass the capacity of traditional rankers? Is the gain in performance due to a better ranking of the same documents (prioritizing precision)? On the other hand, what is different? Can it effectively retrieve documents missed by traditional systems (prioritizing recall)? We discover substantial differences in the notion of relevance, identifying strengths and weaknesses of BERT that may inspire research for future improvement. Our results contribute to our understanding of (black-box) neural rankers relative to (well-understood) traditional rankers, and help us understand the particular experimental setting of MS MARCO-based test collections.

David Rau, Jaap Kamps
Comparing Intrinsic and Extrinsic Evaluation of Sensitivity Classification

With accelerating generation of digital content, it is often impractical at the point of creation to manually segregate sensitive information from information which can be shared. As a result, a great deal of useful content becomes inaccessible simply because it is intermixed with sensitive content. This paper compares traditional and neural techniques for detection of sensitive content, finding that using the two techniques together can yield improved results. Experiments with two test collections, one in which sensitivity is modeled as a topic and a second in which sensitivity is annotated directly, yield consistent improvements with an intrinsic (classification effectiveness) measure. Extrinsic evaluation is conducted by using a recently proposed learning to rank framework for sensitivity-aware ranked retrieval and a measure that rewards finding relevant documents but penalizes revealing sensitive documents.

Mahmoud F. Sayed, Nishanth Mallekav, Douglas W. Oard
Zero-Shot Recommendation as Language Modeling

Recommendation is the task of ranking items (e.g. movies or products) according to individual user needs. Current systems rely on collaborative filtering and content-based techniques, which both require structured training data. We propose a framework for recommendation with off-the-shelf pretrained language models (LM) that only used unstructured text corpora as training data. If a user u liked Matrix and Inception, we construct a textual prompt, e.g. "Movies like Matrix, Inception, <m>", to estimate the affinity between u and m with LM likelihood. We motivate our idea with a corpus analysis, evaluate several prompt structures, and compare LM-based recommendation with standard matrix factorization trained on different data regimes. The code for our experiments is publicly available.

Damien Sileo, Wout Vossen, Robbe Raymaekers
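The prompt-based scoring idea in the abstract above can be sketched in a few lines of Python. Only the prompt template "Movies like Matrix, Inception, <m>" comes from the abstract; the function names, the `log_likelihood` callable (which in practice would wrap a pretrained language model), and the toy scorer used below are all hypothetical stand-ins for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def build_prompt(liked: Sequence[str], candidate: str) -> str:
    """Fill the template from the abstract: 'Movies like A, B, <m>'."""
    return "Movies like " + ", ".join([*liked, candidate])


def rank_candidates(
    liked: Sequence[str],
    candidates: Sequence[str],
    log_likelihood: Callable[[str], float],
) -> List[Tuple[str, float]]:
    """Rank candidates by the score the language model gives each prompt.

    `log_likelihood` is an assumed interface: any function mapping a
    text to a (log-)likelihood under some language model.
    """
    scored = [(m, log_likelihood(build_prompt(liked, m))) for m in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_scorer(text: str) -> float:
    # Placeholder only, NOT a language model: shorter prompts score higher.
    return -float(len(text))


prompt = build_prompt(["Matrix", "Inception"], "Alien")
ranking = rank_candidates(["Matrix", "Inception"], ["Interstellar", "Alien"], toy_scorer)
print(prompt)
print(ranking)
```

Passing the scorer in as a parameter keeps the sketch independent of any particular model or library; a real setup would also need to decide how to normalise likelihoods across titles of different lengths, which the abstract does not specify.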
What Matters for Shoppers: Investigating Key Attributes for Online Product Comparison

Before making high-consideration purchase decisions, shoppers generally need to identify and evaluate products’ key differentiating features or attributes. Many customers, however, lack the knowledge required to do so for all product domains. In this work, we investigate and analyze alternatives for identifying important product attributes, which customers can then use to compare candidate products. We propose an unsupervised attribute-ranking approach, ReBARC, that combines both objective data from structured product catalogs and subjective information from unstructured customer reviews to suggest to the shopper the most important attributes to consider. Our detailed analysis of product attribute importance across various domains on a shopping website shows that ReBARC significantly outperforms prior efforts as judged by both automated and human evaluation metrics. We also analyze the correlation and overlap between key product attributes detected by ReBARC and those visible to customers during online product search.

Nikhita Vedula, Marcus Collins, Eugene Agichtein, Oleg Rokhlenko
Evaluating Simulated User Interaction and Search Behaviour

Simulating user sessions in a way that comes closer to the original user interactions is key to generating user data at any desired volume and variety such that A/B-testing in domain-specific search engines becomes scalable. In recent years, research on evaluating Information Retrieval (IR) systems has mainly focused on simulation as a means to improve user models and evaluation metrics about the performance of search engines using test collections and user studies. However, test collections contain no user interaction data and user studies are expensive to conduct. Thus there is a need to develop a methodology for evaluating simulated user sessions. In this paper, we propose evaluation metrics to assess the realism of simulated sessions and describe a pilot study to assess the capability of generating simulated search sequences representing an approximation of real behaviour. Our findings highlight the importance of investigating and utilising classification-based metrics besides the distribution-based ones in the evaluation process.

Saber Zerhoudi, Michael Granitzer, Christin Seifert, Joerg Schloetterer
Multilingual Topic Labelling of News Topics Using Ontological Mapping

The large volume of news produced daily makes topic modelling useful for analysing topical trends. A topic is usually represented by a ranked list of words but this can be difficult and time-consuming for humans to interpret. Therefore, various methods have been proposed to generate labels that capture the semantic content of a topic. However, there has been no work so far on coming up with multilingual labels which can be useful for exploring multilingual news collections. We propose an ontological mapping method that maps topics to concepts in a language-agnostic news ontology. We test our method on Finnish and English topics and show that it performs on par with state-of-the-art label generation methods, is able to produce multilingual labels, and can be applied to topics from languages that have not been seen during training without any modifications.

Elaine Zosa, Lidia Pivovarova, Michele Boggia, Sardana Ivanova

Demonstration Papers

ranx: A Blazing-Fast Python Library for Ranking Evaluation and Comparison

This paper presents ranx, a Python evaluation library for Information Retrieval built on top of Numba. ranx provides a user-friendly interface to the most common ranking evaluation metrics, such as MAP, MRR, and NDCG. Moreover, it offers a convenient way of managing the evaluation results, comparing different runs, performing statistical tests between them, and exporting LaTeX tables ready to be used in scientific publications, all in a few lines of code. The efficiency brought by Numba, a just-in-time compiler for Python code, makes the adoption of ranx convenient even for industrial applications.
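For readers unfamiliar with the metrics named above, the following is a minimal pure-Python sketch of reciprocal rank, average precision, and NDCG@k for a single query (MRR and MAP are the means of the first two over all queries). It is only an illustration of the definitions: it does not use ranx or Numba, and the function names and toy document identifiers are assumptions made for this example, not part of the library's interface.

```python
import math


def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1/rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def average_precision(ranked_ids, relevant_ids):
    """Average of precision values at the ranks where relevant documents appear."""
    if not relevant_ids:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids)


def ndcg_at_k(ranked_ids, gains, k):
    """NDCG@k with linear gains; `gains` maps doc id to graded relevance."""
    dcg = sum(
        gains.get(doc_id, 0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical toy data: one query, three retrieved documents.
ranking = ["d3", "d1", "d2"]
graded = {"d1": 2, "d2": 1}
rr = reciprocal_rank(ranking, set(graded))
ap = average_precision(ranking, set(graded))
ndcg = ndcg_at_k(ranking, graded, k=3)
```

Averaging these per-query values over a set of queries gives the system-level scores that an evaluation library reports.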

Elias Bassani
DuoSearch: A Novel Search Engine for Bulgarian Historical Documents

Search in collections of digitised historical documents is hindered by a two-pronged problem: orthographic variety and optical character recognition (OCR) mistakes. We present a new search engine for historical documents, DuoSearch, which uses ElasticSearch and machine learning methods based on deep neural networks to offer a solution to this problem. It was tested on a collection of historical newspapers in Bulgarian from the mid-19th to the mid-20th century. The system provides an interactive and intuitive interface for the end-users allowing them to enter search terms in modern Bulgarian and search across historical spellings. This is the first solution facilitating the use of digitised historical documents in Bulgarian.

Angel Beshirov, Suzan Hadzhieva, Ivan Koychev, Milena Dobreva
Tweet2Story: A Web App to Extract Narratives from Twitter

Social media platforms are used to discuss current events with very complex narratives that become difficult to understand. In this work, we introduce Tweet2Story, a web app to automatically extract narratives from small texts such as tweets and describe them through annotations. By doing this, we aim to mitigate the difficulties in creating narratives and take a step towards deeply understanding the actors and their corresponding relations found in a text. We build the web app to be modular and easy to use, which allows it to easily incorporate new techniques as they are developed.

Vasco Campos, Ricardo Campos, Pedro Mota, Alípio Jorge
Patapsco: A Python Framework for Cross-Language Information Retrieval Experiments

While there are high-quality software frameworks for information retrieval experimentation, they do not explicitly support cross-language information retrieval (CLIR). To fill this gap, we have created Patapsco, a Python CLIR framework. This framework specifically addresses the complexity that comes with running experiments in multiple languages. Patapsco is designed to be extensible to many language pairs, to be scalable to large document collections, and to support reproducible experiments driven by a configuration file. We include Patapsco results on standard CLIR collections using multiple settings.

Cash Costello, Eugene Yang, Dawn Lawrie, James Mayfield
City of Disguise: A Query Obfuscation Game on the ClueWeb

We present City of Disguise, a retrieval game that tests how well searchers are able to reformulate some sensitive query in a ‘Taboo’-style setup but still retrieve good results. Given one of 200 sensitive information needs and a relevant example document, the players use a special ClueWeb12 search interface that also hints at potentially useful search terms. For an obfuscated query, the system assigns points depending on the result quality and the formulated query. In a pilot study with 72 players, we observed that they find obfuscations to retrieve relevant documents but often only when they relied on the suggested terms.

Maik Fröbe, Nicola Lea Libera, Matthias Hagen
DocTAG: A Customizable Annotation Tool for Ground Truth Creation

Information Retrieval (IR) is a discipline deeply rooted in evaluation that in many cases relies on annotated data as ground truth. Manual annotation is a demanding and time-consuming task, involving human intervention for topic-document assessment. To ease and possibly speed up the work of the assessors, it is desirable to have easy-to-use, collaborative and flexible annotation tools. Despite their importance, in the IR domain no open-source fully customizable annotation tool has been proposed for topic-document annotation and assessment so far. In this demo paper, we present DocTAG, a portable and customizable annotation tool for ground-truth creation in a web-based collaborative setting.

Fabio Giachelle, Ornella Irrera, Gianmaria Silvello
ALWars: Combat-Based Evaluation of Active Learning Strategies

The demand for annotated datasets for supervised machine learning (ML) projects is growing rapidly. Annotating a dataset often requires domain experts and is a time-consuming and costly process. A premier method to reduce this overhead drastically is Active Learning (AL). Despite a tremendous potential for annotation cost savings, AL is still not used universally in ML projects. The number of available AL strategies has risen significantly during the past years, leading to an increased demand for thorough evaluations of AL strategies. Existing evaluations show in many cases contradicting results, without clear superior strategies. To help researchers in taming the AL zoo we present ALWars: an interactive system with a rich set of features to compare AL strategies in a novel replay view mode of all AL episodes with many available visualizations and metrics. Under the hood we support a rich variety of AL strategies by supporting the API of the powerful AL framework ALiPy [21], amounting to over 25 AL strategies out-of-the-box.

Julius Gonsior, Jakob Krude, Janik Schönfelder, Maik Thiele, Wolfgang Lehner
INForex: Interactive News Digest for Forex Investors

As foreign exchange (Forex) markets reflect real-world events, locally or globally, financial news is often leveraged to predict Forex trends. In this demonstration, we propose INForex, an interactive web-based system that displays a Forex plot alongside related financial news. To the best of our knowledge, this is the first system to successfully align the presentation of two types of time-series data (Forex data and textual news data) in a unified and time-aware manner, as well as the first Forex-related online system leveraging deep learning techniques. The system can be of great help in revealing valuable insights and relations between the two types of data and is thus valuable for decision making not only for professional financial analysts or traders but also for common investors. The system is available online at , and the introduction video is at .

Chih-Hen Lee, Yi-Shyuan Chiang, Chuan-Ju Wang
Streamlining Evaluation with ir-measures

We present ir-measures, a new tool that makes it convenient to calculate a diverse set of evaluation measures used in information retrieval. Rather than implementing its own measure calculations, ir-measures provides a common interface to a handful of evaluation tools. The necessary tools are automatically invoked (potentially multiple times) to calculate all the desired metrics, simplifying the evaluation process for the user. The tool also makes it easier for researchers to use recently-proposed measures (such as those from the C/W/L framework) alongside traditional measures, potentially encouraging their adoption.

Sean MacAvaney, Craig Macdonald, Iadh Ounis
Turning News Texts into Business Sentiment

This paper describes a demonstration system for our project on news-based business sentiment nowcast. Compared to traditional business sentiment indices which rely on a time-consuming survey and are announced only monthly or quarterly, our system takes advantage of news articles continually published on the Web and updates the estimate of business sentiment as the latest news comes in. Additionally, it provides functionality to search any keyword and temporally visualize how much it influenced business sentiment, which can be a useful analytical tool for policymakers and economists. The code and demo system are available at .

Kazuhiro Seki
SolutionTailor: Scientific Paper Recommendation Based on Fine-Grained Abstract Analysis

Locating specific scientific content from a large corpus is crucial to researchers. This paper presents SolutionTailor (The demo video is available at: ), a novel system that recommends papers that provide diverse solutions for a specific research objective. The proposed system does not require any prior information from a user; it only requires the user to specify the target research field and enter a research abstract representing the user’s interests. Our approach uses a neural language model to divide abstract sentences into “Background/Objective” and “Methodologies” and defines a new similarity measure between papers. Our current experiments indicate that the proposed system can recommend literature in a specific objective beyond a query paper’s citations compared with a baseline system.

Tetsuya Takahashi, Marie Katsurai
Leaf: Multiple-Choice Question Generation

Testing with quiz questions has proven to be an effective way to assess and improve the educational process. However, manually creating quizzes is tedious and time-consuming. To address this challenge, we present Leaf, a system for generating multiple-choice questions from factual text. In addition to being very well suited for the classroom, Leaf could also be used in an industrial setting, e.g., to facilitate onboarding and knowledge sharing, or as a component of chatbots, question answering systems, or Massive Open Online Courses (MOOCs). The code and the demo are available on GitHub ( ).

Kristiyan Vachev, Momchil Hardalov, Georgi Karadzhov, Georgi Georgiev, Ivan Koychev, Preslav Nakov

CLEF 2022 Lab Descriptions

Overview of PAN 2022: Authorship Verification, Profiling Irony and Stereotype Spreaders, Style Change Detection, and Trigger Detection
Extended Abstract

The paper gives a brief overview of the four shared tasks to be organized at the PAN 2022 lab on digital text forensics and stylometry hosted at the CLEF 2022 conference. The tasks include authorship verification across discourse types, multi-author writing style analysis, author profiling, and content profiling. Some of the tasks continue and advance past editions (authorship verification and multi-author analysis) and some are new (profiling irony and stereotype spreaders and trigger detection). The general goal of the PAN shared tasks is to advance the state of the art in text forensics and stylometry while ensuring objective evaluation on newly developed benchmark datasets.

Janek Bevendorff, Berta Chulvi, Elisabetta Fersini, Annina Heini, Mike Kestemont, Krzysztof Kredens, Maximilian Mayerl, Reyner Ortega-Bueno, Piotr Pęzik, Martin Potthast, Francisco Rangel, Paolo Rosso, Efstathios Stamatatos, Benno Stein, Matti Wiegmann, Magdalena Wolska, Eva Zangerle
Overview of Touché 2022: Argument Retrieval
Extended Abstract

The goal of the Touché lab on argument retrieval is to foster and support the development of technologies for argument mining and argument analysis. In the third edition of Touché, we organize three shared tasks: (a) argument retrieval for controversial topics, where participants retrieve a gist of arguments from a collection of online debates, (b) argument retrieval for comparative questions, where participants retrieve argumentative passages from a generic web crawl, and (c) image retrieval for arguments, where participants retrieve images from a focused web crawl that show support or opposition to some stance. In this paper, we briefly summarize the results of two years of organizing Touché and describe the planned setup for the third edition at CLEF 2022.

Alexander Bondarenko, Maik Fröbe, Johannes Kiesel, Shahbaz Syed, Timon Gurcke, Meriem Beloucif, Alexander Panchenko, Chris Biemann, Benno Stein, Henning Wachsmuth, Martin Potthast, Matthias Hagen
Introducing the HIPE 2022 Shared Task: Named Entity Recognition and Linking in Multilingual Historical Documents

We present the HIPE-2022 shared task on named entity processing in multilingual historical documents. Following the success of the first CLEF-HIPE-2020 evaluation lab, this edition confronts systems with the challenges of dealing with more languages, learning domain-specific entities, and adapting to diverse annotation tag sets. HIPE-2022 is part of the ongoing efforts of the natural language processing and digital humanities communities to adapt and develop appropriate technologies to efficiently retrieve and explore information from historical texts. On such material, however, named entity processing techniques face the challenges of domain heterogeneity, input noisiness, dynamics of language, and lack of resources. In this context, the main objective of the evaluation lab is to gain new insights into the transferability of named entity processing approaches across languages, time periods, document types, and annotation tag sets.

Maud Ehrmann, Matteo Romanello, Antoine Doucet, Simon Clematide
CLEF Workshop JOKER: Automatic Wordplay and Humour Translation

Humour remains one of the most difficult aspects of intercultural communication: understanding humour often requires understanding implicit cultural references and/or double meanings, and this raises the question of its (un)translatability. Wordplay is a common source of humour due to its attention-getting and subversive character. The translation of humour and wordplay is therefore in high demand. Modern translation depends heavily on technological aids, yet few works have treated the automation of humour and wordplay translation, or the creation of humour corpora. The goal of the JOKER workshop is to bring together translators and computer scientists to work on an evaluation framework for wordplay, including data and metric development, and to foster work on automatic methods for wordplay translation. We propose three pilot tasks: (1) classify and explain instances of wordplay, (2) translate single words containing wordplay, and (3) translate entire phrases containing wordplay.

Liana Ermakova, Tristan Miller, Orlane Puchalski, Fabio Regattin, Élise Mathurin, Sílvia Araújo, Anne-Gwenn Bosser, Claudine Borg, Monika Bokiniec, Gaelle Le Corre, Benoît Jeanjean, Radia Hannachi, Ġorġ Mallia, Gordan Matas, Mohamed Saki
Automatic Simplification of Scientific Texts: SimpleText Lab at CLEF-2022

The Web and social media have become the main source of information for citizens, with the risk that users rely on shallow information in sources prioritizing commercial or political incentives rather than the correctness and informational value. Non-experts tend to avoid scientific literature due to its complex language or their lack of prior background knowledge. Text simplification promises to remove some of these barriers. The CLEF 2022 SimpleText track addresses the challenges of text simplification approaches in the context of promoting scientific information access, by providing appropriate data and benchmarks, and creating a community of NLP and IR researchers working together to resolve one of the greatest challenges of today. The track will use a corpus of scientific literature abstracts and popular science requests. It features three tasks. First, content selection (what is in, or out?) challenges systems to select passages to include in a simplified summary in response to a query. Second, complexity spotting (what is unclear?) given a passage and a query, aims to rank terms/concepts that are required to be explained for understanding this passage (definitions, context, applications). Third, text simplification (rewrite this!) given a query, asks to simplify passages from scientific abstracts while preserving the main content.

Liana Ermakova, Patrice Bellot, Jaap Kamps, Diana Nurbakova, Irina Ovchinnikova, Eric SanJuan, Elise Mathurin, Sílvia Araújo, Radia Hannachi, Stéphane Huet, Nicolas Poinsu
LeQua@CLEF2022: Learning to Quantify

LeQua 2022 is a new lab for the evaluation of methods for “learning to quantify” in textual datasets, i.e., for training predictors of the relative frequencies of the classes of interest in sets of unlabelled textual documents. While these predictions could be easily achieved by first classifying all documents via a text classifier and then counting the numbers of documents assigned to the classes, a growing body of literature has shown this approach to be suboptimal, and has proposed better methods. The goal of this lab is to provide a setting for the comparative evaluation of methods for learning to quantify, both in the binary setting and in the single-label multiclass setting. For each such setting we provide data either in ready-made vector form or in raw document form.
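The classify-and-count baseline described above, and why it can be biased, can be sketched in a few lines of Python. The correction shown, often called adjusted classify and count, is one well-known alternative from the quantification literature and is included only as an illustration; it is not stated here to be among the methods evaluated in the lab. The function names and the toy numbers (a classifier with a true positive rate of 0.8 and a false positive rate of 0.1) are assumptions made for this example.

```python
def classify_and_count(predictions):
    """Estimate positive-class prevalence as the fraction of positive predictions."""
    return sum(predictions) / len(predictions)


def adjusted_classify_and_count(predictions, tpr, fpr):
    """Correct the classify-and-count estimate using the classifier's error rates.

    `tpr` and `fpr` are assumed to have been estimated on held-out labelled data.
    The result is clipped to the valid prevalence range [0, 1].
    """
    cc = classify_and_count(predictions)
    if tpr == fpr:  # the classifier carries no signal; no correction is possible
        return cc
    return min(1.0, max(0.0, (cc - fpr) / (tpr - fpr)))


# Hypothetical scenario: 1000 documents, true prevalence 0.2.
# With tpr=0.8 and fpr=0.1 the classifier flags 200*0.8 + 800*0.1 = 240 documents.
predictions = [1] * 240 + [0] * 760
cc_estimate = classify_and_count(predictions)
acc_estimate = adjusted_classify_and_count(predictions, tpr=0.8, fpr=0.1)
```

In this toy scenario plain counting overestimates the prevalence (0.24 instead of 0.2), while the adjusted estimate recovers the true value because the assumed error rates match the classifier's actual behaviour.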

Andrea Esuli, Alejandro Moreo, Fabrizio Sebastiani
ImageCLEF 2022: Multimedia Retrieval in Medical, Nature, Fusion, and Internet Applications

ImageCLEF has been part of the Conference and Labs of the Evaluation Forum (CLEF) since 2003. CLEF 2022 will take place in Bologna, Italy. ImageCLEF is an ongoing evaluation initiative which promotes the evaluation of technologies for annotation, indexing, and retrieval of visual data with the aim of providing information access to large collections of images in various usage scenarios and domains. In its 20th edition, ImageCLEF will have four main tasks: (i) a Medical task addressing concept annotation, caption prediction, and tuberculosis detection; (ii) a Coral task addressing the annotation and localisation of substrates in coral reef images; (iii) an Aware task addressing the prediction of real-life consequences of online photo sharing; and (iv) a new Fusion task addressing late fusion techniques based on the expertise of the pool of classifiers. In 2021, over 100 research groups registered at ImageCLEF with 42 groups submitting more than 250 runs. These numbers show that, despite the COVID-19 pandemic, there is strong interest in the evaluation campaign.

Alba G. Seco de Herrera, Bogdan Ionescu, Henning Müller, Renaud Péteri, Asma Ben Abacha, Christoph M. Friedrich, Johannes Rückert, Louise Bloch, Raphael Brüngel, Ahmad Idrissi-Yaghir, Henning Schäfer, Serge Kozlovski, Yashin Dicente Cid, Vassili Kovalev, Jon Chamberlain, Adrian Clark, Antonio Campello, Hugo Schindler, Jérôme Deshayes, Adrian Popescu, Liviu-Daniel Ştefan, Mihai Gabriel Constantin, Mihai Dogariu
LifeCLEF 2022 Teaser: An Evaluation of Machine-Learning Based Species Identification and Species Distribution Prediction

Building accurate knowledge of the identity, the geographic distribution and the evolution of species is essential for the sustainable development of humanity, as well as for biodiversity conservation. However, the difficulty of identifying plants, animals and fungi is hindering the aggregation of new data and knowledge. Identifying and naming living organisms is almost impossible for the general public and is often difficult even for professionals and naturalists. Bridging this gap is a key step towards enabling effective biodiversity monitoring systems. The LifeCLEF campaign, presented in this paper, has been promoting and evaluating advances in this domain since 2011. The 2022 edition proposes five data-oriented challenges related to the identification and prediction of biodiversity: (i) PlantCLEF: very large-scale plant identification, (ii) BirdCLEF: bird species recognition in audio soundscapes, (iii) GeoLifeCLEF: remote sensing based prediction of species, (iv) SnakeCLEF: Snake Species Identification in Medically Important scenarios, and (v) FungiCLEF: Fungi recognition from images and metadata.

Alexis Joly, Hervé Goëau, Stefan Kahl, Lukáš Picek, Titouan Lorieul, Elijah Cole, Benjamin Deneu, Maximilien Servajean, Andrew Durso, Isabelle Bolon, Hervé Glotin, Robert Planqué, Willem-Pier Vellinga, Holger Klinck, Tom Denton, Ivan Eggel, Pierre Bonnet, Henning Müller, Milan Šulc
The ChEMU 2022 Evaluation Campaign: Information Extraction in Chemical Patents

The discovery of new chemical compounds is a key driver of the chemistry and pharmaceutical industries, and many other industrial sectors. Patents serve as a critical source of information about new chemical compounds. The ChEMU (Cheminformatics Elsevier Melbourne Universities) lab addresses information extraction over chemical patents and aims to advance the state of the art on this topic. ChEMU lab 2022, as part of the 13th Conference and Labs of the Evaluation Forum (CLEF-2022), will be the third ChEMU lab. The ChEMU 2020 lab provided two information extraction tasks, named entity recognition and event extraction. The ChEMU 2021 lab introduced two more tasks, chemical reaction reference resolution and anaphora resolution. For ChEMU 2022, we plan to re-run all the four tasks with a new task on semantic classification for tables as the fifth one. In this paper, we introduce ChEMU 2022, including its motivation, goals, tasks, resources, and evaluation framework.

Yuan Li, Biaoyan Fang, Jiayuan He, Hiyori Yoshikawa, Saber A. Akhondi, Christian Druckenbrodt, Camilo Thorne, Zenan Zhai, Zubair Afzal, Trevor Cohn, Timothy Baldwin, Karin Verspoor
Advancing Math-Aware Search: The ARQMath-3 Lab at CLEF 2022

ARQMath-3 is the third edition of the Answer Retrieval for Questions on Math lab at CLEF. In addition to the two main tasks from previous years, an interesting new pilot task will also be run. The main tasks include: (1) Answer Retrieval, returning posted answers to mathematical questions taken from a community question answering site (Math Stack Exchange (MSE)), and (2) Formula Retrieval, returning formulas and their associated question/answer posts in response to a query formula taken from a question. The previous ARQMath labs created a large new test collection, new evaluation protocols for formula retrieval, and established baselines for both main tasks. This year we will pilot a new open domain question answering task as Task 3, where questions from Task 1 may be answered using passages from documents from outside of the ARQMath collection, and/or that are generated automatically.

Behrooz Mansouri, Anurag Agarwal, Douglas W. Oard, Richard Zanibbi
The CLEF-2022 CheckThat! Lab on Fighting the COVID-19 Infodemic and Fake News Detection

The fifth edition of the CheckThat! Lab is held as part of the 2022 Conference and Labs of the Evaluation Forum (CLEF). The lab evaluates technology supporting various factuality tasks in seven languages: Arabic, Bulgarian, Dutch, English, German, Spanish, and Turkish. Task 1 focuses on disinformation related to the ongoing COVID-19 infodemic and politics, and asks to predict whether a tweet is worth fact-checking, contains a verifiable factual claim, is harmful to society, or is of interest to policy makers and why. Task 2 asks to retrieve claims that have been previously fact-checked and that could be useful to verify the claim in a tweet. Task 3 is to predict the veracity of a news article. Tasks 1 and 3 are classification problems, while Task 2 is a ranking one.

Preslav Nakov, Alberto Barrón-Cedeño, Giovanni Da San Martino, Firoj Alam, Julia Maria Struß, Thomas Mandl, Rubén Míguez, Tommaso Caselli, Mucahid Kutlu, Wajdi Zaghouani, Chengkai Li, Shaden Shaar, Gautam Kishore Shahi, Hamdy Mubarak, Alex Nikolov, Nikolay Babulkov, Yavuz Selim Kartal, Javier Beltrán
BioASQ at CLEF2022: The Tenth Edition of the Large-scale Biomedical Semantic Indexing and Question Answering Challenge

The tenth version of the BioASQ Challenge will be held as an evaluation Lab within CLEF2022. The motivation driving BioASQ is the continuous advancement of approaches and tools to meet the need for efficient and precise access to the ever-increasing biomedical knowledge. In this direction, a series of annual challenges are organized, in the fields of large-scale biomedical semantic indexing and question answering, formulating specific shared-tasks in alignment with the real needs of the biomedical experts. These shared-tasks and their accompanying benchmark datasets provide a unique common testbed for investigating and comparing new approaches developed by distinct teams around the world for identifying and accessing biomedical information. In particular, the BioASQ Challenge consists of shared-tasks in two complementary directions: (a) the automated indexing of large volumes of unlabelled biomedical documents, primarily scientific publications, with biomedical concepts, (b) the automated retrieval of relevant material for biomedical questions and the generation of comprehensible answers. In the first direction on semantic indexing, two shared-tasks are organized for English and Spanish content respectively, the latter considering human-interpretable evidence extraction (NER and concept linking) as well. In the second direction, two shared-tasks are organized as well, one for biomedical question answering and one particularly focusing on the developing issue of COVID-19. As BioASQ rewards the approaches that manage to outperform the state of the art in these shared-tasks, the research frontier is pushed towards ensuring that the valuable biomedical knowledge will be identifiable and accessible by the biomedical experts.

Anastasios Nentidis, Anastasia Krithara, Georgios Paliouras, Luis Gasco, Martin Krallinger
eRisk 2022: Pathological Gambling, Depression, and Eating Disorder Challenges

In 2017, we launched eRisk as a CLEF Lab to encourage research on early risk detection on the Internet. eRisk 2021 was the fifth edition of the Lab. Since the launch, we have created a large number of collections for early detection addressing different problems (e.g., depression, anorexia or self-harm). This paper outlines the work that we have done to date (2017, 2018, 2019, 2020, and 2021), discusses key lessons learned in previous editions, and presents our plans for eRisk 2022, which introduces a new challenge to assess the severity of eating disorders.

Javier Parapar, Patricia Martín-Rodilla, David E. Losada, Fabio Crestani

Doctoral Consortium

Continually Adaptive Neural Retrieval Across the Legal, Patent and Health Domain

In the past years neural retrieval approaches using contextualized language models have driven advancements in information retrieval (IR) and demonstrated great effectiveness gains for retrieval, primarily in the web domain [1, 10, 33]. This is enabled by the availability of large-scale, open-domain labelled collections [6].

Sophia Althammer
Understanding and Learning from User Behavior for Recommendation in Multi-channel Retail

Online shopping is gaining more and more popularity every day. Traditional retailers with physical stores adjust to this trend by allowing their customers to shop online as well as offline, i.e., in-store. Increasingly, customers can browse and purchase products across multiple shopping channels. Understanding how customer behavior relates to the availability of multiple shopping channels is an important prerequisite for many downstream machine learning tasks, such as recommendation and purchase prediction. However, previous work in this domain is limited to analyzing single-channel behavior only. In this project, we first provide a better understanding of the similarities and differences between online and offline behavior. We further study the next basket recommendation task in a multi-channel context, where the goal is to build recommendation algorithms that can leverage the rich cross-channel user behavior data in order to enhance the customer experience.

Mozhdeh Ariannezhad
An Entity-Oriented Approach for Answering Topical Information Needs

In this dissertation, we adopt an entity-oriented approach to identify relevant materials for answering a topical keyword query such as “Cholera”. To this end, we study the interplay between text and entities by addressing three related prediction problems: (1) Identify knowledge base entities that are relevant for the query, (2) Understand an entity’s meaning in the context of the query, and (3) Identify text passages that elaborate the connection between the query and an entity. Through this dissertation, we aim to study some overarching questions in entity-oriented research such as the importance of query-specific entity descriptions, and the importance of entity salience and context-dependent entity similarity for modeling the query-specific context of an entity.

Shubham Chatterjee
Cognitive Information Retrieval

Several existing search personalisation techniques tailor the returned results by using information about the user that often contains demographic data, query logs, or history of visited pages. These techniques still lack awareness about the user’s cognitive aspects like beliefs, knowledge, and search goals. They might return, for example, results that answer the query and fit the user’s interests but contain information that the user already knows. Considering the user’s cognitive components in the domain of Information Retrieval (IR) is still recognized as one of the “major challenges” by the IR community. This paper overviews my recent doctoral work on the exploration of the approaches to represent the user’s cognitive aspects (especially knowledge and search goals) and on the investigation of incorporating them into information retrieval systems. Knowing that those aspects are subject to constant change, the thesis also aims to consider this dynamic characteristic. The research’s objective is to better understand the knowledge acquisition process and the goal achievement task in an IR context. That will help search users find the information they seek.

Dima El Zein
Graph-Enhanced Document Representation for Court Case Retrieval

To reach informed decisions, legal domain experts in Civil Law systems need to have knowledge not only about legal paragraphs, but also about related court cases. However, court case retrieval is challenging due to the domain-specific language and large document sizes. While modern transformer models such as BERT create dense text representations suitable for efficient retrieval in many domains, without domain-specific adaptations they are outperformed by established lexical retrieval models in the legal domain. Although citations of court cases and codified law play an important role in the domain, there has been little research on utilizing a combination of text representations and citation graph data for court case retrieval. In other domains, attempts have been made to combine these two with methods such as concatenating graph embeddings to text embeddings. In the PhD research project, domain-specific challenges of legal retrieval systems will be tackled. To help with this task, a dataset of Austrian court cases, their document labels as well as their citations of other court cases and codified law on a document and paragraph level will be created and made public. Experiments in this project will include various ways of enhancing transformer-based text representation methods with citation graph data, such as graph-based transformer re-training or graph embeddings.
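The idea of concatenating graph embeddings to text embeddings, mentioned above as an approach from other domains, can be illustrated with a minimal Python sketch. It assumes that each document already has a text vector and a citation-graph vector from some upstream models; the helper names, the `graph_weight` parameter, and the toy two-dimensional vectors are hypothetical and are not taken from the project.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so neither modality dominates by magnitude."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def combine(text_vec, graph_vec, graph_weight=1.0):
    """Concatenate a normalized text embedding with a weighted graph embedding."""
    weighted_graph = [graph_weight * x for x in l2_normalize(graph_vec)]
    return l2_normalize(text_vec) + weighted_graph


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


# Hypothetical case: two court cases with identical text vectors
# but unrelated positions in the citation graph.
case_a = combine([1.0, 0.0], [0.0, 1.0])
case_b = combine([1.0, 0.0], [1.0, 0.0])
similarity = cosine(case_a, case_b)
```

With equal weighting, the two cases are no longer treated as identical: the text part agrees fully while the graph part contributes nothing, so the combined similarity falls between the two signals.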

Tobias Fink
Relevance Models Based on the Knowledge Gap

Search systems are increasingly used for gaining knowledge through accessing relevant resources from a vast volume of content. However, search systems provide only limited support to users in knowledge acquisition contexts. Specifically, they do not fully consider the knowledge gap which we define as the gap existing between what the user knows and what the user intends to learn. The effects of considering the knowledge gap for knowledge acquisition tasks remain largely unexplored in search systems. We propose to model and incorporate the knowledge gap into search algorithms. We plan to explore to what extent the incorporation of the knowledge gap leads to an improvement in the performance of search systems in knowledge acquisition tasks. Furthermore, we aim to investigate and design a metric for the evaluation of the search systems’ performance in the context of knowledge acquisition tasks.

Yasin Ghafourian
Evidence-Based Early Rumor Verification in Social Media

A plethora of studies has been conducted in the past years on rumor verification in micro-blogging platforms. However, most of them exploit the propagation network, i.e., replies and retweets, to verify rumors. We argue that first, subjective evidence from the propagation network is insufficient for users to understand and reason about the veracity of the rumor. Second, the full propagation network of the rumor can be sufficient, but for early detection, when only part of the network is used, inadequate context for verification can be a major issue. As time is critical for early rumor verification, and sufficient evidence may not be available at the posting time, the objective of this thesis is to verify any tweet as soon as it is posted. Specifically, we are interested in exploiting evidence from Twitter as we believe it will be beneficial to 1) improve the veracity prediction, 2) improve the user experience by providing convincing evidence, and 3) enable early verification, as waiting for subjective evidence may not be needed. We first aim to retrieve authority Twitter accounts that may help verify the rumor. Second, we aim to retrieve relevant tweets, i.e., tweets stating the same rumor, or tweets stating evidence that contradicts or supports the rumor, along with their propagation networks. Given the retrieved evidence from multiple sources, namely evidence from authority accounts, evidence from relevant tweets, and their propagation networks, we intend to learn an effective model for rumor verification and show rationales behind the decisions it makes.

Fatima Haouari
Multimodal Retrieval in E-Commerce
From Categories to Images, Text, and Back

E-commerce provides rich multimodal data that is barely leveraged in practice. The majority of e-commerce search mechanisms are uni-modal, which makes them cumbersome, and they often fail to grasp the customer’s needs. For the Ph.D. we conduct research aimed at combining information across multiple modalities to improve search and recommendations in e-commerce. The research plans are organized along two principal lines. First, motivated by the mismatch between a textual and a visual representation of a given product category, we propose the task of category-to-image retrieval, i.e., the problem of retrieval of an image of a category expressed as a textual query. We also propose a model for the task. The model leverages information from multiple modalities to create product representations. We explore how adding information from multiple modalities impacts the model’s performance and compare our approach with state-of-the-art models. Second, we consider fine-grained text-image retrieval in e-commerce. We start off by considering the task in the context of reproducibility. Moreover, we address the problem of attribute granularity in e-commerce. We select two state-of-the-art (SOTA) models with distinct architectures, a CNN-RNN model and a Transformer-based model, and consider their performance on various e-commerce categories as well as on object-centric data from the general domain. Next, based on the lessons learned from the reproducibility study, we propose a model for fine-grained text-image retrieval.

Mariya Hendriksen
Medical Entity Linking in Laypersons’ Language

Due to the vast amount of health-related data on the Internet, a trend toward digital health literacy is emerging among laypersons. We hypothesize that providing trustworthy explanations of informal medical terms in social media can improve information quality. Entity linking (EL) is the task of associating terms with concepts (entities) in a knowledge base. The challenge with EL in lay medical texts is that the source texts are often written in loose and informal language. We propose an end-to-end entity linking approach that involves identifying informal medical terms, normalizing medical concepts according to SNOMED-CT, and linking entities to Wikipedia to provide explanations for laypersons.

Annisa Maulida Ningtyas
A Topical Approach to Capturing Customer Insight Dynamics in Social Media

With the emergence of the internet, customers have become far more than mere consumers: they are now opinion makers. As such, they share their experience of goods, services, brands, and retailers. People interested in a certain product often reach for these opinions on all kinds of channels with different structures, from forums to microblogging platforms. On these platforms, topics about almost everything proliferate, and can become viral for a certain time before they begin stagnating or dying out. The amount of data is massive, and the data acquisition processes frequently involve web scraping. Even if basic parsing, cleaning, and standardization exist, the variability of noise creates the need for ad-hoc tools. All these elements make it difficult to extract customer insights from the internet. To address these issues, I propose to devise time-dynamic, nonparametric neural-based topic models that take topic, document and word linking into account. I also want to extract opinions in accordance with multilingual contexts, all the while making my tools relevant for pretreatment improvement. Last but not least, I want to devise a proper way of evaluating models so as to assess all their aspects.

Miguel Palencia-Olivar
Towards Explainable Search in Legal Text

Consider a non-AI-expert user, such as a lawyer, using an AI-driven text retrieval (IR) system. A user is not always sure why a certain document is at the bottom of the ranking list although it seems quite relevant and is expected at the top. Is it due to the proportion of matching terms, semantically related topics, or unknown reasons? This can be confusing and can lead to a lack of trust and transparency in AI systems. Explainable AI (XAI) is currently a vibrant research topic which is being investigated from various perspectives in the IR and ML communities. While a major focus of the ML community is to explain a classification decision, a key focus in IR is to explain the notion of similarity that is used to estimate relevance rankings. Relevance in IR is a complex entity based on various notions of similarity (e.g. semantic, syntactic, contextual) in text. This is often subjective, and ranking is an estimation of the relevance. In this work, we attempt to explore the notion of similarity in text with regard to aspects such as semantics and law cross-references, and arrive at interpretable facets of evidence which can be used to explain rankings. The idea is to explain to non-AI experts why a certain document is relevant to a query in the legal domain. We present our preliminary findings, outline future work and discuss challenges.

Sayantan Polley
End to End Neural Retrieval for Patent Prior Art Search

This research will examine neural retrieval methods for patent prior art search. One research direction is the federated search approach, where we proposed two new methods that solve the results merging problem in federated patent search using machine learning models. The methods are based on a centralized index containing samples of documents from all potential resources, and they implement machine learning models to predict comparable scores for the documents retrieved by different resources. The other research direction is the adaptation of end-to-end neural retrieval approaches to the patent characteristics such that the retrieval effectiveness will be increased. Off-the-shelf neural methods like BERT have lower effectiveness for patent prior art search. So, we adapt the BERT model to patent characteristics in order to increase retrieval performance. We propose a new gate-based document retrieval method and examine it in patent prior art search. The method combines a first-stage retrieval method using BM25 and a re-ranking approach where the BERT model is used as a gating function that operates on the BM25 score and modifies it according to the BERT relevance score. These experiments are based on two-stage retrieval approaches, as neural models like BERT require substantial computing power. Eventually, the final part of the research will examine first-stage neural retrieval methods such as dense retrieval methods adapted to patent characteristics for prior art search.

Vasileios Stamatis
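The gate-based re-ranking described in the abstract above can be illustrated with a small sketch. The abstract does not give the gating formula, so the sigmoid gate multiplied into the BM25 score is an assumption made purely for illustration; the names `gated_score` and `rerank` and the candidate values are likewise hypothetical.

```python
import math

def gated_score(bm25_score: float, bert_logit: float) -> float:
    """Scale a BM25 score by a gate in (0, 1) computed from a BERT relevance logit.

    The sigmoid gate is an illustrative assumption, not the author's formula.
    """
    gate = 1.0 / (1.0 + math.exp(-bert_logit))
    return bm25_score * gate

def rerank(candidates):
    """Re-rank first-stage candidates given as (doc_id, bm25_score, bert_logit)."""
    scored = [(doc_id, gated_score(bm25, logit)) for doc_id, bm25, logit in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Made-up first-stage results: BM25 alone would rank p1 > p3 > p2.
candidates = [("p1", 12.0, -2.0), ("p2", 9.0, 3.0), ("p3", 10.0, 0.0)]
ranking = rerank(candidates)
ranked_ids = [doc_id for doc_id, _ in ranking]
```

With these toy values the gate demotes `p1`, whose BM25 score is high but whose assumed BERT logit is negative, and promotes `p2`, so the order becomes `p2`, `p3`, `p1`.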


Third International Workshop on Algorithmic Bias in Search and Recommendation (BIAS@ECIR2022)

Creating search and recommendation algorithms that are efficient and effective has been the main goal for the industry and the academia for years. However, recent research has shown that these algorithms lead to models, trained on historical data, that might exacerbate existing biases and generate potentially negative outcomes. Defining, assessing and mitigating these biases throughout experimental pipelines is hence a core step for devising search and recommendation algorithms that can be responsibly deployed in real-world applications. The Bias 2022 workshop aims to collect novel contributions in this field and offer a common ground for interested researchers and practitioners. The workshop website is available at .

Ludovico Boratto, Stefano Faralli, Mirko Marras, Giovanni Stilo
The 5th International Workshop on Narrative Extraction from Texts: Text2Story 2022

Narrative extraction, understanding, verification, and visualization are currently popular topics for users interested in achieving a deeper understanding of text, researchers who want to develop accurate methods for text mining, and commercial companies that strive to provide efficient tools for this purpose. Information Retrieval (IR), Natural Language Processing (NLP), Machine Learning (ML) and Computational Linguistics (CL) already offer many instruments that aid the exploration of narrative elements in text and within unstructured data. Despite evident advances in the last couple of years, the problem of automatically representing narratives in a structured form and interpreting them, beyond the conventional identification of common events, entities and their relationships, is yet to be solved. This workshop, held virtually on April 10th, 2022 in conjunction with the 44th European Conference on Information Retrieval (ECIR’22), aims at presenting and discussing current and future directions for IR, NLP, ML and other computational linguistics-related fields capable of improving the automatic understanding of narratives. It includes sessions devoted to research, demo, position papers, work-in-progress, project description, nectar, and negative results papers, keynote talks and space for an informal discussion of the methods, of the challenges and of the future of this research area.

Ricardo Campos, Alípio Jorge, Adam Jatowt, Sumit Bhatia, Marina Litvak
Augmented Intelligence in Technology-Assisted Review Systems (ALTARS 2022): Evaluation Metrics and Protocols for eDiscovery and Systematic Review Systems

In this workshop, we aim to fathom the effectiveness of Technology-Assisted Review Systems from different viewpoints. In fact, despite the number of evaluation measures at our disposal to assess the effectiveness of a “traditional” retrieval approach, there are additional dimensions of evaluation for these systems. For example, it is true that an effective high-recall system should be able to find the majority of relevant documents using the least number of assessments. However, this kind of evaluation usually discards the resources used to achieve this goal, such as the total time spent on those assessments, or the amount of money spent on the experts judging the documents.

Giorgio Maria Di Nunzio, Evangelos Kanoulas, Prasenjit Majumder
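The point made in the abstract above, that recall per number of assessments ignores time and money, can be made concrete with a short sketch. It reports recall after a fixed number of assessments next to a simple cost estimate; the review order, the relevance set, and the per-document time and hourly rate are hypothetical values, and the linear cost model is an assumption.

```python
def recall_at_effort(review_order, relevant, n_assessed):
    """Fraction of all relevant documents found within the first n_assessed reviews."""
    found = sum(1 for doc in review_order[:n_assessed] if doc in relevant)
    return found / len(relevant)

def review_cost(n_assessed, minutes_per_doc, hourly_rate):
    """Assumed linear cost model: assessor time multiplied by an hourly rate."""
    return n_assessed * minutes_per_doc / 60.0 * hourly_rate

# Made-up review session: documents in the order the system showed them.
review_order = ["d3", "d7", "d1", "d9", "d4", "d2"]
relevant = {"d3", "d1", "d2"}

recall_3 = recall_at_effort(review_order, relevant, 3)
cost_3 = review_cost(3, minutes_per_doc=2.0, hourly_rate=90.0)
recall_6 = recall_at_effort(review_order, relevant, 6)
cost_6 = review_cost(6, minutes_per_doc=2.0, hourly_rate=90.0)
```

Reporting the pairs `(recall_3, cost_3)` and `(recall_6, cost_6)` together shows what the final third of recall costs in this toy session.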
Bibliometric-enhanced Information Retrieval: 12th International BIR Workshop (BIR 2022)

The 12th iteration of the Bibliometric-enhanced Information Retrieval (BIR) workshop series is a full-day ECIR 2022 workshop. BIR tackles issues related to, for instance, academic search and recommendation, at the intersection of Information Retrieval, Natural Language Processing, and Bibliometrics. As an interdisciplinary scientific event, BIR brings together researchers and practitioners from the Scientometrics/Bibliometrics community on the one hand and the Information Retrieval community on the other hand. BIR is an ever-growing topic investigated by both academia and the industry.

Ingo Frommholz, Philipp Mayr, Guillaume Cabanac, Suzan Verberne
ROMCIR 2022: Overview of the 2nd Workshop on Reducing Online Misinformation Through Credible Information Retrieval

The ROMCIR 2022 workshop is focused on discussing and addressing issues related to information disorder, a new term that holistically encompasses all forms of communication pollution. In particular, the aim of ROMCIR is reducing such clutter, from false content to incorrect correlations, from misinformation to disinformation, through Information Retrieval solutions, by providing users with access to genuine information. This topic is very broad, as it concerns different contents (e.g., Web pages, news, reviews, medical information, online accounts, etc.), different Web and social media platforms (e.g., microblogging platforms, social networking services, social question-answering systems, etc.), and different purposes (e.g., identifying false information, accessing and retrieving information based on its genuineness, providing explainable solutions to users, etc.). Therefore, interdisciplinary input to ROMCIR is more than welcome.

Marinella Petrocchi, Marco Viviani


Online Advertising Incrementality Testing: Practical Lessons, Paid Search and Emerging Challenges

Online advertising has historically been approached as an ad-to-user matching problem within sophisticated optimization algorithms. As the research and ad tech industries have progressed, advertisers have increasingly emphasized the causal effect estimation of their ads (incrementality) using controlled experiments (A/B testing). With low lift effects and sparse conversion, the development of incrementality testing platforms at scale suggests tremendous engineering challenges in measurement precision. Similarly, the correct interpretation of results addressing a business goal requires significant data science and experimentation research expertise.
We propose a practical tutorial in the incrementality testing landscape, including:
– The business need
– Literature solutions and industry practices
– Designs in the development of testing platforms
– The testing cycle, case studies, and recommendations
– Paid search effectiveness in the marketplace
– Emerging privacy challenges for incrementality testing and research solutions
We provide first-hand lessons based on the development of such a platform in a major combined DSP and ad network, and after running several tests for up to two months each over recent years. With increasing privacy constraints, we survey literature and current practices. These practices include private set union and differential privacy for conversion modeling, and geo-testing combined with synthetic control techniques.

Joel Barajas, Narayan Bhamidipati, James G. Shanahan
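As a rough illustration of the incrementality estimate that the abstract above refers to, the following sketch computes relative lift from a treatment/control A/B test together with a standard two-proportion z-statistic. The function name `incrementality` and the conversion counts are hypothetical, and the unpooled normal approximation is a textbook choice rather than the method used on the authors' platform.

```python
import math

def incrementality(conv_t, n_t, conv_c, n_c):
    """Estimate ad lift from conversions in treatment (t) and control (c) groups."""
    rate_t = conv_t / n_t
    rate_c = conv_c / n_c
    diff = rate_t - rate_c
    # Unpooled standard error of the difference between two proportions.
    se = math.sqrt(rate_t * (1 - rate_t) / n_t + rate_c * (1 - rate_c) / n_c)
    return {"lift": diff / rate_c, "diff": diff, "z": diff / se}

# Made-up test with sparse conversions: 0.12% vs 0.10% conversion rate.
result = incrementality(conv_t=1200, n_t=1_000_000, conv_c=1000, n_c=1_000_000)
```

With these toy numbers the relative lift is 20%, yet the absolute difference is only 0.02 percentage points, which hints at why low lift and sparse conversions make measurement precision demanding at scale.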
From Fundamentals to Recent Advances: A Tutorial on Keyphrasification

Keyphrases represent the most important information in a text and often serve as a surrogate for efficiently summarizing text documents. With the advancement of deep neural networks, recent years have witnessed rapid development in automatic identification of keyphrases. The performance of keyphrase extraction methods has been greatly improved by the progress made in natural language understanding, enabling models to predict relevant phrases not mentioned in the text. We name the task of summarizing texts with phrases keyphrasification.
In this half-day tutorial, we provide a comprehensive overview of keyphrasification as well as hands-on practice with popular models and tools. This tutorial covers important topics ranging from basics of the task to the advanced topics and applications. By the end of the tutorial, participants will have a better understanding of 1) classical and state-of-the-art keyphrasification methods, 2) current evaluation practices and their issues, and 3) current trends and future directions in keyphrasification research. Tutorial-related resources are available at .

Rui Meng, Debanjan Mahata, Florian Boudin
Information Extraction from Social Media: A Hands-On Tutorial on Tasks, Data, and Open Source Tools

Information extraction (IE) is a common sub-area of natural language processing that focuses on identifying structured data from unstructured data. The community of Information Retrieval (IR) relies on accurate and high-performance IE to be able to retrieve high quality results from massive datasets. One example of IE is to identify named entities in a text, e.g., “Barack Obama served as the president of the USA”. Here, Barack Obama and USA are named entities of types PERSON and LOCATION, respectively. Another example is to identify sentiment expressed in a text, e.g., “This movie was awesome”. Here, the sentiment expressed is positive. A final example is identifying various linguistic aspects of a text, e.g., part of speech tags, noun phrases, dependency parses, etc., which can serve as features for additional IE tasks. This tutorial introduces participants to a) the usage of Python-based, open-source tools that support IE from social media data (mainly Twitter), and b) best practices for ensuring the reproducibility of research. Participants will learn and practice various semantic and syntactic IE techniques that are commonly used for analyzing tweets. Additionally, participants will be familiarized with the landscape of publicly available tweet data, and methods for collecting and preparing them for analysis. Finally, participants will be trained to use a suite of open source tools (SAIL for active learning, TwitterNER for named entity recognition, and SocialMediaIE for multi-task learning), which utilize advanced machine learning techniques (e.g., deep learning, active learning with human-in-the-loop, multi-lingual, and multi-task learning) to perform IE on their own or existing datasets. Participants will also learn how social context can be integrated in Information Extraction systems to make them better.
The tools introduced in the tutorial will focus on the three main stages of IE, namely, collection of data (including annotation), data processing and analytics, and visualization of the extracted information. More details can be found at: .

Shubhanshu Mishra, Rezvaneh Rezapour, Jana Diesner
ECIR 2022 Tutorial: Technology-Assisted Review for High Recall Retrieval

Human-in-the-loop (HITL) IR workflows are being applied to an increasing range of tasks in the law, medicine, social media, and other areas.

Eugene Yang, Jeremy Pickens, David D. Lewis