
2025 | Book

Advances in Information Retrieval

47th European Conference on Information Retrieval, ECIR 2025, Lucca, Italy, April 6–10, 2025, Proceedings, Part IV

Edited by: Claudia Hauff, Craig Macdonald, Dietmar Jannach, Gabriella Kazai, Franco Maria Nardini, Fabio Pinelli, Fabrizio Silvestri, Nicola Tonellotto

Publisher: Springer Nature Switzerland

Book Series: Lecture Notes in Computer Science


About this book

The five-volume set LNCS 15572, 15573, 15574, 15575 and 15576 constitutes the refereed proceedings of the 47th European Conference on Information Retrieval, ECIR 2025, held in Lucca, Italy, during April 6–10, 2025. The 52 full papers, 11 findings papers, 42 short papers, and 76 papers of other types presented in these proceedings were carefully reviewed and selected from 530 submissions. The accepted papers cover the state of the art in information retrieval and recommender systems: user aspects, system and foundational aspects, artificial intelligence and machine learning, applications, evaluation, new social and technical challenges, and other topics of direct or indirect relevance to search and recommendation.

Table of Contents

Frontmatter
FAIR-QR: Enhancing Fairness-Aware Information Retrieval Through Query Refinement

Information retrieval systems such as open web search and recommendation systems are ubiquitous and significantly impact how people receive and consume online information. Previous research has shown the importance of fairness in information retrieval systems to combat the issue of echo chambers and mitigate the rich-get-richer effect. Therefore, various fairness-aware information retrieval methods have been proposed. Score-based fairness-aware information retrieval algorithms, focusing on statistical parity, are interpretable but could be mathematically infeasible and lack generalizability. In contrast, learning-to-rank-based fairness-aware information retrieval algorithms using fairness-aware loss functions demonstrate strong performance but lack interpretability. In this study, we propose a novel and interpretable framework that recursively refines query keywords to retrieve documents from underrepresented groups and achieve group fairness. Documents retrieved using the refined queries are then re-ranked to ensure relevance. Our method not only shows promising retrieval results regarding relevance and fairness but also preserves interpretability by showing the refined keywords used at each iteration.

Fumian Chen, Hui Fang
CLASP: Contrastive Language-Speech Pretraining for Multilingual Multimodal Information Retrieval

This study introduces CLASP (Contrastive Language-Speech Pretraining), a multilingual, multimodal representation tailored for audio-text information retrieval. CLASP leverages the synergy between spoken content and textual data. During training, we utilize our newly introduced speech-text dataset, which encompasses 15 diverse categories ranging from fiction to religion. CLASP’s audio component integrates audio spectrograms with a pre-trained self-supervised speech model, while its language encoding counterpart employs a sentence encoder pre-trained on over 100 languages. This unified lightweight model bridges the gap between various modalities and languages, enhancing its effectiveness in handling and retrieving multilingual and multimodal data. Our evaluations across multiple languages demonstrate that CLASP establishes new benchmarks in HITS@1, MRR, and meanR metrics, outperforming traditional ASR-based retrieval methods that rely on transcribing speech into text for subsequent text retrieval, especially in specific scenarios.

Mohammad Mahdi Abootorabi, Ehsaneddin Asgari
ColBERT-Serve: Efficient Multi-stage Memory-Mapped Scoring

We study serving retrieval models, particularly late interaction retrievers like ColBERT, to many concurrent users at once and under a small budget, in which the index may not fit in memory. We present ColBERT-serve, a serving system that applies a memory-mapping strategy to the ColBERT index, reducing RAM usage by 90% and permitting its deployment on cheap servers, and incorporates a multi-stage architecture with hybrid scoring, reducing ColBERT’s query latency and supporting many concurrent queries in parallel.

Kaili Huang, Thejas Venkatesh, Uma Dingankar, Antonio Mallia, Daniel Campos, Jian Jiao, Christopher Potts, Matei Zaharia, Kwabena Boahen, Omar Khattab, Saarthak Sarup, Keshav Santhanam
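As a rough, self-contained illustration of the memory-mapping idea described in the entry above (a numpy sketch under toy assumptions, not the ColBERT-serve implementation), the on-disk token-embedding matrix is memory-mapped so that scoring a candidate document only pages in that document's rows:

# Hedged sketch of memory-mapped late-interaction (MaxSim) scoring over a toy
# on-disk index; synthetic embeddings, not the ColBERT-serve codebase.
import numpy as np

dim, n_doc_tokens = 128, 100_000
# Build a toy on-disk token-embedding matrix once (stand-in for a ColBERT index).
mm = np.lib.format.open_memmap("index.npy", mode="w+", dtype=np.float32,
                               shape=(n_doc_tokens, dim))
mm[:] = np.random.randn(n_doc_tokens, dim).astype(np.float32)
mm.flush()

# At serving time the index is memory-mapped: the OS pages in only the rows
# that scoring actually touches, so resident RAM stays far below the index size.
index = np.load("index.npy", mmap_mode="r")

def maxsim(query_emb, doc_rows):
    """Late-interaction (MaxSim) score: per query token, take the max dot product
    over the document's token embeddings, then sum over query tokens."""
    doc_emb = np.asarray(index[doc_rows])        # loads only this document's rows
    return float((query_emb @ doc_emb.T).max(axis=1).sum())

q = np.random.randn(32, dim).astype(np.float32)  # 32 query-token embeddings
print(maxsim(q, np.arange(120, 180)))            # score one candidate document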
Are Representation Disentanglement and Interpretability Linked in Recommendation Models?
A Critical Review and Reproducibility Study

Unsupervised learning of disentangled representations has been closely tied to enhancing the representation interpretability of Recommender Systems (RSs). This has been achieved by making the representation of individual features more distinctly separated, so that it is easier to attribute the contribution of features to the model’s predictions. However, such advantages in interpretability and feature attribution have mainly been explored qualitatively. Moreover, the effect of disentanglement on the model’s recommendation performance has been largely overlooked. In this work, we reproduce the recommendation performance, representation disentanglement and representation interpretability of five well-known recommendation models on four RSs datasets. We quantify disentanglement and investigate the link of disentanglement with recommendation effectiveness and representation interpretability. While existing work in RSs has proposed disentangled representations as a gateway to improved effectiveness and interpretability, our findings show that disentanglement is not necessarily related to effectiveness but is closely related to representation interpretability. Our code and results are publicly available at https://github.com/edervishaj/disentanglement-interpretability-recsys .

Ervin Dervishaj, Tuukka Ruotsalo, Maria Maistro, Christina Lioma
A Reproducibility Study on Consistent LLM Reasoning for Natural Language Inference over Clinical Trials

With the rapid expansion of AI in healthcare, ensuring that language models can reason accurately and consistently within the medical domain is essential for enhancing clinical decision-making. Consistent reasoning is particularly challenging, since once a model outputs a judgment for a given medical statement, it should retain that judgment when faced with a syntactically altered version of the statement. In addition, when the altered version of the statement reflects a semantic shift, the model should adjust its judgment accordingly. In this paper, we describe the process of reproducing state-of-the-art methods for safe biomedical Natural Language Inference for Clinical Trials (NLI4CT), emphasizing model vulnerability to small input variations and the inherent complexity of clinical trial reports (CTRs). We specifically evaluate the reasoning capabilities of Large Language Models (LLMs) on the SemEval-2024 NLI4CT dataset, thus considering a task that focuses on robustness and serves as a good proxy for real-world applications. To improve thoroughness, we extend this study and explore a broader set of techniques, establishing baseline scores for several widely used models. We conclude with an analysis of the results, highlighting key insights and empirical lessons that contribute to future research in this domain.

Artur Guimarães, João Magalhães, Bruno Martins
On the Reproducibility of Learned Sparse Retrieval Adaptations for Long Documents

Document retrieval is one of the most challenging tasks in Information Retrieval. It requires handling longer contexts, often resulting in higher query latency and increased computational overhead. Recently, Learned Sparse Retrieval (LSR) has emerged as a promising approach to address these challenges. Some have proposed adapting the LSR approach to longer documents by aggregating segmented documents using different post-hoc methods, including n-grams and proximity scores, adjusting representations, and learning to ensemble all signals. In this study, we aim to reproduce and examine the mechanisms of adapting LSR for long documents. Our reproducibility experiments confirmed the importance of specific segments, with the first segment consistently dominating document retrieval performance. Furthermore, we re-evaluate recently proposed methods – ExactSDM and SoftSDM – across varying document lengths, from short (up to 2 segments) to longer (3+ segments). We also designed multiple analyses to probe the reproduced methods and shed light on the impact of global information on adapting LSR to longer contexts. The complete code and implementation for this project are available at: https://github.com/lionisakis/Reproducibilitiy-lsr-long .

Emmanouil Georgios Lionis, Jia-Huei Ju
Fact vs. Fiction: Are the Reportedly “Magical” LLM-Based Recommenders Reproducible?

Reproducibility is a cornerstone of scientific progress, allowing researchers to validate findings and build upon previous work. While reproducibility has been an important issue in traditional recommender systems, the rise of Large Language Model (LLM)-based recommendation systems introduces new challenges, particularly in top-N recommendation tasks. In this study, we investigate the reproducibility of state-of-the-art LLM-based recommendation systems. We categorize key factors affecting the reproducibility of recommendation performance into four groups: code, data, methodological details, and evaluation. Our findings highlight significant performance fluctuations based on these factors, emphasizing the need for these factors to be clearly documented and considered during evaluations. To enhance reproducibility, we propose LLMReClarify, a comprehensive set of guidelines adapted from the NeurIPS reproducibility checklist, tailored specifically for LLM-based recommendation systems.

Shirin Tahmasebi, Narjes Nikzad, Amir H. Payberah, Meysam Asgari-chenaghlu, Mihhail Matskin
Reproducing HotFlip for Corpus Poisoning Attacks in Dense Retrieval

HotFlip is a topical gradient-based word substitution method for attacking language models. Recently, this method has been further applied to attack retrieval systems by generating malicious passages that are injected into a corpus, i.e., corpus poisoning. However, HotFlip is known to be computationally inefficient, with the majority of time being spent on gradient accumulation for each query-passage pair during the adversarial token generation phase, making it impossible to generate an adequate number of adversarial passages in a reasonable amount of time. Moreover, the attack method itself assumes access to a set of user queries, a strong assumption that does not correspond to how real-world adversarial attacks are usually performed. In this paper, we first significantly boost the efficiency of HotFlip, reducing the adversarial generation process from 4 h per document to only 15 min, using the same hardware. We further contribute experiments and analysis on two additional tasks: (1) transfer-based black-box attacks, and (2) query-agnostic attacks. Whenever possible, we provide comparisons between the original method and our improved version. Our experiments demonstrate that HotFlip can effectively attack a variety of dense retrievers, with an observed trend that its attack performance diminishes against more advanced and recent methods. Interestingly, we observe that while HotFlip performs poorly in a black-box setting, indicating limited capacity for generalization, in query-agnostic scenarios its performance is correlated to the volume of injected adversarial passages.

Yongkang Li, Panagiotis Eustratiadis, Evangelos Kanoulas
Combining Query Performance Predictors: A Reproducibility Study

A large number of approaches to Query Performance Prediction (QPP) have been proposed over the last two decades. As early as 2009, Hauff et al. [28] explored whether different QPP methods may be combined to improve prediction quality. Since then, significant research has been done both on QPP approaches, as well as their evaluation. This study revisits Hauff et al.’s work to assess the reproducibility of their findings in the light of new prediction methods, evaluation metrics, and datasets. We expand the scope of the earlier investigation by: (i) considering post-retrieval methods, including supervised neural techniques (only pre-retrieval techniques were studied in [28]); (ii) using sMARE for evaluation, in addition to the traditional correlation coefficients and RMSE; and (iii) experimenting with additional datasets (Clueweb09B and TREC DL). Our results largely support previous claims, but we also present several interesting findings. We interpret these findings by taking a more nuanced look at the correlation between QPP methods, examining whether they capture diverse information or rely on overlapping factors.

Sourav Saha, Suchana Datta, Dwaipayan Roy, Mandar Mitra, Derek Greene
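To make the idea of combining predictors concrete, here is a toy sketch with synthetic scores only: two hypothetical predictors (labelled "clarity" and "nqc" purely for illustration, not the paper's actual methods or data) are combined by a linear fit on held-out queries, and each variant is evaluated by its Kendall correlation with the true per-query average precision.

# Toy sketch: combine two query performance predictors with a learned linear
# combination and compare rank correlations against true per-query AP.
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
n_queries = 200
true_ap = rng.uniform(0.0, 1.0, n_queries)                 # ground-truth effectiveness
pred_clarity = true_ap + rng.normal(0, 0.25, n_queries)    # noisy predictor 1 (synthetic)
pred_nqc     = true_ap + rng.normal(0, 0.30, n_queries)    # noisy predictor 2 (synthetic)

X = np.column_stack([pred_clarity, pred_nqc, np.ones(n_queries)])
train, test = np.arange(0, 150), np.arange(150, 200)
w, *_ = np.linalg.lstsq(X[train], true_ap[train], rcond=None)  # fit combination weights
combined = X[test] @ w

for name, scores in [("clarity", pred_clarity[test]),
                     ("nqc", pred_nqc[test]),
                     ("combined", combined)]:
    tau, _ = kendalltau(scores, true_ap[test])
    print(f"{name:9s} Kendall tau = {tau:.3f}")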
A Reproducibility Study for Joint Information Retrieval and Recommendation in Product Search

Information Retrieval (IR) systems and Recommender Systems (RS) are ubiquitous commodities, essential to satisfy users’ information needs in digital environments. These two classes of systems are traditionally treated as two isolated components with limited, if any, interaction. Recent studies showed that jointly operating retrieval and recommendation allows for improved performance on both tasks. In this regard, the state-of-the-art is represented by the Unified Information Access (UIA) framework. In this work, we analyse the UIA framework from the reproducibility, replicability and generalizability sides. To do this, we first reproduce the original results the UIA framework achieved – highlighting a good reproducibility degree. Then we examine the behavior of UIA when using a public dataset – discovering that UIA is not always replicable. Moreover, to further investigate the generalizability of the UIA framework, we introduce some changes in its data processing and training procedures. Our empirical assessment highlights that the robustness and effectiveness of the UIA framework depend on several factors. In particular, some tasks, such as the Keyword Search, appear to be more robust, while others, such as Complementary Item Retrieval, are more vulnerable to changes in the underlying training process.

Simone Merlo, Guglielmo Faggioli, Nicola Ferro
Towards Reproducibility of Interactive Retrieval Experiments: Framework and Case Study

In interactive information retrieval (IIR), the inherent variability in user behavior and the complexities of user-system interactions pose significant challenges to reproducibility, which remain largely underexplored. To address these challenges, we propose a three-level model for evaluating the reproducibility of IIR experiments through the similarity of experimental findings, underlying measurements, and user behavior across original and reproduction studies. For each level, we introduce specific criteria, offering a structured framework for assessing reproducibility in IIR research. We demonstrate the framework’s utility in a case study where we simulate ideal reproductions by repeatedly dividing the sessions from a user experiment into two groups, treating one as the original and the other as the reproduction, while maintaining consistent experimental conditions. This approach enables us to focus on IIR-specific effects within our methodology. Our analysis indicates that the high variability in user behavior can hinder successful reproductions, raising concerns about the reliability of experimental results. Borderline significant results, in particular, are frequently not reproducible, despite an optimal reproduction setup. Furthermore, we find that cognitive abilities significantly influence search behavior, emphasizing the need to account for these characteristics in the design of future tests.

Jana Isabelle Friese, Norbert Fuhr
Revisiting Language Models in Neural News Recommender Systems

Neural news recommender systems (RSs) have integrated language models (LMs) to encode news articles with rich textual information into representations, thereby improving the recommendation process. Most studies suggest that (i) news RSs achieve better performance with larger pre-trained language models (PLMs) than shallow language models (SLMs), and (ii) that large language models (LLMs) outperform PLMs. However, other studies indicate that PLMs sometimes lead to worse performance than SLMs. Thus, it remains unclear whether using larger LMs consistently improves the performance of news RSs. In this paper, we revisit, unify, and extend these comparisons of the effectiveness of LMs in news RSs using the real-world MIND dataset. We find that (i) larger LMs do not necessarily translate to better performance in news RSs, and (ii) they require stricter fine-tuning hyperparameter selection and greater computational resources to achieve optimal recommendation performance than smaller LMs. On the positive side, our experiments show that larger LMs lead to better recommendation performance for cold-start users: they alleviate dependency on extensive user interaction history and make recommendations more reliant on the news content.

Yuyue Zhao, Jin Huang, David Vos, Maarten de Rijke
Multimodal Feature Extraction for Assistive Technology: Evaluation and Dataset

Assistive technology (AT) is an important real-world context that provides fertile ground for search and recommendation research. Identifying suitable equipment that benefits disabled users is a crucial challenge connected to all 17 UN Sustainable Development Goals. AT information retrieval tools should be as comprehensive, flexible, and accessible as possible to end users without specialist knowledge. Recent advances in AI/ML support new opportunities to enable users to find AT, facilitating state and international legislative objectives which require inclusive design. This work contributes a new collection for AT retrieval research derived from a production database. We also explore the proper evaluation of a state-of-the-art visual-linguistic model using information gain with multimodality assessed through ablation. Preliminary experiments suggest multimodal representation of AT items for generating text-based features is superior to either text or images alone. In the case of abstractions, such as item “goals”, there is little difference between singular text or image data; extended product descriptions benefit from the saliency of text. Secondly, human assessment exhibits only slight deviation from an LLM’s featurisation of products. Together, these findings provide new resources for downstream application development and benchmarking of future information retrieval exercises for AT.

Hunter Briegel, Maya Pagal, Jacki Liddle, Shane Culpepper
Improving Novelty and Diversity of Nearest-Neighbors Recommendation by Exploiting Dissimilarities

Neighborhood-based approaches remain widely used techniques in collaborative filtering recommender systems due to their versatility, simplicity, and efficiency. Traditionally, these algorithms consider similarity functions to measure how close user or item interactions are. However, their focus on capturing similar tastes often overlooks divergent preferences that could enhance recommendations. In this paper, we explore alternative methods to incorporate such information to improve beyond-accuracy performance in this type of recommenders. We define three mechanisms based on various modeling assumptions to integrate differing preferences into traditional nearest neighbors algorithms. Our comparison on four well-known and different datasets shows that our proposed approach can enhance the novelty and diversity of the recommendations while maintaining ranking accuracy. Our implementation is available at https://github.com/pablosanchezp/kNNDissimilarities .

Pablo Sánchez, Javier Sanz-Cruzado, Alejandro Bellogín
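The paper above defines three specific mechanisms for exploiting dissimilarities; as a loose, hypothetical illustration of the general idea only (not any of those three mechanisms), the sketch below lets the least similar raters act as negative evidence inside a plain user-based kNN predictor.

# Hedged toy sketch: dissimilar users pull the prediction away from their own
# ratings. Illustrative heuristic only, not the paper's proposed mechanisms.
import numpy as np

R = np.array([[5, 4, 0, 1, 2],       # toy user-item rating matrix, 0 = unrated
              [4, 5, 1, 1, 1],
              [5, 4, 2, 2, 1],
              [1, 2, 5, 5, 4],
              [2, 1, 4, 5, 5]], dtype=float)

def cosine(u, v):
    mask = (u > 0) & (v > 0)                       # only co-rated items
    if not mask.any():
        return 0.0
    return float(u[mask] @ v[mask] /
                 (np.linalg.norm(u[mask]) * np.linalg.norm(v[mask]) + 1e-9))

def predict(user, item, k=2, beta=0.3):
    others = [v for v in range(len(R)) if v != user and R[v, item] > 0]
    sims = {v: cosine(R[user], R[v]) for v in others}
    ranked = sorted(others, key=lambda v: -sims[v])
    nearest, farthest = ranked[:k], ranked[-k:]
    pos = np.mean([R[v, item] for v in nearest])   # similar users vote for their rating
    neg = np.mean([R[v, item] for v in farthest])  # dissimilar users vote against theirs
    return (1 + beta) * pos - beta * neg

print(round(predict(user=0, item=2), 2))           # predict user 0's rating for item 2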
LambdaFair for Fair and Effective Ranking

Traditional machine learning algorithms are known to amplify bias in data or introduce new biases during the learning process, often resulting in discriminatory outcomes that impact individuals from marginalized or underrepresented groups. In information retrieval, one application of machine learning is learning-to-rank frameworks, typically employed to reorder items based on their relevance to user interests. This focus on effectiveness can lead to rankings that unevenly distribute exposure among groups, affecting their visibility to the final user. Consequently, ensuring fair treatment of protected groups has become a pivotal challenge in information retrieval to prevent discrimination, alongside the need to maximize ranking effectiveness. This work introduces LambdaFair, a novel in-processing method designed to jointly optimize effectiveness and fairness ranking metrics. LambdaFair builds upon the LambdaMART algorithm, harnessing its ability to train highly effective models through additive ensembles of decision trees while integrating fairness awareness. We evaluate LambdaFair on three publicly available datasets, comparing its performance with state-of-the-art learning algorithms in terms of both fairness and effectiveness. Our experiments demonstrate that, on average, LambdaFair achieves 6.7% higher effectiveness and only 0.4% lower fairness compared to state-of-the-art fairness-oriented learning algorithms. This highlights LambdaFair’s ability to improve fairness without sacrificing the model’s effectiveness.

Federico Marcuzzi, Claudio Lucchese, Salvatore Orlando
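For context, the standard LambdaMART pairwise gradients that LambdaFair builds on take the form below (Burges' formulation with NDCG as the metric; the fairness-aware component that LambdaFair adds is described in the paper itself):

$$\lambda_{ij} \;=\; \frac{-\sigma}{1 + e^{\,\sigma (s_i - s_j)}}\,\bigl|\Delta\mathrm{NDCG}_{ij}\bigr|, \qquad \lambda_i \;=\; \sum_{j:(i,j)\in I} \lambda_{ij} \;-\; \sum_{j:(j,i)\in I} \lambda_{ji},$$

where $s_i$ is the model's current score for document $i$, $I$ is the set of pairs in which the first document should be ranked above the second, and $|\Delta\mathrm{NDCG}_{ij}|$ is the change in NDCG obtained by swapping the two documents.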
How Child-Friendly is Web Search? An Evaluation of Relevance vs. Harm

Today’s children grow up with easy access to the Web and respective search engines. However, when children try to inform themselves about current conflicts, crimes, etc., general web search engines might show relevant but inappropriate content. Instead, search engines specifically aimed at children as searchers try to filter out inappropriate content or even operate on much smaller, manually curated indexes of appropriate documents only. To understand whether this is effective, we compare three general and three child-oriented web search engines based on a new German evaluation corpus of 50 queries spanning personal, political, educational, and entertainment information needs of children. For each query, we annotate the search engines’ top-10 results with respect to relevance and potential harm—a child-friendly result should be relevant but harmless. Our comparison shows that the child-oriented search engines effectively remove potentially harmful documents, while the general web search engines return more relevant documents at the expense of significantly more potentially harmful content.

Maik Fröbe, Sophie Charlotte Bartholly, Matthias Hagen
GASCADE: Grouped Summarization of Adverse Drug Event for Enhanced Cancer Pharmacovigilance

In the realm of cancer treatment, summarizing adverse drug events (ADEs) reported by patients using prescribed drugs is crucial for enhancing pharmacovigilance practices and improving drug-related decision-making. While the volume and complexity of pharmacovigilance data have increased, existing research in this field has predominantly focused on general diseases rather than specifically addressing cancer. This work introduces the task of grouped summarization of adverse drug events reported by multiple patients using the same drug for cancer treatment. To address the challenge of limited resources in cancer pharmacovigilance, we present the MultiLabeled Cancer Adverse Drug Reaction and Summarization (MCADRS) dataset. This dataset includes pharmacovigilance posts detailing patient concerns regarding drug efficacy and adverse effects, along with extracted labels for drug names, adverse drug events, severity, and adversity of reactions, as well as summaries of ADEs for each drug. Additionally, we propose the Grouping and Abstractive Summarization of Cancer Adverse Drug events (GASCADE) framework, a novel pipeline that combines the information extraction capabilities of Large Language Models (LLMs) with the summarization power of the encoder-decoder T5 model. Our work is the first to apply alignment techniques, including advanced algorithms like Direct Preference Optimization, to encoder-decoder models using synthetic datasets for summarization tasks. Through extensive experiments, we demonstrate the superior performance of GASCADE across various metrics, validated through both automated assessments and human evaluations. This multitasking approach enhances drug-related decision-making and fosters a deeper understanding of patient concerns, paving the way for advancements in personalized and responsive cancer care. The code and dataset used in this work are publicly available. ( https://github.com/SofeeyaJ/GASCADE_ECIR2025 ).

Sofia Jamil, Aryan Dabad, Bollampalli Areen Reddy, Sriparna Saha, Rajiv Misra, Adil A. Shakur
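For reference, the standard Direct Preference Optimization objective mentioned above is reproduced below (Rafailov et al.'s formulation; the paper may instantiate it differently for the encoder-decoder summarizer):

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) \;=\; -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right],$$

where $y_w$ and $y_l$ are the preferred and dispreferred summaries for prompt $x$, $\pi_{\mathrm{ref}}$ is the frozen reference model, and $\beta$ controls the strength of the preference margin.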
Poison-RAG: Adversarial Data Poisoning Attacks on Retrieval-Augmented Generation in Recommender Systems

This study presents Poison-RAG, a framework for adversarial data poisoning attacks targeting retrieval-augmented generation (RAG)-based recommender systems. Poison-RAG manipulates item metadata, such as tags and descriptions, to influence recommendation outcomes. Using item metadata generated through a large language model (LLM) and embeddings derived via the OpenAI API, we explore the impact of adversarial poisoning attacks on the provider side, where attacks are designed to promote long-tail items and demote popular ones. Two attack strategies are proposed: local modifications, which personalize tags for each item using BERT embeddings, and global modifications, applying uniform tags across the dataset. Experiments conducted on the MovieLens dataset in a black-box setting reveal that local strategies improve manipulation effectiveness by up to 50%, while global strategies risk boosting already popular items. The results indicate that popular items are more susceptible to attacks, whereas long-tail items are harder to manipulate. Approximately 70% of items lack tags, presenting a cold-start challenge; data augmentation and synthesis are proposed as potential defense mechanisms to enhance RAG-based systems’ resilience. The findings emphasize the need for robust metadata management to safeguard recommendation frameworks. Code and data are available at https://github.com/atenanaz/Poison-RAG .

Fatemeh Nazary, Yashar Deldjoo, Tommaso di Noia
Tales and Truths: Exploring the Linguistic Journey of 19th Century Literature and Non-fiction

In this work, we explore the potential of using the lens of information retrieval to reveal societal themes within historical texts. We specifically investigate how term usage evolves over time in 19th-century texts categorised as either fiction or non-fiction. By applying Pseudo-relevance Feedback to a collection of texts from the British Library, segmented by decade, we analyse changes in related terms over time within each category. Our analysis employs standard metrics, such as Kendall’s τ, Jaccard similarity, and Jensen-Shannon divergence, to assess overlaps and shifts in these expanded term sets. The results reveal significant divergences in related terms across decades, highlighting key linguistic and conceptual changes during the 19th century.

Suchana Datta, Dwaipayan Roy, Derek Greene, Gerardine Meaney
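As a toy illustration of the three overlap/shift measures named in the entry above, applied to two ranked expansion-term lists from different decades (the terms below are invented for the example, not from the British Library collection):

# Jaccard similarity, Kendall's tau, and Jensen-Shannon divergence between two
# ranked term lists; synthetic terms only.
import numpy as np
from scipy.stats import kendalltau
from scipy.spatial.distance import jensenshannon

decade_a = ["railway", "steam", "empire", "mill", "telegraph"]     # ranked expansion terms
decade_b = ["railway", "telegraph", "empire", "factory", "steam"]

inter = set(decade_a) & set(decade_b)
union = set(decade_a) | set(decade_b)
print("Jaccard:", len(inter) / len(union))                         # overlap of the term sets

shared = sorted(inter)
ranks_a = [decade_a.index(t) for t in shared]
ranks_b = [decade_b.index(t) for t in shared]
print("Kendall tau:", kendalltau(ranks_a, ranks_b)[0])             # rank agreement on shared terms

vocab = sorted(union)
def dist(terms):
    # simple rank-based term-weight distribution over the joint vocabulary
    w = np.array([1.0 / (terms.index(t) + 1) if t in terms else 0.0 for t in vocab])
    return w / w.sum()
# jensenshannon returns the JS distance; squaring gives the divergence
print("JS divergence:", jensenshannon(dist(decade_a), dist(decade_b)) ** 2)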
Fair Exposure Allocation Using Generative Query Expansion

Automatically rewriting a user’s query using traditional pseudo-relevance feedback (PRF) mechanisms typically increases a search system’s effectiveness in retrieving relevant documents. With recent advances in the generative capabilities of Large Language Models (LLMs), the effectiveness of PRF mechanisms that leverage LLMs has significantly improved. However, little previous work has explored the potential impact of generative relevance feedback on the fairness of the search results. In this work, we investigate how generative PRF for query rewriting influences the fair allocation of exposure in the search results. We propose a novel generative PRF mechanism for fairness using automatically generated query expansion terms, which we call Fair Generative Query Expansion (FGQE). We investigate four prompting strategies and show that FGQE can effectively be applied using zero-shot long-text generation to create effective new query terms. Our experiments on the TREC 2021 and the TREC 2022 Fair Ranking Track collections demonstrate that all of our prompting strategies enhance exposure allocation compared to both traditional and dense PRF baselines, achieving improvements of up to approximately 8% in terms of the Attention Weighted Ranked Fairness (AWRF) metric. Simultaneously, our FGQE approach enhances fairness while maintaining the relevance of search results.

Thomas Jaenich, Graham McDonald, Iadh Ounis
Enabling Low-Resource Language Retrieval: Establishing Baselines for Urdu MS MARCO

As the Information Retrieval (IR) field increasingly recognizes the importance of inclusivity, addressing the needs of low-resource languages remains a significant challenge. This paper introduces the first large-scale Urdu IR dataset, created by translating the MS MARCO dataset through machine translation. We establish baseline results through zero-shot learning for IR in Urdu and subsequently apply the mMARCO multilingual IR methodology to this newly translated dataset. Our findings demonstrate that the fine-tuned model (Urdu-mT5-mMARCO) achieves a Mean Reciprocal Rank (MRR@10) of 0.247 and a Recall@10 of 0.439, representing significant improvements over zero-shot results and showing the potential for expanding IR access for Urdu speakers. By bridging access gaps for speakers of low-resource languages, this work not only advances multilingual IR research but also emphasizes the ethical and societal importance of inclusive IR technologies. This work provides valuable insights into the challenges and solutions for improving language representation and lays the groundwork for future research, especially in South Asian languages, which can benefit from the adaptable methods used in this study.

Umer Butt, Stalin Varanasi, Günter Neumann
Improving Low-Resource Retrieval Effectiveness Using Zero-Shot Linguistic Similarity Transfer

Globalisation and colonisation have led the vast majority of the world to use only a fraction of languages, such as English and French, to communicate, excluding many others. This has severely affected the survivability of many now-deemed vulnerable or endangered languages, such as Occitan and Sicilian. These languages often share some characteristics, such as elements of their grammar and lexicon, with other high-resource languages, e.g. French or Italian. They can be clustered into groups of language varieties with various degrees of mutual intelligibility. Current search systems are not usually trained on many of these low-resource varieties, leading search users to express their needs in a high-resource language instead. This problem is further complicated when most information content is expressed in a high-resource language, inhibiting even more retrieval in low-resource languages. We show that current search systems are not robust across language varieties, severely affecting retrieval effectiveness. Therefore, it would be desirable for these systems to leverage the capabilities of neural models to bridge the differences between these varieties. This can allow users to express their needs in their low-resource variety and retrieve the most relevant documents in a high-resource one. To address this, we propose fine-tuning neural rankers on pairs of language varieties, thereby exposing them to their linguistic similarities ( https://github.com/andreaschari/linguistic-transfer ). We find that this approach improves the performance of the varieties upon which the models were directly trained, thereby regularising these models to generalise and perform better even on unseen language variety pairs. We also explore whether this approach can transfer across language families and observe mixed results that open doors for future research.

Andreas Chari, Sean MacAvaney, Iadh Ounis
How to Diversify any Personalized Recommender?

In this paper, we introduce a novel approach to improve the diversity of Top-N recommendations while maintaining accuracy. Our approach employs a user-centric pre-processing strategy aimed at exposing users to a wide array of content categories and topics. We personalize this strategy by selectively adding and removing a percentage of interactions from user profiles. This personalization ensures we remain closely aligned with user preferences while gradually introducing distribution shifts. Our pre-processing technique offers flexibility and can seamlessly integrate into any recommender architecture. We run extensive experiments on two publicly available data sets for news and book recommendations to evaluate our approach. We test various standard and neural network-based recommender system algorithms. Our results show that our approach generates diverse recommendations, ensuring users are exposed to a wider range of items. Furthermore, using pre-processed data for training leads to recommender systems achieving performance levels comparable to, and in some cases, better than those trained on original, unmodified data. Additionally, our approach promotes provider fairness by facilitating exposure to minority categories. (Our GitHub code is available at: https://github.com/SlokomManel/How-to-Diversify-any-Personalized-Recommender- ).

Manel Slokom, Savvina Daniil, Laura Hollink
Nano-ESG: Extracting Corporate Sustainability Information from News Articles

Determining the sustainability impact of companies is a highly complex subject which has garnered more and more attention over the past few years. Today, investors largely rely on sustainability ratings from established rating providers in order to analyze how responsibly a company acts. However, those ratings have recently been criticized for being hard to understand and nearly impossible to reproduce. An independent way to find out about the sustainability practices of companies lies in the rich landscape of news article data. In this paper, we explore a different approach to identify key opportunities and challenges of companies in the sustainability domain. We present a novel dataset of German and English news articles which were gathered for major German companies between January 2023 and September 2024. By applying a mixture of Natural Language Processing techniques, we first identify relevant articles, before summarizing them and extracting their sustainability-related sentiment and aspect using Large Language Models (LLMs). Furthermore, we conduct an evaluation of the obtained data and determine that the LLM-produced answers are accurate. We release both datasets at https://github.com/Bailefan/Nano-ESG .

Fabian Billert, Stefan Conrad
Verifying Cross-Modal Entity Consistency in News Using Vision-Language Models

The web has become a crucial source of information, but it is also used to spread disinformation, often conveyed through multiple modalities like images and text. The identification of inconsistent cross-modal information, in particular entities such as persons, locations, and events, is critical to detect disinformation. Previous works either identify out-of-context disinformation by assessing the consistency of images to the whole document, neglecting relations of individual entities, or focus on generic entities that are not relevant to news. So far, only few approaches have addressed the task of validating entity consistency between images and text in news. However, the potential of large vision-language models (LVLMs) has not been explored yet. In this paper, we propose an LVLM-based framework for verifying cross-modal entity consistency (LVLM4CEC), to assess whether persons, locations and events in news articles are consistent across both modalities. We suggest effective prompting strategies for LVLMs for entity verification that leverage reference images crawled from the web. Moreover, we extend three existing datasets for the task of entity verification in news providing manual ground-truth data. Our results show the potential of LVLMs for automating cross-modal entity verification, showing improved accuracy in identifying persons and events when using evidence images. Moreover, our method outperforms a baseline for location and event verification in documents. The datasets and source code are available on GitHub at https://github.com/TIBHannover/LVLM4CEC .

Sahar Tahmasebi, Eric Müller-Budack, Ralph Ewerth
Improving Minimax Group Fairness in Sequential Recommendation

Training sequential recommenders such as SASRec with uniform sample weights achieves good overall performance but can fall short on specific user groups. One such example is popularity bias, where mainstream users receive better recommendations than niche content viewers. To improve recommendation quality across diverse user groups, we explore three Distributionally Robust Optimization (DRO) methods: Group DRO, Streaming DRO, and Conditional Value at Risk (CVaR) DRO. While Group and Streaming DRO rely on group annotations and struggle with users belonging to multiple groups, CVaR does not require such annotations and can naturally handle overlapping groups. In experiments on two real-world datasets, we show that the DRO methods outperform standard training, with CVaR delivering the best results. Additionally, we find that Group and Streaming DRO are sensitive to the choice of group used for loss computation. Our contributions include (i) a novel application of CVaR to recommenders, (ii) showing that the DRO methods improve group metrics as well as overall performance, and (iii) demonstrating CVaR’s effectiveness in the practical scenario of intersecting user groups. Our code is available at https://github.com/krishnacharya/sequentialrec-fairness.

Krishna Acharya, David Wardrope, Timos Korres, Aleksandr V. Petrov, Anders Uhrenholt
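The CVaR objective mentioned above can be summarized in a few lines: instead of averaging all per-user losses, only the worst (1 - alpha) fraction is averaged. The sketch below uses plain numpy on synthetic losses and is not the paper's SASRec training pipeline.

# Minimal sketch of a CVaR-DRO training objective over per-user losses.
import numpy as np

def cvar_loss(per_user_losses, alpha=0.9):
    """Average of the worst (1 - alpha) fraction of losses (CVaR at level alpha)."""
    losses = np.sort(per_user_losses)[::-1]                  # hardest users first
    k = max(1, int(np.ceil((1 - alpha) * len(losses))))
    return losses[:k].mean()

rng = np.random.default_rng(0)
losses = np.concatenate([rng.normal(0.5, 0.1, 900),          # mainstream users
                         rng.normal(1.5, 0.2, 100)])         # niche users, harder to fit

print("ERM  (uniform mean):", losses.mean().round(3))
print("CVaR (alpha = 0.9) :", cvar_loss(losses, 0.9).round(3))
# Optimizing the CVaR objective up-weights the tail of badly served users
# without needing explicit group annotations.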
Call for Research on the Impact of Information Retrieval on Social Norms

The information retrieval (IR) systems of major media platforms have a significant impact on social norms. Social norms contribute to the cultural identity of a society, but can also lead to people being marginalized, suffering from social pressure, and feeling inferior. For this reason, we call on the IR community to (1) contribute to the social sciences with computational means to study the impact of IR systems on social norms, and (2) to incorporate respective social science research findings into IR system development. To support our call, this paper presents a dataset and classification technology for investigating the prevalence of normative beauty ideals in multimodal (image and text) search results and recommendations. On a dataset comprising 928 annotated social media posts, in addition to determining the best classification model for the task, we examine how state-of-the-art zero-shot classifiers perform compared to fine-tuned models, and how multimodal models perform compared to unimodal variants. With 92% classification accuracy, a late fusion model with individually fine-tuned image and text representations achieves peak effectiveness, which is a promising first result for research in computational social science and on IR systems. To illustrate our work, we analyze the image search results pages of a major web search engine and report our findings. The code repository of our research is available at https://github.com/webis-de/ECIR-25 .

Tim Gollub, Pierre Achkar, Martin Potthast, Benno Stein
FlashCheck: Exploration of Efficient Evidence Retrieval for Fast Fact-Checking

The advances in digital tools have led to the rampant spread of misinformation. While fact-checking aims to combat this, manual fact-checking is cumbersome and not scalable. It is essential for automated fact-checking to be efficient to aid in combating misinformation in real time and at the source. Fact-checking pipelines primarily comprise a knowledge retrieval component, which extracts relevant knowledge to fact-check a claim from large knowledge sources like Wikipedia, and a verification component. The existing works primarily focus on the fact-verification part rather than evidence retrieval from large data collections, which often faces scalability issues for practical applications such as live fact-checking. In this study, we address this gap by exploring various methods for indexing a succinct set of factual statements from large collections like Wikipedia to enhance the retrieval phase of the fact-checking pipeline. We also explore the impact of vector quantization to further improve the efficiency of pipelines that employ dense retrieval approaches for first-stage retrieval. We study the efficiency and effectiveness of the approaches on fact-checking datasets such as HoVer and WiCE, leveraging Wikipedia as the knowledge source. We also evaluate the real-world utility of the efficient retrieval approaches by fact-checking the 2024 presidential debate, and we open-source the collection of claims with corresponding labels identified in the debate. Through a combination of indexed facts together with dense retrieval and index compression, we achieve up to a 10.0x speedup on CPUs and more than a 20.0x speedup on GPUs compared to classical fact-checking pipelines over large collections.

Kevin Nanhekhan, V. Venktesh, Erik Martin, Henrik Vatndal, Vinay Setty, Avishek Anand
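A minimal sketch of the kind of compressed first-stage dense retrieval discussed above, assuming precomputed embeddings of indexed factual statements and using FAISS product quantization (synthetic vectors stand in for the paper's actual encoder and collection):

# Index fact embeddings with inverted lists + product quantization, then
# retrieve the closest facts for an encoded claim. Toy data only.
import numpy as np
import faiss

d, n_facts, nlist, m = 768, 50_000, 128, 64        # 64 sub-quantizers of 12 dims, 8 bits each
facts = np.random.randn(n_facts, d).astype(np.float32)
faiss.normalize_L2(facts)                           # unit vectors: L2 ranking matches inner product

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8) # compressed, inverted-list index
index.train(facts)
index.add(facts)
index.nprobe = 16                                   # inverted lists probed per query

claim = np.random.randn(1, d).astype(np.float32)    # encoded claim to be fact-checked
faiss.normalize_L2(claim)
distances, fact_ids = index.search(claim, 10)       # ids of the 10 closest indexed facts
print(fact_ids[0])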
kANNolo: Sweet and Smooth Approximate k-Nearest Neighbors Search

Approximate Nearest Neighbors (ANN) search is a crucial task in several applications like recommender systems and information retrieval. Current state-of-the-art ANN libraries, although being performance-oriented, often lack modularity and ease of use. This makes them less suitable for the easy prototyping and testing of research ideas, an important capability to support. We address these limitations by introducing kANNolo, a novel—research-oriented—ANN library written in Rust and explicitly designed to combine usability with performance effectively. kANNolo is the first ANN library that supports dense and sparse vector representations made available on top of different similarity measures, e.g., Euclidean distance and inner product. Moreover, it also supports vector quantization techniques, e.g., Product Quantization, on top of the indexing strategies implemented. These functionalities are managed through Rust traits, allowing shared behaviors to be handled abstractly. This abstraction ensures flexibility and facilitates an easy integration of new components. In this work, we detail the architecture of kANNolo and demonstrate that its flexibility does not compromise performance. The experimental analysis shows that kANNolo achieves state-of-the-art performance in terms of speed-accuracy trade-off while allowing fast and easy prototyping, thus making kANNolo a valuable tool for advancing ANN research. Source code available on GitHub: https://github.com/TusKANNy/kannolo .

Leonardo Delfino, Domenico Erriquez, Silvio Martinico, Franco Maria Nardini, Cosimo Rulli, Rossano Venturini
TROPIC – Trustworthiness Rating of Online Publishers Through Online Interactions Calculation

Existing methods for assessing the trustworthiness of news publishers face high costs and scalability issues. The tool presented in this paper supports the efforts of specialized organizations by providing a solution that, starting from an online discussion, provides (i) trustworthiness ratings for previously unclassified news publishers and (ii) an interactive platform to guide annotation efforts and improve the robustness of the ratings. The system implements a novel framework for assessing the trustworthiness of online news publishers based on user interactions on social media platforms.

Manuel Pratelli, Fabio Saracco, Marinella Petrocchi
LS-Dashboard: A Tool for Monitoring and Analyzing Data Annotation in Machine Learning Classification Tasks

High-quality data annotation is critical for the success of machine learning models, particularly in supervised learning and large-scale projects involving multiple annotators. This paper introduces LS-Dashboard, a comprehensive analytical tool designed to evaluate and monitor annotation projects. LS-Dashboard integrates seamlessly with Label Studio, an open-source annotation platform, to provide real-time insights into annotator performance, task distribution, and agreement metrics. The tool consists of a Python package for backend data analysis and a Streamlit-based app for interactive visualizations. By utilizing two inverted index structures to efficiently retrieve annotation data, LS-Dashboard enables in-depth insights at both the annotation task and annotator levels. It supports classification tasks, offering customizable visualizations for detailed project monitoring.

Vinicius Monteiro de Lira, Peng Jiang
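As a minimal sketch of the two-inverted-index idea mentioned above (hypothetical field names; not the LS-Dashboard code), one index maps annotators to annotation ids and one maps tasks to annotation ids, so per-annotator and per-task lookups avoid scanning the full Label Studio export:

# Toy illustration of dual inverted indexes over annotation records.
from collections import defaultdict

annotations = [
    {"id": 1, "annotator": "alice", "task": 10, "label": "spam"},
    {"id": 2, "annotator": "bob",   "task": 10, "label": "ham"},
    {"id": 3, "annotator": "alice", "task": 11, "label": "ham"},
]

by_annotator, by_task = defaultdict(list), defaultdict(list)
for a in annotations:
    by_annotator[a["annotator"]].append(a["id"])
    by_task[a["task"]].append(a["id"])

print(by_annotator["alice"])   # all of alice's annotations: [1, 3]
print(by_task[10])             # everyone's annotations on task 10: [1, 2]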
SimplifyMyText: An LLM-Based System for Inclusive Plain Language Text Simplification

Text simplification is essential for making complex content accessible to diverse audiences who face comprehension challenges. Yet, the limited availability of simplified materials creates significant barriers to personal and professional growth and hinders social inclusion. Although researchers have explored various methods for automatic text simplification, none fully leverage large language models (LLMs) to offer tailored customization for different target groups and varying levels of simplicity. Moreover, despite its proven benefits for both consumers and organizations, the well-established practice of plain language remains underutilized. In this paper, we introduce https://simplifymytext.org, the first system designed to produce plain language content from multiple input formats, including typed text and file uploads, with flexible customization options for diverse audiences. We employ GPT-4 and Llama-3 and evaluate outputs across multiple metrics. Overall, our work contributes to research on automatic text simplification and highlights the importance of tailored communication in promoting inclusivity.

Michael Färber, Parisa Aghdam, Kyuri Im, Mario Tawfelis, Hardik Ghoshal
Sim4Rec: Flexible and Extensible Simulator for Recommender Systems for Large-Scale Data

Simulators for recommender systems are widely used for performance evaluation and for analyzing feedback loop effects. Existing simulators often propose inflexible pipelines, are focused on narrow research tasks, or are not adapted to work with large industrial data volumes. To address these challenges, we developed the Sim4Rec simulation framework. Sim4Rec models key aspects of the user-recommender system interaction process, such as user visits, item availability, user responses, and preference dynamics, using real and synthetic data, and provides additional functionality for generating synthetic users and items. The architecture of Sim4Rec is designed to be flexible and extensible to suit particular users’ needs and to support experiments on large-scale industrial datasets.

Anna Volodkevich, Veronika Ivanova, Alexey Vasilev, Dmitry Bugaychenko, Maxim Savchenko
TimIR: Time-Traveling Through IR History

In live systems, where the underlying document corpora evolve frequently, a query executed at two different points in time can yield two different result sets. Although not important in traditional web search settings, domains such as patent retrieval or systematic literature reviews rely on the time of execution to obtain the relevant result set. To ensure reproducibility and auditability of their search results, researchers in these fields usually rely on Boolean Retrieval. This is because sparse and dense retrieval methods do not satisfy this requirement, as sparse retrieval relies on global term and document statistics, and dense retrieval relies on document embeddings. These values and vectors are subject to change if the document corpora are updated and therefore change the ranking as well. In this paper we present TimIR ( https://timir.ds-ifs.tuwien.ac.at ), a dashboard to explore the evolution of Information Retrieval publications over time, while showcasing a hybrid retrieval system that allows researchers to recreate sparse rankings for historical states of the document corpora. Having search result lists that can be recreated, with a system like TimIR for example, makes it possible to cite such sets of data, especially when the results are used for further down-the-stream research, without the impracticality of additionally storing the search results themselves. Furthermore, it is possible to compare rankings for a particular query over time, or to explore the literature available during a specific time period, for example the time when Karen Spärck Jones introduced TF-IDF.

Moritz Staudinger, Wojciech Kusa, Florina Piroi, Andreas Rauber, Allan Hanbury
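The sketch below illustrates why sparse rankings drift as a corpus evolves and how freezing a historical snapshot's collection statistics makes a ranking reproducible; it uses a toy corpus and a plain BM25 implementation, not the TimIR codebase.

# BM25 computed strictly from the statistics of one corpus snapshot.
import math
from collections import Counter

def bm25_ranking(query, docs, k1=1.2, b=0.75):
    """Rank docs against query using document frequencies, N and avgdl of this snapshot only."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if df[t] == 0:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return sorted(range(N), key=lambda i: -scores[i])

snapshot_1995 = [["tf", "idf", "weighting"], ["boolean", "retrieval"], ["probabilistic", "model"]]
snapshot_2005 = snapshot_1995 + [["tf", "idf", "neural"], ["tf", "learning", "to", "rank"]]

query = ["tf", "idf"]
print("ranking on 1995 snapshot:", bm25_ranking(query, snapshot_1995))
print("ranking on 2005 snapshot:", bm25_ranking(query, snapshot_2005))
# N, df and avgdl differ between snapshots, so scores and rankings can change;
# storing the snapshot statistics is what makes the old ranking recreatable.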
Prabodhini: Making Large Language Models Inclusive for Low-Text Literate Users

The usage of large language models is restricted to people with quality education and high text literacy. In the absence of any intervention, the power of LLMs is likely to be exploited by empowered and elite groups in society. We aim to make access to LLMs more equitable by enabling low-text-literate users to interact with LLM-backed Conversational AI systems through iterative question answering over voice. We study this in the context of queries related to Social Welfare Schemes in India. We introduce actionable information retrieval (AIR), a system that improves accessibility for low-text-literate users by guiding them through queries via an interactive flowchart of yes/no questions. This approach enhances user engagement, progressively leading users to precise answers without text-dense responses. We demonstrate these functionalities through Prabodhini, a light-weight, mobile-friendly application that uses Retrieval Augmented Generation (RAG) with chain-of-thought prompting over GPT-4o to retrieve personalized responses. Our pilot study, performed with low-text-literate users comprising the housekeeping, gardening, house-help, and security staff at BITS Pilani Hyderabad campus, indicates a high level of user satisfaction, with positive feedback reported across all participants regarding the app’s usability, design, and functionality. A link to the demonstration video can be found here; links to the code, datasets, and evaluations are here.

Vivan Jain, Srivant Vishnuvajjala, Pranathi Voora, Bhaskar Ruthvik Bikkina, Bharghavaram Boddapati, C. R. Chaitra, Dipanjan Chakraborty, Prajna Upadhyay
Correction to: Enabling Low-Resource Language Retrieval: Establishing Baselines for Urdu MS MARCO
Umer Butt, Stalin Varanasi, Günter Neumann
Backmatter
Metadata
Title
Advances in Information Retrieval
Edited by
Claudia Hauff
Craig Macdonald
Dietmar Jannach
Gabriella Kazai
Franco Maria Nardini
Fabio Pinelli
Fabrizio Silvestri
Nicola Tonellotto
Copyright Year
2025
Electronic ISBN
978-3-031-88717-8
Print ISBN
978-3-031-88716-1
DOI
https://doi.org/10.1007/978-3-031-88717-8