
2025 | Book

Advances in Information Retrieval

47th European Conference on Information Retrieval, ECIR 2025, Lucca, Italy, April 6–10, 2025, Proceedings, Part II

Edited by: Claudia Hauff, Craig Macdonald, Dietmar Jannach, Gabriella Kazai, Franco Maria Nardini, Fabio Pinelli, Fabrizio Silvestri, Nicola Tonellotto

Publisher: Springer Nature Switzerland

Book series: Lecture Notes in Computer Science


About this book

The five-volume set LNCS 15572, 15573, 15574, 15575, and 15576 constitutes the refereed proceedings of the 47th European Conference on Information Retrieval (ECIR 2025), held in Lucca, Italy, from April 6 to 10, 2025. The 52 full papers, 11 findings, 42 short papers, and 76 papers of other types presented in these proceedings were carefully reviewed and selected from 530 submissions. The accepted papers cover the current state of information retrieval and recommender systems: user aspects, system and foundational aspects, artificial intelligence and machine learning, applications, evaluation, new social and technical challenges, and other topics of direct or indirect relevance to search and recommendation.

Table of Contents

Frontmatter
Set-Encoder: Permutation-Invariant Inter-passage Attention for Listwise Passage Re-ranking with Cross-Encoders
Abstract
Existing cross-encoder models can be categorized as pointwise, pairwise, or listwise. Pairwise and listwise models allow passage interactions, which typically makes them more effective than pointwise models but less efficient and less robust to input passage order permutations. To enable efficient permutation-invariant passage interactions during re-ranking, we propose a new cross-encoder architecture with inter-passage attention: the Set-Encoder. In experiments on TREC Deep Learning and TIREx, the Set-Encoder is as effective as state-of-the-art listwise models while being more efficient and invariant to input passage order permutations. Compared to pointwise models, the Set-Encoder is particularly more effective when considering inter-passage information, such as novelty, and retains its advantageous properties compared to other listwise models. Our code is publicly available at https://github.com/webis-de/ECIR-25.
Ferdinand Schlatt, Maik Fröbe, Harrisen Scells, Shengyao Zhuang, Bevan Koopman, Guido Zuccon, Benno Stein, Martin Potthast, Matthias Hagen
Patent Figure Classification Using Large Vision-Language Models
Abstract
Patent figure classification facilitates faceted search in patent retrieval systems, enabling efficient prior art search. Existing approaches have explored patent figure classification for only a single aspect and for aspects with a limited number of concepts. In recent years, large vision-language models (LVLMs) have shown tremendous performance across numerous computer vision downstream tasks; however, they remain unexplored for patent figure classification. Our work explores the efficacy of LVLMs in patent figure visual question answering (VQA) and classification, focusing on zero-shot and few-shot learning scenarios. For this purpose, we adapt existing patent figure datasets to create new datasets, PatFigVQA and PatFigCLS, suitable for fine-tuning and evaluation regarding multiple aspects of patent figures (i.e., type, projection, patent class, and objects). For computationally efficient handling of a large number of classes using LVLMs, we propose a novel tournament-style classification strategy that leverages a series of multiple-choice questions. Experimental results and comparisons of multiple classification approaches based on LVLMs and Convolutional Neural Networks (CNNs) in few-shot settings show the feasibility of the proposed approaches.
Sushil Awale, Eric Müller-Budack, Ralph Ewerth
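The tournament-style strategy described in the abstract of "Patent Figure Classification Using Large Vision-Language Models" can be pictured roughly as follows. This is only a minimal sketch under stated assumptions, not the authors' implementation: the `choose` callback is a hypothetical stand-in for asking an LVLM one multiple-choice question, and the group size and bracket layout are illustrative choices.

```python
from typing import Callable, List, Sequence


def tournament_classify(
    labels: Sequence[str],
    choose: Callable[[List[str]], str],
    group_size: int = 4,
) -> str:
    """Narrow a large label set to one label via rounds of small multiple-choice questions.

    `choose` is a hypothetical stand-in for an LVLM call: it receives a short
    list of options and must return exactly one of them.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    if group_size < 2:
        raise ValueError("group_size must be at least 2")

    candidates = list(labels)
    while len(candidates) > 1:
        winners: List[str] = []
        # Split the remaining candidates into groups and ask one question per group.
        for start in range(0, len(candidates), group_size):
            group = candidates[start:start + group_size]
            if len(group) == 1:
                # A lone candidate advances without a question.
                winners.append(group[0])
                continue
            winner = choose(group)
            if winner not in group:
                raise ValueError(f"choose() returned an option outside the group: {winner!r}")
            winners.append(winner)
        candidates = winners
    return candidates[0]
```

Each question only ever shows `group_size` options, so the prompt stays short even when the label set is large; the price is several rounds of questions instead of one.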
Efficient Session Retrieval Using Topical Index Shards
Abstract
Retrieval is often considered one query at a time. However, in practice, queries regularly come in the context of sessions with coherent topics. By dividing a collection into topical index shards and matching the topical context of a session with the right shards, we may reduce the amount of resources required for answering each query. We consider two alternatives: (1) starting with exhaustive search and pruning unnecessary shards after each session turn, and (2) applying a resource selection algorithm to pre-select shards at the start of the session.
We empirically evaluate our approaches on a conversational search dataset (CAsT), and compare effectiveness and resource usage against exhaustive retrieval. Our experiments show that both approaches reduce the number of postings necessary to fulfill a search request (by 50–80%), and in terms of effectiveness our systems are statistically indistinguishable from a system performing exhaustive retrieval.
Gijs Hendriksen, Djoerd Hiemstra, Arjen P. de Vries
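Alternative (1) from the abstract of "Efficient Session Retrieval Using Topical Index Shards" (start exhaustively, then prune shards after each session turn) might look roughly like the toy sketch below. The abstract does not state the pruning criterion, so the rule used here (keep only shards that contributed to the top-k results of the latest turn) is an assumption, as are the in-memory shard layout and the term-overlap scoring.

```python
from typing import Dict, List, Set, Tuple

# Toy shard layout (an assumption): shard_id -> {doc_id -> set of terms}.
Shard = Dict[str, Set[str]]
Hit = Tuple[str, str, int]  # (shard_id, doc_id, score)


def search(shards: Dict[str, Shard], active: Set[str], query: Set[str], k: int) -> List[Hit]:
    """Score documents in the active shards by term overlap and return the top k."""
    hits: List[Hit] = []
    for shard_id in sorted(active):
        for doc_id, terms in shards[shard_id].items():
            score = len(query & terms)
            if score > 0:
                hits.append((shard_id, doc_id, score))
    hits.sort(key=lambda hit: (-hit[2], hit[1]))
    return hits[:k]


def run_session(
    shards: Dict[str, Shard], queries: List[Set[str]], k: int = 2
) -> Tuple[List[List[Hit]], Set[str]]:
    """Answer each turn of a session, pruning shards that did not contribute."""
    active = set(shards)  # the first turn searches every shard (exhaustive)
    results: List[List[Hit]] = []
    for query in queries:
        top = search(shards, active, query, k)
        results.append(top)
        contributing = {shard_id for shard_id, _, _ in top}
        if contributing:
            # Assumed pruning rule: later turns only visit contributing shards.
            active = contributing
    return results, active
```

If a turn returns nothing, the active set is left unchanged, so an off-topic query does not empty the session's shard selection.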
Feature Attribution Explanations of Session-Based Recommendations
Abstract
Session-based recommender systems often use black-box models to dynamically make recommendations based on current session interactions. However, explaining why an item is recommended is required under various legislations to increase the interpretability of the system. A popular approach for explaining recommendations is additive feature attribution, which assigns a score of how much each feature contributes to the prediction, such that the sum of the scores equals the prediction score. We identify two limitations in applying additive feature attribution to interpret session-based recommendations: 1. Additive feature attribution does not model the attribution of sequential dependencies in the session interactions learned by the recommendation model; it assumes that interactions occur independently of each other. 2. As additive feature attribution relies on independent features, it fails when the features are correlated due to repeated interactions in sessions. We empirically verify the impact of these limitations upon explanation faithfulness. We further address these limitations by presenting a simple occlusion-based feature attribution approach that is specifically tailored to session-based recommendations. Our method computes joint feature attributions for sets of interactions with sequential dependencies and sets of repeated interactions to account for their non-linear relations. Experimental results on multiple datasets and models confirm that our method is more faithful and stable than state-of-the-art attribution-based explanation methods. Our code and additional analyses are publicly available at https://github.com/simonebbruun/explaining_session_based_RSs.
Simone Borg Bruun, Maria Maistro, Christina Lioma
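To make the idea of joint occlusion in "Feature Attribution Explanations of Session-Based Recommendations" more concrete, the sketch below attributes a prediction score to groups of repeated interactions by removing all occurrences of an item at once. It is a simplified illustration, not the authors' method: the `score` callback is a hypothetical black-box recommender score for one target item, and only the repeated-interaction case is covered (sequential dependencies are left out).

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence


def occlusion_attributions(
    session: Sequence[str],
    score: Callable[[List[str]], float],
) -> Dict[str, float]:
    """Attribute a score to each distinct item by occluding all its occurrences jointly.

    `score` is a hypothetical stand-in for a recommender's prediction score
    for a fixed target item, given the session interactions.
    """
    full_score = score(list(session))

    # Group the positions of repeated interactions so they are occluded together.
    positions: Dict[str, List[int]] = defaultdict(list)
    for index, item in enumerate(session):
        positions[item].append(index)

    attributions: Dict[str, float] = {}
    for item, indices in positions.items():
        hidden = set(indices)
        occluded = [x for i, x in enumerate(session) if i not in hidden]
        # Attribution = drop in score when the whole group is removed.
        attributions[item] = full_score - score(occluded)
    return attributions


def toy_score(session: List[str]) -> float:
    """Toy non-additive model (an assumption): a repeated 'phone' counts only once."""
    return (1.0 if "phone" in session else 0.0) + 0.25 * session.count("charger")
```

With `toy_score` and the session `["phone", "charger", "phone"]`, occluding a single "phone" leaves the score unchanged because the other occurrence remains, whereas occluding both together reveals the item's contribution. This is the kind of correlated-feature case the abstract describes.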
Evaluating Sequential Recommendations in the Wild: A Case Study on Offline Accuracy, Click Rates, and Consumption
Abstract
Sequential recommendation problems have received increased research interest in recent years. Our knowledge about the effectiveness of sequential algorithms in practice is however limited. In this paper, we report on the outcomes of an A/B test on a video and movie streaming platform, where we benchmarked a sequential model against a non-sequential, personalized recommendation model, as well as a popularity-based baseline. Contrary to what we had expected from a preceding offline experiment, we observed that the popularity-based and the non-sequential models led to the highest click-through rates. However, in terms of the adoption of the recommendations, the sequential model was the most successful one in terms of viewing times. While our work points out the effectiveness of sequential models in practice, it also reminds us about important open challenges regarding (a) the sometimes limited predictive power of classic offline evaluations and (b) the dangers of optimizing recommendation models for click-through-rates.
Anastasiia Klimashevskaia, Snorre Alvsvåg, Christoph Trattner, Alain D. Starke, Astrid Tessem, Dietmar Jannach
Graph-Convolutional Networks: Named Entity Recognition and Large Language Model Embedding in Document Clustering
Abstract
Recent advances in machine learning, particularly Large Language Models (LLMs) such as BERT and GPT, provide rich contextual embeddings that improve text representation. However, current document clustering approaches often ignore the deeper relationships between named entities and the potential of LLM embeddings. This paper proposes a novel approach that integrates Named Entity Recognition (NER) and LLM embeddings within a graph-based framework for document clustering. The method builds a graph with nodes representing documents and edges weighted by named entity similarity, optimized using a graph-convolutional network (GCN). This ensures a more effective grouping of semantically related documents. Experimental results indicate that our approach outperforms conventional co-occurrence-based methods in clustering, notably for documents rich in named entities.
Imed Keraghel, Mohamed Nadif
Exploring the Relationship Between Listener Receptivity and Source of Music Recommendations
Abstract
Music recommender systems are utilised by many music streaming platforms to provide new artist and song recommendations on a personalised basis to listeners. By applying dynamic data modelling techniques, music recommender systems add to the current ecosystem of music recommendations. This ecosystem includes direct and indirect sources, such as word of mouth, music journalism, TV, radio, and live events. We report results from a study designed to investigate listener receptivity to music recommender systems, compared to editorial and peer-based recommendations. Our results suggest participants’ self-reported receptivity is significantly greater for music recommender systems than for editorial and peer-based recommendations. However, results from participants’ evaluation of playlists perceived to be created by each recommender source suggest a significantly greater duration of play for peer-based recommendation playlists. No difference was found in likelihood to spend time or money on artists when only the recommendation source was considered. We discuss these results in relation to how anchoring bias may influence listeners’ behaviours and how platform design may be informed based on their requirements and objectives.
John Paul Vargheese, Marianne Wilson, Katherine Stephen, Rachel Salzano, David Brazier
News Without Borders: Domain Adaptation of Multilingual Sentence Embeddings for Cross-Lingual News Recommendation
Abstract
Rapidly growing numbers of multilingual news consumers pose an increasing challenge to news recommender systems in terms of providing customized recommendations. First, existing neural news recommenders, even when powered by multilingual language models (LMs), suffer substantial performance losses in zero-shot cross-lingual transfer (ZS-XLT). Second, the current paradigm of fine-tuning the backbone LM of a neural recommender on task-specific data is computationally expensive and infeasible in few-shot recommendation and cold-start setups, where data are scarce or completely unavailable. In this work, we propose a news-adapted sentence encoder (NaSE), domain-specialized from a pretrained massively multilingual sentence encoder (SE). To this end, we compile and leverage PolyNews and PolyNewsParallel, two multilingual news-specific corpora. With the news-adapted multilingual SE in place, we test the effectiveness of (i.e., question the need for) supervised fine-tuning for news recommendation, and propose a simple and strong baseline based on (i) frozen NaSE embeddings and (ii) late click behavior fusion. We show that NaSE achieves state-of-the-art performance in ZS-XLT in true cold-start and few-shot news recommendation.
Andreea Iana, Fabian David Schmidt, Goran Glavaš, Heiko Paulheim
Maybe You Are Looking for CroQS: Cross-Modal Query Suggestion for Text-to-Image Retrieval
Abstract
Query suggestion, a technique widely adopted in information retrieval, enhances system interactivity and the browsing experience of document collections. In cross-modal retrieval, many works have focused on retrieving relevant items from natural language queries, while few have explored query suggestion solutions. In this work, we address query suggestion in cross-modal retrieval, introducing a novel task that focuses on suggesting minimal textual modifications needed to explore visually consistent subsets of the collection, following the premise of “Maybe you are looking for”. To facilitate the evaluation and development of methods, we present a tailored benchmark named CroQS. This dataset comprises initial queries, grouped result sets, and human-defined suggested queries for each group. We establish dedicated metrics to rigorously evaluate the performance of various methods on this task, measuring representativeness, cluster specificity, and similarity of the suggested queries to the original ones. Baseline methods from related fields, such as image captioning and content summarization, are adapted for this task to provide reference performance scores. Although relatively far from human performance, our experiments reveal that both LLM-based and captioning-based methods achieve competitive results on CroQS, improving the recall on cluster specificity by more than 115% and representativeness mAP by more than 52% with respect to the initial query. The dataset, the implementation of the baseline methods and the notebooks containing our experiments are available here: paciosoft.com/CroQS-benchmark/.
Giacomo Pacini, Fabio Carrara, Nicola Messina, Nicola Tonellotto, Giuseppe Amato, Fabrizio Falchi
Evaluating LLM Abilities to Understand Tabular Electronic Health Records: A Comprehensive Study of Patient Data Extraction and Retrieval
Abstract
Electronic Health Record (EHR) tables pose unique challenges among which is the presence of hidden contextual dependencies between medical features with a high level of data dimensionality and sparsity. This study presents the first investigation into the abilities of LLMs to comprehend EHRs for patient data extraction and retrieval. We conduct extensive experiments using the MIMICSQL dataset to explore the impact of the prompt structure, instruction, context, and demonstration, of two backbone LLMs, Llama2 and Meditron, based on task performance. Through quantitative and qualitative analyses, our findings show that optimal feature selection and serialization methods can enhance task performance by up to 26.79% compared to naive approaches. Similarly, in-context learning setups with relevant example selection improve data extraction performance by 5.95%. Based on our study findings, we propose guidelines that we believe would help the design of LLM-based models to support health search.
Jesús Lovón-Melgarejo, Martin Mouysset, Jo Oleiwan, Jose G. Moreno, Christine Damase-Michel, Lynda Tamine
MVAM: Multi-View Attention Method for Fine-Grained Image-Text Matching
Abstract
Existing two-stream models, such as CLIP, encode images and text as independent representations. Because they show good performance while ensuring retrieval speed, they have attracted attention from industry and academia. However, the single representation often struggles to capture complex content fully. Such models may ignore fine-grained information during matching, resulting in suboptimal retrieval results. To overcome this limitation and enhance the performance of two-stream models, we propose a Multi-View Attention Method (MVAM) for image-text matching. This approach leverages diverse attention heads with unique view codes to learn multiple representations for images and text, which are then concatenated for matching. We also incorporate a diversity objective to explicitly encourage attention heads to focus on distinct aspects of the input data, capturing complementary fine-grained details. This diversity enables the model to represent image-text pairs from multiple perspectives, ensuring a more comprehensive understanding and alignment of critical content. Our method allows models to encode images and text from different perspectives and focus on more critical details, leading to better matching performance. Our experiments on MSCOCO and Flickr30K demonstrate enhancements over existing models, and further case studies reveal that different attention heads can focus on distinct content, achieving more comprehensive representations.
Wanqing Cui, Rui Cheng, Jiafeng Guo, Xueqi Cheng
An Investigation of Prompt Variations for Zero-Shot LLM-Based Rankers
Abstract
We provide a systematic understanding of the impact of specific components and wordings used in prompts on the effectiveness of rankers based on zero-shot Large Language Models (LLMs). Several zero-shot ranking methods based on LLMs have recently been proposed. Among many aspects, methods differ across (1) the ranking algorithm they implement, e.g., pointwise vs. listwise, (2) the backbone LLMs used, e.g., GPT3.5 vs. FLAN-T5, (3) the components and wording used in prompts, e.g., the use or not of role-definition (role-playing) and the actual words used to express this. It is currently unclear whether performance differences are due to the underlying ranking algorithm, or because of spurious factors such as better choice of words used in prompts. This confusion risks undermining future research. Through our large-scale experimentation and analysis, we find that ranking algorithms do contribute to differences between methods for zero-shot LLM ranking. However, so do the LLM backbones, and, even more importantly, the choice of prompt components and wordings affects the ranking. In fact, in our experiments, we find that, at times, these latter elements have more impact on the ranker’s effectiveness than the actual ranking algorithms, and that differences among ranking methods become more blurred when prompt variations are considered.
Shuoqi Sun, Shengyao Zhuang, Shuai Wang, Guido Zuccon
Query Performance Prediction Using Dimension Importance Estimators
Abstract
Query Performance Prediction (QPP) tends to fall short when predicting the performance of dense Information Retrieval (IR) systems. Therefore, the research community is investigating QPP approaches designed to synergize with this class of state-of-the-art IR models. At the same time, recent advances concerning dense IR have shown that we can improve the retrieval performance by projecting embeddings in a (query-wise) optimal linear subspace of the dense representation space. The Dimension IMportance Estimation (DIME) framework was proposed to identify such optimal subspaces on a query-by-query basis. In this paper, we illustrate how to design QPP models that rely on measuring the alignment between the query and document representations and the optimal DIME dimensions, based on the hypothesis that good alignment indicates better retrieval performance. We experimentally evaluate the proposed QPPs, showing that our approach outperforms the state-of-the-art when predicting the performance of two commonly used dense encoders, Contriever and TAS-B, on two popular TREC collections, Deep Learning 2019 and 2020.
Guglielmo Faggioli, Nicola Ferro, Raffaele Perego, Nicola Tonellotto
Uncertainty Estimation in the Real World: A Study on Music Emotion Recognition
Abstract
Any data annotation for subjective tasks shows potential variations between individuals. This is particularly true for annotations of emotional responses to musical stimuli. While older approaches to music emotion recognition systems frequently addressed this uncertainty problem through probabilistic modeling, modern systems based on neural networks tend to ignore the variability and focus only on predicting central tendencies of human subjective responses. In this work, we explore several methods for estimating not only the central tendencies of the subjective responses to a musical stimulus, but also for estimating the uncertainty associated with these responses. In particular, we investigate probabilistic loss functions and inference-time random sampling. Experimental results indicate that while the modeling of the central tendencies is achievable, modeling of the uncertainty in subjective responses proves significantly more challenging with currently available approaches even when empirical estimates of variations in the responses are available.
Karn N. Watcharasupat, Yiwei Ding, T. Aleksandra Ma, Pavan Seshadri, Alexander Lerch
Rank-Without-GPT: Building GPT-Independent Listwise Rerankers on Open-Source Large Language Models
Abstract
Listwise rerankers based on large language models (LLMs) are the zero-shot state of the art. However, all current work in this direction depends on GPT models, making them a single point of failure in scientific reproducibility. In this work, we lift this pre-condition and build effective listwise rerankers without any form of dependency on GPT for the first time. Our passage retrieval experiments show that our best listwise reranker surpasses the listwise rerankers based on GPT-3.5 by 13% and achieves 97% of the effectiveness of those based on GPT-4. Our results also show that the existing training datasets, which were expressly constructed for pointwise ranking, are insufficient for building such listwise rerankers. Instead, high-quality listwise ranking data is required and crucial, calling for further work on building human-annotated listwise data resources.
Crystina Zhang, Sebastian Hofstätter, Patrick Lewis, Raphael Tang, Jimmy Lin
Semi-Supervised Image-Based Narrative Extraction: A Case Study with Historical Photographic Records
Abstract
This paper presents a semi-supervised approach to extracting narratives from historical photographic records using an adaptation of the narrative maps algorithm. We extend the original unsupervised text-based method to work with image data, leveraging deep learning techniques for visual feature extraction and similarity computation. Our method is applied to the ROGER dataset, a collection of photographs from the 1928 Sacambaya Expedition in Bolivia captured by Robert Gerstmann. We compare our algorithmically extracted visual narratives with expert-curated timelines of varying lengths (5 to 30 images) to evaluate the effectiveness of our approach. In particular, we use the Dynamic Time Warping (DTW) algorithm to match the extracted narratives with the expert-curated baseline. In addition, we asked an expert on the topic to qualitatively evaluate a representative example of the resulting narratives. Our findings show that the narrative maps approach generally outperforms random sampling for longer timelines (10+ images, p < 0.05), with expert evaluation confirming the historical accuracy and coherence of the extracted narratives. This research contributes to the field of computational analysis of visual cultural heritage, offering new tools for historians, archivists, and digital humanities scholars to explore and understand large-scale image collections. The method’s ability to generate meaningful narratives from visual data opens up new possibilities for the study and interpretation of historical events through photographic evidence. Source code and experiments available on GitHub.
Fausto German, Brian Keith, Mauricio Matus, Diego Urrutia, Claudio Meneses
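The abstract of "Semi-Supervised Image-Based Narrative Extraction" mentions Dynamic Time Warping (DTW) for matching extracted narratives against expert-curated timelines. For readers unfamiliar with it, a textbook DTW distance is sketched below. How the authors represent images and define the pairwise distance is not stated in the abstract, so the generic `dist` callback and the numeric toy sequences are assumptions.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def dtw_distance(
    a: Sequence[T],
    b: Sequence[T],
    dist: Callable[[T, T], float],
) -> float:
    """Classic dynamic-programming DTW: cost of the cheapest monotone alignment of a and b."""
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")

    inf = float("inf")
    # cost[i][j] = best alignment cost of the first i elements of a and first j of b.
    cost: List[List[float]] = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = dist(a[i - 1], b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b stays (b[j-1] is reused)
                cost[i][j - 1],      # b advances, a stays (a[i-1] is reused)
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def abs_diff(x: float, y: float) -> float:
    """Toy element distance (an assumption) for numeric sequences."""
    return abs(x - y)
```

Because DTW allows one element to align with several elements of the other sequence, timelines of different lengths can be compared directly; `dtw_distance([1, 2, 3], [1, 2, 2, 3], abs_diff)` is zero, for example, since the repeated 2 aligns with the single 2.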
LLM is Knowledge Graph Reasoner: LLM’s Intuition-Aware Knowledge Graph Reasoning for Cold-Start Sequential Recommendation
Abstract
Knowledge Graphs (KGs) represent relationships between entities in a graph structure and have been widely studied as promising tools for realizing recommendations that consider the accurate content information of items. However, traditional KG-based recommendation methods face fundamental challenges: insufficient consideration of temporal information and poor performance in cold-start scenarios. On the other hand, Large Language Models (LLMs) can be considered databases with a wealth of knowledge learned from the web data, and they have recently gained attention due to their potential application as recommendation systems. Although approaches that treat LLMs as recommendation systems can leverage LLMs’ high recommendation literacy, their input token limitations make it impractical to consider the entire recommendation domain dataset and result in scalability issues. To address these challenges, we propose a LLM’s Intuition-aware Knowledge graph Reasoning model (LIKR). Our main idea is to treat LLMs as reasoners that output intuitive exploration strategies for KGs. To integrate the knowledge of LLMs and KGs, we trained a recommendation agent through reinforcement learning using a reward function that integrates different recommendation strategies, including LLM’s intuition and KG embeddings. By incorporating temporal awareness through prompt engineering and generating textual representations of user preferences from limited interactions, LIKR can improve recommendation performance in cold-start scenarios. Furthermore, LIKR can avoid scalability issues by using KGs to represent recommendation domain datasets and limiting the LLM’s output to KG exploration strategies. Experiments on real-world datasets demonstrate that our model outperforms state-of-the-art recommendation methods in cold-start sequential recommendation scenarios.
Keigo Sakurai, Ren Togo, Takahiro Ogawa, Miki Haseyama
PEIR: Modeling Performance in Neural Information Retrieval
Abstract
The efficiency of neural information retrieval methods is primarily evaluated by measuring query latency. In practice, measuring latency is highly tied to hardware configurations and requires extensive computational resources. Given the rapid introduction of retrieval models, achieving an overall comparison of their efficiency is challenging. In this paper, we introduce PEIR, a framework for hardware-independent efficiency measurements in Learned Sparse Retrieval (LSR). By employing performance modeling approaches from high-performance computing, we derive performance models for query evaluation approaches such as BlockMax-MaxScore (BMM) and propose to measure memory and/or floating-point operations while performing retrieval on input queries. We demonstrate that by using PEIR, similar conclusions on comparing the latency of retrieval models are obtained.
Pooya Khandel, Andrew Yates, Ana-Lucia Varbanescu, Maarten de Rijke, Andy Pimentel
mFollowIR: A Multilingual Benchmark for Instruction Following in Retrieval
Abstract
Retrieval systems generally focus on web-style queries that are short and underspecified. Advances in language models have, however, facilitated the nascent rise of retrieval models that can understand more complex queries with diverse intents. These efforts have focused exclusively on English; therefore, we do not yet understand how they work across languages. We introduce mFollowIR, a multilingual benchmark for measuring instruction-following ability in retrieval models. mFollowIR builds upon the TREC NeuCLIR narratives (or instructions) that span three diverse languages (Russian, Chinese, Persian), giving both query and instruction to the retrieval models. We make small changes to the narratives and isolate how well retrieval models can follow these nuanced changes. We present results for both multilingual (XX-XX) and cross-lingual (En-XX) performance. We see strong cross-lingual performance with English-based retrievers that were trained using instructions, but find a notable drop in performance in the multilingual setting, indicating that more work is needed in developing data for instruction-based multilingual retrievers. We release all code and data publicly at https://github.com/orionw/FollowIR.
Orion Weller, Benjamin Chang, Eugene Yang, Mahsa Yarmohammadi, Samuel Barham, Sean MacAvaney, Arman Cohan, Luca Soldaini, Benjamin Van Durme, Dawn Lawrie
Leveraging Retrieval-Augmented Generation for Keyphrase Synonym Suggestion
Abstract
One common challenge for users in complex search scenarios is formulating queries that return complete but also relevant results. A frequent issue is term mismatching, where key documents are unintentionally excluded due to differences in terminology, jargon, or phrasing used by authors. This mismatch can not only lead to null search sessions, where queries yield empty result sets, but it also risks overlooking relevant documents. To mitigate this, we propose leveraging Retrieval-Augmented Generation (RAG) to suggest alternative terminology to the user. Unlike traditional query expansion methods that focus on individual terms, our approach produces meaningful keyphrase-level suggestions. Academic and professional search users often use keyphrases to formulate their queries (e.g., “information retrieval” or “natural language generation”), and, by completing these query clauses with synonyms, we aim to retrieve a broader set of relevant documents. In particular, we focus on generating disjunctive clauses for boolean queries, the standard format in complex search engines, allowing the inclusion of concept variations within a single query. Experimental results demonstrate that these keyphrase-based suggestions significantly improve retrieval effectiveness, helping users receive more appropriate results without missing relevant documents due to keyphrase mismatch. (Paper code: github.com/JorgeGabin/RAKS).
Jorge Gabín, Javier Parapar
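The disjunctive clauses mentioned in the abstract of "Leveraging Retrieval-Augmented Generation for Keyphrase Synonym Suggestion" can be illustrated with a small query builder. This sketch only shows the boolean query format, not the paper's RAG pipeline: the `suggestions` mapping is a hypothetical placeholder for generated synonyms, the example synonym is made up, and joining clauses with AND is an assumption.

```python
from typing import Dict, List, Sequence


def disjunctive_clause(keyphrase: str, synonyms: Sequence[str]) -> str:
    """Build one OR-clause from a keyphrase and its suggested variants, without duplicates."""
    variants: List[str] = []
    seen = set()
    for variant in [keyphrase, *synonyms]:
        cleaned = variant.strip()
        key = cleaned.lower()
        if cleaned and key not in seen:
            seen.add(key)
            variants.append(cleaned)
    return "(" + " OR ".join(f'"{v}"' for v in variants) + ")"


def build_query(keyphrases: Sequence[str], suggestions: Dict[str, List[str]]) -> str:
    """Join one disjunctive clause per keyphrase with AND (an assumed query shape)."""
    return " AND ".join(
        disjunctive_clause(phrase, suggestions.get(phrase, [])) for phrase in keyphrases
    )


# Hypothetical suggestions standing in for the output of a synonym generator.
example = build_query(
    ["information retrieval", "natural language generation"],
    {"information retrieval": ["document retrieval", "Information Retrieval"]},
)
```

Here `example` is `("information retrieval" OR "document retrieval") AND ("natural language generation")`: the case-insensitive duplicate is dropped, and a keyphrase without suggestions keeps its original single-term clause.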
Can Large Language Models Effectively Rerank News Articles for Background Linking?
Abstract
News background linking is the problem of finding news articles that provide context or background on the news reported in a given news article. It is challenging as, compared to news recommendation, the newsreader is assumed to be anonymous. To date, the most effective approach to tackle this problem is a brute-force approach in which the entire news article is issued as an ad-hoc search query to retrieve the background links; however, it is still far from being optimal. Motivated by the success of Large Language Models (LLMs) in several tasks, and in particular reranking of texts, in this work, we explore the potential of using LLMs in reranking a candidate set of news articles retrieved by the full-article search approach. We propose a novel reranking approach that adopts prompt chaining with the LLM to first analyze the query article and its candidate links, then rerank a list of guided summaries of those candidates. Our findings show that aggregating the ranks we got through our proposed approach using GPT-4 Turbo LLM with the original ranks of the candidates results in a statistically-significant improvement over the state-of-the-art (SOTA) baseline, establishing a new SOTA performance for the task.
Marwa Essam, Tamer Elsayed
OKRA: An Explainable, Heterogeneous, Multi-stakeholder Job Recommender System
Abstract
The use of recommender systems in the recruitment domain has been labeled as ‘high-risk’ in recent legislation. As a result, strict requirements regarding explainability and fairness have been put in place to ensure proper treatment of all involved stakeholders. To allow for stakeholder-specific explainability, while also handling highly heterogeneous recruitment data, we propose a novel explainable multi-stakeholder job recommender system using graph neural networks: the Occupational Knowledge-based Recommender using Attention (OKRA). The proposed method is capable of providing both candidate- and company-side recommendations and explanations. We find that OKRA performs substantially better than six baselines in terms of nDCG for two datasets. Furthermore, we find that the tested models show a bias toward candidates and vacancies located in urban areas. Overall, our findings suggest that OKRA provides a balance between accuracy, explainability, and fairness.
Roan Schellingerhout, Francesco Barile, Nava Tintarev
CUP: A Framework for Resource-Efficient Review-Based Recommenders
Abstract
Recommender systems perform well for popular items and users with ample interactions (likes, ratings etc.). This work addresses the difficult and underexplored case of users who have very sparse interactions but post informative review texts. This setting naturally calls for encoding user-specific text with large language models (LLM). However, feeding the full text of all reviews through an LLM has a weak signal-to-noise ratio and incurs high costs of processed tokens. This paper addresses these two issues. It presents a light-weight framework, called CUP, which first computes concise user profiles and feeds only these into the training of transformer-based recommenders. For user profiles, we devise various techniques to select the most informative cues from noisy reviews. Experiments, with book reviews data, show that fine-tuning a small language model with judiciously constructed profiles achieves the best performance, even in comparison to LLM-generated rankings.
Ghazaleh H. Torbati, Anna Tigunova, Gerhard Weikum, Andrew Yates
Towards Efficient and Explainable Hate Speech Detection via Model Distillation
Abstract
Automatic detection of hate and abusive language is essential to combat its online spread. Moreover, recognising and explaining hate speech serves to educate people about its negative effects. However, most current detection models operate as black boxes, lacking interpretability and explainability. In this context, Large Language Models (LLMs) have proven effective for detecting hate speech and for promoting interpretability. Nevertheless, they are computationally costly to run. In this work, we propose distilling large language models by using Chain-of-Thought to extract explanations that support the hate speech classification task. Having small language models for these tasks will contribute to their use in operational settings. In this paper, we demonstrate that distilled models deliver explanations of the same quality as larger models while surpassing them in classification performance. This dual capability (classifying and explaining) advances hate speech detection, making it more affordable, understandable and actionable. (Our code, models and prompts are available at https://github.com/palomapiot/distil-metahate.)
Paloma Piot, Javier Parapar
Visual Latent Captioning - Towards Verbalizing Vision Transformer Encoders
Abstract
The efficient adaptability of large multimodal models to downstream tasks depends on understanding how these models have learned their knowledge and how the information can be accessed and manipulated to achieve the desired performance. Beyond performance, transparency is also important for understanding and evaluating a model’s behavior. In this paper, we propose a novel method for analyzing each layer of the vision encoder within vision-language models in the form of natural language. We leverage the model’s multimodal text decoder to generate captions for visual features at each layer of its transformer-based vision encoder. In essence, we use the model to interpret its own components by translating information from one modality to another. Subsequently, we use a large language model to interpret the generated captions to provide insight into the type of information represented at each layer of the vision encoder. We specifically track detectable visual classes, such as actions, objects, and colors, to determine in which layer sufficient visual information has been accumulated to form more complex descriptions. We find that the textual representations develop while progressing through the layers, starting from simple visual characteristics to complex scene descriptions featuring multiple objects. The detection of actions starts by first generating a prototypical action in layer 18, which is then refined in later layers. Our code is available online (https://github.com/SogolHaghighat/latent_verbalizer).
Sogol Haghighat, Tim Daniel Metzler, Santosh Thoduka, Sebastian Houben
On the Robustness of Generative Information Retrieval Models: An Out-of-Distribution Perspective
Abstract
Generative information retrieval methods retrieve documents by directly generating their identifiers. Much effort has been devoted to developing effective generative information retrieval (IR) models. Less attention has been paid to the robustness of these models. It is critical to assess the out-of-distribution (OOD) generalization of generative IR models, i.e., how would such models generalize to new distributions? To answer this question, we focus on OOD scenarios from four perspectives in retrieval problems: (i) query variations; (ii) unseen query types; (iii) unseen tasks; and (iv) corpus expansion. Based on this taxonomy, we conduct empirical studies to analyze the OOD robustness of representative generative IR models against dense retrieval models. Our empirical results indicate that the OOD robustness of generative IR models is in need of improvement. By inspecting the OOD robustness of generative IR models we aim to contribute to the development of more reliable IR models. The code is available at https://github.com/Davion-Liu/GR_OOD.
Yu-An Liu, Ruqing Zhang, Jiafeng Guo, Changjiang Zhou, Maarten de Rijke, Xueqi Cheng
Towards Reliable Testing for Multiple Information Retrieval System Comparisons
Abstract
Null Hypothesis Significance Testing is the de facto tool for assessing effectiveness differences between Information Retrieval systems. Researchers use statistical tests to check whether those differences will generalise to online settings or are just due to the samples observed in the laboratory. Much work has been devoted to studying which test is the most reliable when comparing a pair of systems, but most real-world IR experiments involve more than two. In the multiple comparisons scenario, testing several systems simultaneously may inflate the errors committed by the tests. In this paper, we use a new approach to assess the reliability of multiple comparison procedures using simulated and real TREC data. Experiments show that the Wilcoxon test combined with the Benjamini-Hochberg correction yields Type I error rates in line with the significance level for typical sample sizes while being the best test in terms of statistical power.
David Otero, Javier Parapar, Álvaro Barreiro
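The abstract above names the Benjamini-Hochberg correction for multiple comparisons. The following is a minimal sketch of the standard step-up procedure, assuming the p-values from the individual pairwise tests (for example, Wilcoxon tests) have already been computed; the function name and the example p-values are hypothetical and are not taken from the paper.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per hypothesis: True where the null is rejected."""
    m = len(p_values)
    # Indices of the hypotheses, ordered by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k (1-based) whose p-value
    # satisfies p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values from ten pairwise system comparisons.
example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
decisions = benjamini_hochberg(example, alpha=0.05)
```

Because the rule is step-up, a p-value that misses its own threshold is still rejected when a larger p-value further down the sorted list meets its threshold.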
Leveraging High-Resolution Features for Improved Deep Hashing-Based Image Retrieval
Abstract
Deep hashing techniques have emerged as the predominant approach for efficient image retrieval. Traditionally, these methods utilize pre-trained convolutional neural networks (CNNs) such as AlexNet and VGG-16 as feature extractors. However, the increasing complexity of datasets reveals the limitations of these backbone architectures in capturing meaningful features essential for effective image retrieval. In this study, we explore the efficacy of employing high-resolution features learned through state-of-the-art techniques for image retrieval tasks. Specifically, we propose a novel methodology that utilizes High-Resolution Networks (HRNets) as the backbone for the deep hashing task, termed High-Resolution Hashing Network (HHNet). Our approach demonstrates superior performance compared to existing methods across all tested benchmark datasets, including CIFAR-10, NUS-WIDE, MS COCO, and ImageNet. This performance improvement is more pronounced for complex datasets, which highlights the need to learn high-resolution features for intricate image retrieval tasks. Furthermore, we conduct a comprehensive analysis of different HRNet configurations and provide insights into the optimal architecture for the deep hashing task.
Aymene Berriche, Mehdi Zakaria Adjal, Riyadh Baghdadi
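Hashing-based retrieval, as discussed in the abstract above, generally ranks database images by the Hamming distance between binary codes. The sketch below illustrates only that generic retrieval step, assuming codes are stored as Python integers; it does not reproduce HHNet or its backbone, and the function names and codes are hypothetical.

```python
def hamming_distance(code_a, code_b):
    """Number of bit positions in which two binary codes differ."""
    return bin(code_a ^ code_b).count("1")


def rank_by_hamming(query_code, database_codes):
    """Return database indices ordered from nearest to farthest code."""
    return sorted(
        range(len(database_codes)),
        key=lambda i: hamming_distance(query_code, database_codes[i]),
    )


# Hypothetical 4-bit codes for a query and three database images.
ranking = rank_by_hamming(0b1010, [0b0101, 0b1010, 0b1000])
```

XOR followed by a bit count keeps the comparison cheap, which is the main efficiency argument for binary codes over real-valued feature vectors.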
Backmatter
Metadata
Title
Advances in Information Retrieval
Edited by
Claudia Hauff
Craig Macdonald
Dietmar Jannach
Gabriella Kazai
Franco Maria Nardini
Fabio Pinelli
Fabrizio Silvestri
Nicola Tonellotto
Copyright Year
2025
Electronic ISBN
978-3-031-88711-6
Print ISBN
978-3-031-88710-9
DOI
https://doi.org/10.1007/978-3-031-88711-6