
2024 | Book

Advances in Information Retrieval

46th European Conference on Information Retrieval, ECIR 2024, Glasgow, UK, March 24–28, 2024, Proceedings, Part IV

Edited by: Nazli Goharian, Nicola Tonellotto, Yulan He, Aldo Lipani, Graham McDonald, Craig Macdonald, Iadh Ounis

Publisher: Springer Nature Switzerland

Book Series: Lecture Notes in Computer Science

About this book

The six-volume set LNCS 14608, 14609, 14610, 14611, 14612 and 14613 constitutes the refereed proceedings of the 46th European Conference on IR Research, ECIR 2024, held in Glasgow, UK, during March 24–28, 2024.

The 57 full papers, 18 finding papers, 36 short papers, 26 IR4Good papers, 18 demonstration papers, 9 reproducibility papers, 8 doctoral consortium papers, and 15 invited CLEF papers were carefully reviewed and selected from 578 submissions. The accepted papers cover the state of the art in information retrieval focusing on user aspects, system and foundational aspects, machine learning, applications, evaluation, new social and technical challenges, and other topics of direct or indirect relevance to search.

Table of Contents

Frontmatter

Short Papers

Frontmatter
ChatGPT Goes Shopping: LLMs Can Predict Relevance in eCommerce Search

The dependence on human relevance judgments limits the development of information retrieval test collections that are vital for evaluating these systems. Since their launch, large language models (LLMs) have been applied to automate several human tasks. Recently, LLMs started being used to provide relevance judgments for document search. In this work, our goal is to assess whether LLMs can replace human annotators in a different setting: product search in eCommerce. We conducted experiments on open and proprietary industrial datasets to measure the LLMs' ability to predict relevance judgments. Our results show that LLM-generated relevance assessments present a strong agreement (∼82%) with human annotations, indicating that LLMs have an innate ability to perform relevance judgments in an eCommerce setting. We then went further and tested whether LLMs can generate annotation guidelines. Our results show that relevance assessments obtained with LLM-generated guidelines are as accurate as those obtained from human instructions. (The source code for this work is available at https://github.com/danimtk/chatGPT-goes-shopping )
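As a hedged illustration of the workflow this abstract describes, the sketch below prompts an LLM for binary query-product relevance labels and measures agreement against human annotations. The prompt wording, label scale, model name, and the judge_relevance helper are illustrative assumptions, not the authors' exact setup.

```python
# Hypothetical sketch: LLM-as-annotator for query-product relevance.
# Prompt, label scale, and model name are illustrative assumptions.
from openai import OpenAI
from sklearn.metrics import cohen_kappa_score

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_relevance(query: str, product_title: str) -> int:
    """Ask the LLM for a binary relevance label (1 = relevant, 0 = not)."""
    prompt = (
        f"Query: {query}\nProduct: {product_title}\n"
        "Is this product relevant to the query? Answer with 1 or 0 only."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model; the paper evaluates ChatGPT
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip()[0])

def agreement(examples):
    """examples: iterable of (query, product_title, human_label) triples."""
    llm = [judge_relevance(q, t) for q, t, _ in examples]
    human = [h for _, _, h in examples]
    raw = sum(a == b for a, b in zip(llm, human)) / len(human)
    return raw, cohen_kappa_score(human, llm)  # raw agreement and Cohen's kappa
```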

Beatriz Soviero, Daniel Kuhn, Alexandre Salle, Viviane Pereira Moreira
Taxonomy of Mathematical Plagiarism

Plagiarism is a pressing concern, even more so with the availability of large language models. Existing plagiarism detection systems reliably find copied and moderately reworded text but fail for idea plagiarism, especially in mathematical science, which heavily uses formal mathematical notation. We make two contributions. First, we establish a taxonomy of mathematical content reuse by annotating 122 potentially plagiarised scientific document pairs. Second, we analyze the best-performing approaches to detect plagiarism and mathematical content similarity on the newly established taxonomy. We found that the best-performing methods for plagiarism and math content similarity achieve an overall detection score (PlagDet) of 0.06 and 0.16, respectively. The best-performing methods failed to detect most cases from all seven newly established math similarity types. The outlined contributions will benefit research in plagiarism detection systems, recommender systems, question-answering systems, and search engines. We make our experiment's code and annotated dataset available to the community: https://github.com/gipplab/Taxonomy-of-Mathematical-Plagiarism .

Ankit Satpute, André Greiner-Petter, Noah Gießing, Isabel Beckenbach, Moritz Schubotz, Olaf Teschke, Akiko Aizawa, Bela Gipp
Unraveling Disagreement Constituents in Hateful Speech

This paper presents a probabilistic semantic approach to identifying disagreement-related textual constituents in hateful content. We define several methodologies that exploit the selected constituents to determine whether a message could lead to disagreement. The proposed approach is evaluated on four datasets made available for the SemEval 2023 Task 11 shared task, highlighting that a few constituents can be used as a proxy to identify whether a sentence could be perceived differently by multiple readers. The source code of our approaches is publicly available ( https://github.com/MIND-Lab/Unrevealing-Disagreement-Constituents-in-Hateful-Speech ).

Giulia Rizzi, Alessandro Astorino, Paolo Rosso, Elisabetta Fersini
Context-Aware Query Term Difficulty Estimation for Performance Prediction

Research has already found that many retrieval methods are sensitive to the choice and order of terms that appear in a query, which can significantly impact retrieval effectiveness. We capitalize on this finding in order to predict the performance of a query. More specifically, we propose to learn query term difficulty weights within the context of each query, which can then be used as indicators of whether each query term is likely to make the query more effective or not. We show how such difficulty weights can be learnt through the fine-tuning of a language model. In addition, we propose an approach to integrate the learnt weights into a cross-encoder architecture to predict query performance. Our findings demonstrate that our method delivers consistently strong performance prediction on the MSMARCO collection and its associated, widely used TREC Deep Learning track query sets (MSMARCO Dev, TREC DL'19, '20, Hard) across a range of evaluation metrics (Kendall, Spearman, sMARE).

Abbas Saleminezhad, Negar Arabzadeh, Soosan Beheshti, Ebrahim Bagheri
Learning to Jointly Transform and Rank Difficult Queries

Recent empirical studies have shown that while neural rankers exhibit increasingly higher retrieval effectiveness on tasks such as ad hoc retrieval, these improved performances are not experienced uniformly across the range of all queries. There is typically a large subset of queries that are not satisfied by neural rankers; these queries are often referred to as difficult queries. Given that neural rankers operate based on the similarity between the embedding representations of queries and their relevant documents, the poor performance on difficult queries can be due to the sub-optimal representations learnt for them. As such, the objective of our work in this paper is to learn to rank documents and transform query representations in tandem, such that query representations are transformed into ones that more closely resemble their relevant documents. This way, our method provides the opportunity to satisfy a large number of difficult queries that would otherwise not be addressed. In order to learn to jointly rank documents and transform queries, we propose to integrate two forms of triplet loss functions into neural rankers, ensuring that each query is moved along the embedding space, through the transformation of its embedding representation, so that it is placed close to its relevant document(s). We perform experiments on the MS MARCO passage ranking task and show that our proposed method achieves noticeable performance improvements for queries that were extremely difficult for existing neural rankers. On average, our approach satisfies 277 queries with an MRR@10 of 0.21 among queries that had a reciprocal rank of zero with the initial neural ranker.
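The joint objective described above can be sketched, under stated assumptions, as two triplet terms: one ranking documents against the raw query and one pulling a transformed query embedding toward its relevant document. The residual linear transformation, margin, and loss combination below are illustrative guesses, not the authors' architecture.

```python
# Illustrative sketch (not the authors' exact model): a learned residual
# transformation of the query embedding, trained jointly with a ranking
# objective via two triplet-loss terms.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryTransform(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(dim, dim)  # assumed simple linear transformation

    def forward(self, q_emb: torch.Tensor) -> torch.Tensor:
        return q_emb + self.proj(q_emb)  # residual shift of the query embedding

def joint_loss(q_emb, pos_doc_emb, neg_doc_emb, transform, margin: float = 0.2):
    """One triplet term ranks documents against the raw query; the other pushes
    the transformed query toward its relevant document and away from a negative."""
    rank_loss = F.triplet_margin_loss(q_emb, pos_doc_emb, neg_doc_emb, margin=margin)
    tq = transform(q_emb)
    transform_loss = F.triplet_margin_loss(tq, pos_doc_emb, neg_doc_emb, margin=margin)
    return rank_loss + transform_loss
```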

Amin Bigdeli, Negar Arabzadeh, Ebrahim Bagheri
Estimating Query Performance Through Rich Contextualized Query Representations

State-of-the-art query performance prediction methods rely on the fine-tuning of contextual language models to estimate retrieval effectiveness on a per-query basis. Our work in this paper builds on this strong foundation and proposes to learn rich query representations by modeling the interactions between the query and two important types of contextual information, namely (1) the set of documents retrieved by that query, and (2) the set of similar historical queries with known retrieval effectiveness. We propose that such contextualized query representations can be more accurate estimators of query performance, as they embed the performance of past similar queries and the semantics of the documents retrieved by the query. We perform extensive experiments on the MSMARCO collection and its accompanying query sets, including the MSMARCO Dev set, the TREC Deep Learning tracks of 2019, 2020 and 2021, and DL-Hard. Our experiments reveal that our proposed method shows robust and effective performance compared to state-of-the-art baselines.

Sajad Ebrahimi, Maryam Khodabakhsh, Negar Arabzadeh, Ebrahim Bagheri
Instant Answering in E-Commerce Buyer-Seller Messaging Using Message-to-Question Reformulation

E-commerce customers frequently seek detailed product information for purchase decisions, commonly contacting sellers directly with extended queries. This manual response requirement imposes additional costs and disrupts the buyer's shopping experience, with response times ranging from hours to days. We seek to automate buyer inquiries to sellers in a leading e-commerce store using a domain-specific federated Question Answering (QA) system. The main challenge is adapting current QA systems, designed for single questions, to address detailed customer queries. We address this with a low-latency, sequence-to-sequence approach, Message-to-Question (M2Q), which reformulates buyer messages into succinct questions by identifying and extracting the most salient information from a message. Evaluation against baselines shows that M2Q yields relative increases of 757% in question understanding and 1,746% in answering rate from the federated QA system. Live deployment shows that automatic answering saves sellers from manually responding to millions of messages per year, and also accelerates customer purchase decisions by eliminating the need for buyers to wait for a reply.

Besnik Fetahu, Tejas Mehta, Qun Song, Nikhita Vedula, Oleg Rokhlenko, Shervin Malmasi
SoftQE: Learned Representations of Queries Expanded by LLMs

We investigate the integration of Large Language Models (LLMs) into query encoders to improve dense retrieval without increasing latency and cost, by circumventing the dependency on LLMs at inference time. SoftQE incorporates knowledge from LLMs by mapping embeddings of input queries to those of the LLM-expanded queries. While improvements over various strong baselines on in-domain MS-MARCO metrics are marginal, SoftQE improves performance by 2.83 absolute percentage points on average on five out-of-domain BEIR tasks.
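A minimal sketch of the distillation idea this abstract describes: the query encoder is trained so that its embedding of the raw query approaches the (frozen) embedding of the LLM-expanded query, so the LLM is not needed at inference time. The loss mix and weighting below are assumptions, not SoftQE's actual objective.

```python
# Hedged sketch of the distillation signal: the query encoder's embedding of the
# raw query is pulled toward the frozen embedding of the LLM-expanded query,
# alongside an ordinary retrieval term. Loss mix and alpha are assumptions.
import torch
import torch.nn.functional as F

def softqe_style_loss(query_emb: torch.Tensor,
                      expanded_query_emb: torch.Tensor,
                      pos_doc_emb: torch.Tensor,
                      alpha: float = 0.5) -> torch.Tensor:
    distill = F.mse_loss(query_emb, expanded_query_emb.detach())          # match expansion
    retrieval = 1.0 - F.cosine_similarity(query_emb, pos_doc_emb).mean()  # stay retrievable
    return alpha * distill + (1.0 - alpha) * retrieval
```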

Varad Pimpalkhute, John Heyer, Xusen Yin, Sameer Gupta
Towards Automated End-to-End Health Misinformation Free Search with a Large Language Model

In the information age, health misinformation remains a notable challenge to public welfare. Integral to addressing this issue is the development of search systems adept at identifying and filtering out misleading content. This paper presents the automation of Vera, a state-of-the-art consumer health search system. While Vera can discern articles containing misinformation, it requires expert ground truth answers and rule-based reformulations. We introduce an answer prediction module that integrates GPT_x with Vera and a GPT-based query reformulator to yield high-quality stance reformulations and boost downstream retrieval effectiveness. Further, we find that chain-of-thought reasoning is paramount to higher effectiveness. When assessed in the TREC Health Misinformation Track of 2022, our systems surpassed all competitors, including human-in-the-loop configurations, underscoring their pivotal role in the evolution towards a health misinformation-free search landscape. We provide all code necessary to reproduce our results at https://github.com/castorini/pygaggle .

Ronak Pradeep, Jimmy Lin
Weighted AUReC: Handling Skew in Shard Map Quality Estimation for Selective Search

In selective search, a document collection is partitioned into a collection of topical index shards. To efficiently estimate the topical coherence (or quality) of a shard map, the AUReC measure was introduced. AUReC makes the assumption that shards are of similar sizes, one that is violated in practice, even for unsupervised approaches. The problem might be amplified if supervised labelling approaches with skewed class distributions are used. To estimate the quality of such unbalanced shard maps, we introduce a weighted adaptation of the AUReC measure and empirically evaluate its effectiveness using the ClueWeb09B and Gov2 datasets. We show that it closely matches the evaluations of the original AUReC when shards are similar in size, but better captures the differences in performance when shard sizes are skewed.

Gijs Hendriksen, Djoerd Hiemstra, Arjen P. de Vries

Reproducibility Papers

Frontmatter
A Second Look on BASS – Boosting Abstractive Summarization with Unified Semantic Graphs
A Replication Study

We present a detailed replication study of the BASS framework, an abstractive summarization system based on the notion of Unified Semantic Graphs. Our investigation includes challenges in replicating key components and an ablation study to systematically isolate error sources rooted in replicating novel components. Our findings reveal discrepancies in performance compared to the original work. We highlight the significance of paying careful attention even to reasonably omitted details for replicating advanced frameworks like BASS, and emphasize key practices for writing replicable papers.

Osman Alperen Koraş, Jörg Schlötterer, Christin Seifert
Performance Comparison of Session-Based Recommendation Algorithms Based on GNNs

In session-based recommendation settings, a recommender system has to base its suggestions on the user interactions that are observed in an ongoing session. Since such sessions can consist of only a small set of interactions, various approaches based on Graph Neural Networks (GNN) were recently proposed, as they allow us to integrate various types of side information about the items in a natural way. Unfortunately, a variety of evaluation settings are used in the literature, e.g., in terms of protocols, metrics and baselines, making it difficult to assess what represents the state of the art. In this work, we present the results of an evaluation of eight recent GNN-based approaches that were published in high-quality outlets. For a fair comparison, all models are systematically tuned and tested under identical conditions using three common datasets. We furthermore include k-nearest-neighbor and sequential rules-based models as baselines, as such models have previously exhibited competitive performance results for similar settings. To our surprise, the evaluation showed that the simple models outperform all recent GNN models in terms of the Mean Reciprocal Rank, which we used as an optimization criterion, and were only outperformed in three cases in terms of the Hit Rate. Additional analyses furthermore reveal that several other factors that are often not deeply discussed in papers, e.g., random seeds, can markedly impact the performance of GNN-based models. Our results therefore (a) point to continuing issues in the community in terms of research methodology and (b) indicate that there is ample room for improvement in session-based recommendation.

Faisal Shehzad, Dietmar Jannach
A Reproducibility Study of Goldilocks: Just-Right Tuning of BERT for TAR

Screening documents is a tedious and time-consuming aspect of high-recall retrieval tasks, such as compiling a systematic literature review, where the goal is to identify all relevant documents for a topic. To help streamline this process, many Technology-Assisted Review (TAR) methods leverage active learning techniques to reduce the number of documents requiring review. BERT-based models have shown high effectiveness in text classification, leading to interest in their potential use in TAR workflows. In this paper, we investigate recent work that examined the impact of further pre-training epochs on the effectiveness and efficiency of a BERT-based active learning pipeline. We first report that we could replicate the original experiments on two specific TAR datasets, confirming some of the findings: importantly, that further pre-training is critical to high effectiveness, but requires attention in terms of selecting the correct training epoch. We then investigate the generalisability of the pipeline on a different TAR task, that of medical systematic reviews. In this context, we show that there is no need for further pre-training if a domain-specific BERT backbone is used within the active learning pipeline. This finding provides practical implications for using the studied active learning pipeline within domain-specific TAR tasks.

Xinyu Mao, Bevan Koopman, Guido Zuccon
Optimizing BERTopic: Analysis and Reproducibility Study of Parameter Influences on Topic Modeling

This paper reproduces key experiments and results from the BERTopic neural topic modeling framework. We validate prior findings regarding the role of text preprocessing, embedding models and term weighting strategies in optimizing BERTopic's modular pipeline. Specifically, we show that advanced embedding models like MPNet benefit from raw input while simpler models like GloVe perform better with preprocessed text. We also demonstrate that excluding outlier documents from the topic model provides minimal gains. Additionally, we highlight that appropriate term weighting schemes, such as √TF-BM25(IDF), are critical for topic quality. We manage to reproduce prior results, and our rigorous reproductions affirm the effectiveness of BERTopic's flexible framework while providing novel insights into tuning its components for enhanced topic modeling performance. The findings offer guidance and provide insightful refinements and clarifications, serving as a valuable reference for both researchers and practitioners applying clustering-based neural topic modeling.
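For concreteness, here is a small sketch of the √TF-BM25(IDF) weighting named above, assuming a standard BM25-style IDF; the variable names and normalisation details are illustrative rather than BERTopic's verbatim code.

```python
# Hedged sketch of sqrt(TF)-BM25(IDF) term weighting with a standard BM25-style
# IDF; not BERTopic's verbatim implementation.
import numpy as np

def sqrt_tf_bm25_idf(tf: np.ndarray, df: np.ndarray, n_docs: int) -> np.ndarray:
    """tf: term frequencies per topic (topics x vocab); df: document frequency
    of each term; n_docs: total number of documents."""
    idf = np.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))  # BM25-style IDF
    return np.sqrt(tf) * idf                               # sqrt-damped TF

# Example: 2 topics, 3 vocabulary terms.
weights = sqrt_tf_bm25_idf(
    tf=np.array([[9.0, 1.0, 0.0], [0.0, 4.0, 16.0]]),
    df=np.array([5, 2, 8]),
    n_docs=10,
)
```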

Martin Borčin, Joemon M. Jose
Does the Performance of Text-to-Image Retrieval Models Generalize Beyond Captions-as-a-Query?

Text-image retrieval (T2I) refers to the task of recovering all images relevant to a keyword query. Popular datasets for text-image retrieval, such as Flickr30k, VG, or MS-COCO, utilize annotated image captions, e.g., "a man playing with a kid", as a surrogate for queries. With such surrogate queries, current multi-modal machine learning models, such as CLIP or BLIP, perform remarkably well. The main reason is the descriptive nature of captions, which detail the content of an image. Yet, T2I queries go beyond the mere descriptions in image-caption pairs. Thus, these datasets are ill-suited to test methods on more abstract or conceptual queries, e.g., "family vacations". In such queries, the image content is implied rather than explicitly described. In this paper, we replicate the T2I results on descriptive queries and generalize them to conceptual queries. To this end, we perform new experiments on a novel T2I benchmark for the task of conceptual query answering, called ConQA. ConQA comprises 30 descriptive and 50 conceptual queries on 43k images, with more than 100 manually annotated images per query. Our results on established measures show that both large pretrained models (e.g., CLIP, BLIP, and BLIP2) and small models (e.g., SGRAF and NAAF) perform up to 4× better on descriptive than on conceptual queries. We also find that the models perform better on queries with more than 6 keywords, as in MS-COCO captions.
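A minimal sketch of the kind of zero-shot text-to-image scoring evaluated here, using a pretrained CLIP checkpoint from Hugging Face; the checkpoint name and image files are placeholders, and ConQA-specific evaluation code is not shown.

```python
# Hedged sketch of zero-shot text-to-image retrieval with CLIP. The checkpoint
# and image paths are placeholders; dataset-specific evaluation is omitted.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

query = "family vacations"  # a conceptual query rather than a caption
images = [Image.open(p) for p in ["img1.jpg", "img2.jpg"]]  # placeholder files

inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    scores = model(**inputs).logits_per_text[0]  # similarity of the query to each image

ranking = scores.argsort(descending=True)  # image indices ordered by predicted relevance
```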

Juan Manuel Rodriguez, Nima Tavassoli, Eliezer Levy, Gil Lederman, Dima Sivov, Matteo Lissandrini, Davide Mottin
Exploring the Nexus Between Retrievability and Query Generation Strategies

Quantifying bias in retrieval functions through document retrievability scores is vital for assessing recall-oriented retrieval systems. However, many studies investigating retrieval model bias lack validation of their query generation methods as accurate representations of retrievability for real users and their queries. This limitation results from the absence of established criteria for query generation in retrievability assessments. Typically, researchers resort to using frequent collocations from document corpora when no query log is available. In this study, we address the issue of reproducibility and seek to validate query generation methods by comparing retrievability scores generated from artificially generated queries to those derived from query logs. Our findings demonstrate a minimal or negligible correlation between retrievability scores from artificial queries and those from query logs. This suggests that artificially generated queries may not accurately reflect retrievability scores as derived from query logs. We further explore alternative query generation techniques, uncovering a variation that exhibits the highest correlation. This alternative approach holds promise for improving reproducibility when query logs are unavailable.
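For reference, retrievability is commonly computed with a cumulative top-c formulation; the sketch below follows that common definition (and notes the usual Gini-based bias summary), with the retrieval function left as a placeholder rather than any specific system from the study.

```python
# Sketch of the cumulative retrievability score r(d): how often a document
# appears in the top-c results over a set of queries. The retrieval function
# is a placeholder for any system under study.
from collections import Counter

def retrievability(queries, retrieve, c: int = 100) -> Counter:
    """queries: iterable of query strings; retrieve(q): ranked list of doc ids."""
    r = Counter()
    for q in queries:
        for doc_id in retrieve(q)[:c]:
            r[doc_id] += 1  # each top-c appearance adds to the document's score
    return r

# Bias is then usually summarised with a Gini coefficient over the r(d) values.
```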

Aman Sinha, Priyanshu Raj Mall, Dwaipayan Roy
Reproducibility Analysis and Enhancements for Multi-aspect Dense Retriever with Aspect Learning

Multi-aspect dense retrieval aims to incorporate aspect information (e.g., brand and category) into dual encoders to facilitate relevance matching. As an early and representative multi-aspect dense retriever, MADRAL learns several extra aspect embeddings and fuses the explicit aspects with an implicit aspect “OTHER” for final representation. MADRAL was evaluated on proprietary data and its code was not released, making it challenging to validate its effectiveness on other datasets. We failed to reproduce its effectiveness on the public MA-Amazon data, motivating us to probe the reasons and re-examine its components. We propose several component alternatives for comparisons, including replacing “OTHER” with “CLS” and representing aspects with the first several content tokens. Through extensive experiments, we confirm that learning “OTHER” from scratch in aspect fusion is harmful. In contrast, our proposed variants can greatly enhance the retrieval performance. Our research not only sheds light on the limitations of MADRAL but also provides valuable insights for future studies on more powerful multi-aspect dense retrieval models. Code will be released at: https://github.com/sunxiaojie99/Reproducibility-for-MADRAL .

Keping Bi, Xiaojie Sun, Jiafeng Guo, Xueqi Cheng
Measuring Item Fairness in Next Basket Recommendation: A Reproducibility Study

Item fairness of recommender systems aims to evaluate whether items receive a fair share of exposure according to different definitions of fairness. Raj and Ekstrand [26] study multiple fairness metrics under a common evaluation framework and test their sensitivity with respect to various configurations. They find that fairness metrics show varying degrees of sensitivity towards position weighting models and parameter settings under different information access systems. Although their study considers various domains and datasets, their findings do not necessarily generalize to next basket recommendation (NBR) where users exhibit a more repeat-oriented behavior compared to other recommendation domains. This paper investigates fairness metrics in the NBR domain under a unified experimental setup. Specifically, we directly evaluate the item fairness of various NBR methods. These fairness metrics rank NBR methods in different orders, while most of the metrics agree that repeat-biased methods are fairer than explore-biased ones. Furthermore, we study the effect of unique characteristics of the NBR task on the sensitivity of the metrics, including the basket size, position weighting models, and user repeat behavior. Unlike the findings in [26], Inequity of Amortized Attention (IAA) is the most sensitive metric, as observed in multiple experiments. Our experiments lead to novel findings in the field of NBR and fairness. We find that Expected Exposure Loss (EEL) and Expected Exposure Disparity (EED) are the most robust and adaptable fairness metrics to be used in the NBR domain.

Yuanna Liu, Ming Li, Mozhdeh Ariannezhad, Masoud Mansoury, Mohammad Aliannejadi, Maarten de Rijke
Query Generation Using Large Language Models
A Reproducibility Study of Unsupervised Passage Reranking

Existing passage retrieval techniques predominantly emphasize classification or dense matching strategies. This is in contrast with classic language modeling approaches focusing on query or question generation. Recently, Sachan et al. introduced an Unsupervised Passage Retrieval (UPR) approach that resembles this by exploiting the inherent generative capabilities of large language models. In this replicability study, we revisit the concept of zero-shot question generation for re-ranking and focus our investigation on the ranking experiments, validating the UPR findings, particularly on the widely recognized BEIR benchmark. Furthermore, we extend the original work by evaluating the proposed method additionally on the TREC Deep Learning track benchmarks of 2019 and 2020. To enhance our understanding of the technique’s performance, we introduce novel experiments exploring the influence of different prompts on retrieval outcomes. Our comprehensive analysis provides valuable insights into the robustness and applicability of zero-shot question generation as a re-ranking strategy in passage retrieval.
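A hedged sketch of the zero-shot question-generation re-ranking idea revisited here: each passage is scored by the log-likelihood of the query under a generation prompt. The checkpoint and prompt wording are assumptions, not necessarily those used in the study.

```python
# Hedged sketch of re-ranking by zero-shot question generation: a passage is
# scored by the log-likelihood of the query given a generation prompt.
# Checkpoint and prompt wording are assumptions.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tok = T5Tokenizer.from_pretrained("google/flan-t5-base")  # assumed checkpoint
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base").eval()

def question_generation_score(query: str, passage: str) -> float:
    prompt = f"Passage: {passage} Please write a question based on this passage."
    inputs = tok(prompt, return_tensors="pt", truncation=True)
    labels = tok(query, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(**inputs, labels=labels)  # loss = mean NLL of the query tokens
    return -out.loss.item()                   # higher = query more likely given passage

def rerank(query: str, passages: list[str]) -> list[str]:
    return sorted(passages, key=lambda p: question_generation_score(query, p), reverse=True)
```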

David Rau, Jaap Kamps

IR for Good Papers

Frontmatter
Absolute Variation Distance: An Inversion Attack Evaluation Metric for Federated Learning

Federated Learning (FL) has emerged as a pivotal approach for training models on decentralized data sources by sharing only model gradients. However, the shared gradients in FL are susceptible to inversion attacks which can expose sensitive information. While several defense and attack strategies have been proposed, their effectiveness is often evaluated using metrics that may not necessarily reflect the success rate of an attack or information retrieval, especially in the context of multidimensional data such as images. Traditional metrics like the Structural Similarity Index (SSIM), Peak Signal-to-Noise Ratio (PSNR), and Mean Squared Error (MSE) are typically used as lightweight metrics; they assume only pixel-wise comparison and fail to consider the semantic context of the recovered data. This paper introduces the Absolute Variation Distance (AVD), a lightweight metric derived from total variation, to assess data recovery and information leakage in FL. Unlike traditional metrics, AVD offers a continuous measure for extracting information in noisy images and aligns closely with human perception. Our results, combined with a user experience survey, demonstrate that AVD provides a more accurate and consistent measure of data recovery. It also matches the accuracy of the more costly and complex neural-network-based metric, the Learned Perceptual Image Patch Similarity (LPIPS). Hence it offers an effective tool for the automatic evaluation of data security in FL and a reliable way of studying defence and inversion attack strategies in FL.

Georgios Papadopoulos, Yash Satsangi, Shaltiel Eloul, Marco Pistoia
Ranking Distance Metric for Privacy Budget in Distributed Learning of Finite Embedding Data

Federated Learning (FL) is a collective, distributed learning paradigm that aims to preserve the privacy of data. Recent studies have shown FL models to be vulnerable to reconstruction attacks that compromise data privacy by inverting gradients computed on confidential data. To address the challenge of defending against these attacks, it is common to employ methods that guarantee data confidentiality using the principles of Differential Privacy (DP). However, in many cases, especially for machine learning models trained on unstructured data such as text, evaluating privacy also requires considering the finite embedding space of a client's private data. In this study, we show how privacy in a distributed FL setup is sensitive to the underlying finite embeddings of the confidential data. We show that privacy can be quantified for a client batch that uses either noise, or a mixture of finite embeddings, by introducing a normalised rank distance (d_rank). This measure has the advantage of taking into account the size of a finite vocabulary embedding, and aligns the privacy budget to a partitioned space. We further explore the impact of noise and client batch size on the privacy budget and compare it to the standard ε derived from Local-DP.
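As one plausible (assumed) reading of the normalised rank distance described above, the sketch below ranks the true token embedding among the full finite vocabulary by distance to a reconstructed embedding and normalises by vocabulary size; the paper's exact definition may differ.

```python
# Assumed illustration of a normalised rank distance: rank the true token
# embedding among the finite vocabulary by distance to the reconstructed
# embedding, normalised by vocabulary size. Not the paper's exact definition.
import numpy as np

def rank_distance(recovered: np.ndarray, true_idx: int,
                  vocab_embeddings: np.ndarray) -> float:
    dists = np.linalg.norm(vocab_embeddings - recovered, axis=1)
    rank = int(np.argsort(dists).tolist().index(true_idx))
    return rank / (len(vocab_embeddings) - 1)  # 0 = exact recovery, 1 = worst case
```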

Georgios Papadopoulos, Yash Satsangi, Shaltiel Eloul, Marco Pistoia
Experiments in News Bias Detection with Pre-trained Neural Transformers

The World Wide Web provides unrivalled access to information globally, including factual news reporting and commentary. However, state actors and commercial players increasingly spread biased (distorted) or fake (non-factual) information to promote their agendas. We compare several large, pre-trained language models on the task of sentence-level news bias detection and sub-type classification, providing quantitative and qualitative results. Our findings are to be seen as part of a wider effort towards realizing the conceptual vision, articulated by Fuhr et al. [10], of a "nutrition label" for online content for the social good.

Tim Menzner, Jochen L. Leidner
An Empirical Analysis of Intervention Strategies’ Effectiveness for Countering Misinformation Amplification by Recommendation Algorithms

Social network platforms connect people worldwide, facilitating communication, information sharing, and personal/professional networking. They use recommendation algorithms to personalize content and enhance user experiences. However, these algorithms can unintentionally amplify misinformation by prioritizing engagement over accuracy. For instance, recent works suggest that popularity-based and network-based recommendation algorithms contribute the most to misinformation diffusion. In our study, we present an exploration of two Twitter datasets to understand the impact of intervention techniques on combating misinformation amplification initiated by recommendation algorithms. We simulate various scenarios and evaluate the effectiveness of intervention strategies from the social sciences, such as Virality Circuit Breakers and accuracy nudges. Our findings highlight that these intervention strategies are generally successful when applied on top of collaborative filtering and content-based recommendation algorithms, while having different levels of effectiveness depending on the number of users in the dataset who are keen to spread fake news.

Royal Pathak, Francesca Spezzano
Good for Children, Good for All?

In this work, we reason how focusing on Information Retrieval (IR) for children and involving them in participatory studies would benefit the IR community. The Child Computer Interaction (CCI) community has embraced the child as a protagonist as their main philosophy, regarding children as informants, co-designers, and evaluators, not just users. Leveraging prior literature, we posit that putting children in the centre of the IR world and giving them an active role could enable the IR community to break free from the preexisting bias derived from interpretations inferred from past use by adult users and the still dominant system-oriented approach. This shift would allow researchers to revisit complex foundational concepts that greatly influence the use of IR tools as part of socio-technical systems in different domains. In doing so, IR practitioners could provide more inclusive, and supportive information access experiences to children and other understudied user groups alike in different contexts.

Monica Landoni, Theo Huibers, Emiliana Murgia, Maria Soledad Pera
Not Just Algorithms: Strategically Addressing Consumer Impacts in Information Retrieval

Information Retrieval (IR) systems have a wide range of impacts on consumers. We offer maps to help identify goals IR systems could—or should—strive for, and guide the process of scoping how to gauge a wide range of consumer-side impacts and the possible interventions needed to address these effects. Grounded in prior work on scoping algorithmic impact efforts, our goal is to promote and facilitate research that (1) is grounded in impacts on information consumers, contextualizing these impacts in the broader landscape of positive and negative consumer experience; (2) takes a broad view of the possible means of changing or improving that impact, including non-technical interventions; and (3) uses operationalizations and strategies that are well-matched to the technical, social, ethical, legal, and other dimensions of the specific problem in question.

Michael D. Ekstrand, Lex Beattie, Maria Soledad Pera, Henriette Cramer
A Study of Pre-processing Fairness Intervention Methods for Ranking People

Fairness interventions are hard to use in practice when ranking people due to legal constraints that limit access to sensitive information. Pre-processing fairness interventions, however, can be used in practice to create more fair training data that encourage the model to generate fair predictions without having access to sensitive information during inference. Little is known about the performance of pre-processing fairness interventions in a recruitment setting. To simulate a real scenario, we train a ranking model on pre-processed representations, while access to sensitive information is limited during inference. We evaluate pre-processing fairness intervention methods in terms of individual fairness and group fairness. On two real-world datasets, the pre-processing methods are found to improve the diversity of rankings with respect to gender, while individual fairness is not affected. Moreover, we discuss advantages and disadvantages of using pre-processing fairness interventions in practice for ranking people.

Clara Rus, Andrew Yates, Maarten de Rijke
Fairness Through Domain Awareness: Mitigating Popularity Bias for Music Discovery

As online music platforms continue to grow, music recommender systems play a vital role in helping users navigate and discover content within their vast musical databases. At odds with this larger goal is the presence of popularity bias, which causes algorithmic systems to favor mainstream content over potentially more relevant but niche items. In this work we explore the intrinsic relationship between music discovery and popularity bias through the lens of individual fairness. We propose a domain-aware, individual-fairness-based approach which addresses popularity bias in graph neural network based recommender systems. Our approach uses individual fairness to reflect a ground-truth listening experience, i.e., if two songs sound similar, this similarity should be reflected in their representations. In doing so, we facilitate meaningful music discovery that is resistant to popularity bias and grounded in the music domain. We apply our BOOST methodology to two discovery-based tasks, performing recommendations at both the playlist level and the user level. Then, we ground our evaluation in the cold-start setting, showing that our approach outperforms existing fairness benchmarks in both performance and recommendation of lesser-known content. Finally, our analysis makes the case for the importance of domain awareness when mitigating popularity bias in music recommendation.

Rebecca Salganik, Fernando Diaz, Golnoosh Farnadi
Evaluating the Explainability of Neural Rankers

Information retrieval models have witnessed a paradigm shift from unsupervised statistical approaches to feature-based supervised approaches to completely data-driven ones that make use of the pre-training of large language models. While the increasing complexity of search models has yielded improvements in effectiveness (measured in terms of relevance of top-retrieved results), a question worthy of thorough inspection is "how explainable are these models?", which is what this paper aims to evaluate. In particular, we propose a common evaluation platform to systematically evaluate the explainability of any ranking model (the explanation algorithm being identical for all the models that are to be evaluated). In our proposed framework, each model, in addition to returning a ranked list of documents, is also required to return a list of explanation units or rationales for each document. This meta-information from each document is then used to measure how locally consistent these rationales are, as an intrinsic measure of interpretability, one that does not require manual relevance assessments. Additionally, as an extrinsic measure, we compute how relevant these rationales are by leveraging sub-document level relevance assessments. Our findings show a number of interesting observations, such as that sentence-level rationales are more consistent, that an increase in complexity mostly leads to less consistent explanations, and that interpretability measures offer a complementary dimension of evaluation of IR systems, because consistency is not well-correlated with nDCG at top ranks.

Saran Pandian, Debasis Ganguly, Sean MacAvaney
Is Interpretable Machine Learning Effective at Feature Selection for Neural Learning-to-Rank?

Neural ranking models have become increasingly popular for real-world search and recommendation systems in recent years. Unlike their tree-based counterparts, neural models are much less interpretable. That is, it is very difficult to understand their inner workings and answer questions like how do they make their ranking decisions? or what document features do they find important? This is particularly disadvantageous since interpretability is highly important for real-world systems. In this work, we explore feature selection for neural learning-to-rank (LTR). In particular, we investigate six widely-used methods from the field of interpretable machine learning (ML) and introduce our own modification, to select the input features that are most important to the ranking behavior. To understand whether these methods are useful for practitioners, we further study whether they contribute to efficiency enhancement. Our experimental results reveal a large feature redundancy in several LTR benchmarks: the local selection method TabNet can achieve optimal ranking performance with less than 10 features; the global methods, particularly our G-L2x, require slightly more selected features, but exhibit higher potential in improving efficiency. We hope that our analysis of these feature selection methods will bring the fields of interpretable ML and LTR closer together.

Lijun Lyu, Nirmal Roy, Harrie Oosterhuis, Avishek Anand
Navigating the Thin Line: Examining User Behavior in Search to Detect Engagement and Backfire Effects

Opinionated users often seek information that aligns with their preexisting beliefs while dismissing contradictory evidence due to confirmation bias. This conduct hinders their ability to consider alternative stances when searching the web. Despite this, few studies have analyzed how the diversification of search results on disputed topics influences the search behavior of highly opinionated users. To this end, we present a preregistered user study (n = 257) investigating whether different levels (low and high) of bias metrics and search results presentation (with or without AI-predicted stance labels) can affect the stance diversity consumption and search behavior of opinionated users on three debated topics (i.e., atheism, intellectual property rights, and school uniforms). Our results show that exposing participants to (counter-attitudinally) biased search results increases their consumption of attitude-opposing content, but we also found that bias was associated with a trend toward overall fewer interactions within the search page. We also found that 19% of users interacted with queries and search pages but did not select any search results. When we removed these participants in a post-hoc analysis, we found that stance labels increased the diversity of stances consumed by users, particularly when the search results were biased. Our findings highlight the need for future research to explore distinct search scenario settings to gain insight into opinionated users' behavior.

Federico Maria Cau, Nava Tintarev
Recommendation Fairness in eParticipation: Listening to Minority, Vulnerable and NIMBY Citizens

E-participation refers to the use of digital technologies and online platforms to engage citizens and other stakeholders in democratic and government decision-making processes. Recent research work has explored the application of recommender systems to e-participation, focusing on the development of algorithmic solutions to be effective in terms of personalized content retrieval accuracy, but ignoring underlying societal issues, such as biases, fairness, privacy and transparency. Motivated by this research gap, on a public e-participatory budgeting dataset, we measure and analyze recommendation fairness metrics oriented to several minority, vulnerable and NIMBY (Not In My Back Yard) groups of citizens. Our empirical results show that there is a strong popularity bias (especially for the minority groups) due to how content is presented and accessed in a reference e-participation platform; and that hybrid algorithms exploiting user geolocation information in a collaborative filtering fashion are good candidates to satisfy the proposed fairness conceptualization for the above underrepresented citizen collectives.

Marina Alonso-Cortés, Iván Cantador, Alejandro Bellogín
Responsible Opinion Formation on Debated Topics in Web Search

Web search has evolved into a platform people rely on for opinion formation on debated topics. Yet, pursuing this search intent can carry serious consequences for individuals and society and involves a high risk of biases. We argue that web search can and should empower users to form opinions responsibly and that the information retrieval community is uniquely positioned to lead interdisciplinary efforts to this end. Building on digital humanism—a perspective focused on shaping technology to align with human values and needs—and through an extensive interdisciplinary literature review, we identify challenges and research opportunities that focus on the searcher, search engine, and their complex interplay. We outline a research agenda that provides a foundation for research efforts toward addressing these challenges.

Alisa Rieger, Tim Draws, Nicolas Mattis, David Maxwell, David Elsweiler, Ujwal Gadiraju, Dana McKay, Alessandro Bozzon, Maria Soledad Pera
The Impact of Differential Privacy on Recommendation Accuracy and Popularity Bias

Collaborative filtering-based recommender systems leverage vast amounts of behavioral user data, which poses severe privacy risks. Thus, often random noise is added to the data to ensure Differential Privacy (DP). However, to date, it is not well understood in which ways this impacts personalized recommendations. In this work, we study how DP affects recommendation accuracy and popularity bias when applied to the training data of state-of-the-art recommendation models. Our findings are three-fold: First, we observe that nearly all users’ recommendations change when DP is applied. Second, recommendation accuracy drops substantially while recommended item popularity experiences a sharp increase, suggesting that popularity bias worsens. Finally, we find that DP exacerbates popularity bias more severely for users who prefer unpopular items than for users who prefer popular items.
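As a rough, assumed illustration of the kind of input perturbation studied here, the sketch below applies the Laplace mechanism to a user-item feedback matrix before training; the paper's actual DP mechanism and parameters may differ.

```python
# Assumed illustration only: Laplace noise added to a user-item feedback matrix
# before training, as one simple way of perturbing training data for DP.
import numpy as np

def perturb_interactions(matrix: np.ndarray, epsilon: float = 1.0,
                         sensitivity: float = 1.0, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=matrix.shape)
    return matrix + noise  # smaller epsilon = stronger privacy, more noise
```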

Peter Müllner, Elisabeth Lex, Markus Schedl, Dominik Kowald
Backmatter
Metadata
Title
Advances in Information Retrieval
Edited by
Nazli Goharian
Nicola Tonellotto
Yulan He
Aldo Lipani
Graham McDonald
Craig Macdonald
Iadh Ounis
Copyright Year
2024
Electronic ISBN
978-3-031-56066-8
Print ISBN
978-3-031-56065-1
DOI
https://doi.org/10.1007/978-3-031-56066-8
