## About this book

This two-volume set LNCS 12035 and 12036 constitutes the refereed proceedings of the 42nd European Conference on IR Research, ECIR 2020, held in Lisbon, Portugal, in April 2020.

The 55 full papers presented together with 8 reproducibility papers, 46 short papers, 10 demonstration papers, 12 invited CLEF papers, 7 doctoral consortium papers, 4 workshop papers, and 3 tutorials were carefully reviewed and selected from 457 submissions. They were organized in topical sections named:

Part I: deep learning I; entities; evaluation; recommendation; information extraction; deep learning II; retrieval; multimedia; deep learning III; queries; IR – general; question answering, prediction, and bias; and deep learning IV.

Part II: reproducibility papers; short papers; demonstration papers; CLEF organizers lab track; doctoral consortium papers; workshops; and tutorials.

## Table of contents

### Knowledge Graph Entity Alignment with Graph Convolutional Networks: Lessons Learned

In this work, we focus on the problem of entity alignment in Knowledge Graphs (KG) and we report on our experiences when applying a Graph Convolutional Network (GCN) based model for this task. Variants of GCN are used in multiple state-of-the-art approaches and therefore it is important to understand the specifics and limitations of GCN-based models. Despite serious efforts, we were not able to fully reproduce the results from the original paper and, after a thorough audit of the code provided by the authors, we concluded that their implementation differs from the architecture described in the paper. In addition, several tricks are required to make the model work and some of them are not very intuitive. We provide an extensive ablation study to quantify the effects these tricks and changes of architecture have on final performance. Furthermore, we examine current evaluation approaches and systematize available benchmark datasets. We believe that people interested in KG matching might profit from our work, as well as novices entering the field. (Code: https://github.com/Valentyn1997/kg-alignment-lessons-learned ).

Max Berrendorf, Evgeniy Faerman, Valentyn Melnychuk, Volker Tresp, Thomas Seidl

### The Effect of Content-Equivalent Near-Duplicates on the Evaluation of Search Engines

Current best practices for the evaluation of search engines do not take into account duplicate documents. Depending on their prevalence, not discounting duplicates during evaluation artificially inflates performance scores and penalizes those whose search systems diligently filter them. Although these negative effects were demonstrated long ago by Bernstein and Zobel [4], we find that this has failed to move the community. In this paper, we reproduce the aforementioned study and extend it to incorporate all TREC Terabyte, Web, and Core tracks. The worst-case penalty for having filtered duplicates in any of these tracks was a loss of between 8 and 53 ranks.

Maik Fröbe, Jan Philipp Bittner, Martin Potthast, Matthias Hagen

### From MAXSCORE to Block-Max Wand: The Story of How Lucene Significantly Improved Query Evaluation Performance

The latest major release of Lucene (version 8) in March 2019 incorporates block-max indexes and exploits the block-max variant of Wand for query evaluation, which are innovations that originated from academia. This paper shares the story of how this came to be, which provides an interesting case study at the intersection of reproducibility and academic research achieving impact in the “real world”. We offer additional thoughts on the often idiosyncratic processes by which academic research makes its way into deployed solutions.

Adrien Grand, Robert Muir, Jim Ferenczi, Jimmy Lin

### Which BM25 Do You Mean? A Large-Scale Reproducibility Study of Scoring Variants

When researchers speak of BM25, it is not entirely clear which variant they mean, since many tweaks to Robertson et al.’s original formulation have been proposed. When practitioners speak of BM25, they most likely refer to the implementation in the Lucene open-source search library. Does this ambiguity “matter”? We attempt to answer this question with a large-scale reproducibility study of BM25, considering eight variants. Experiments on three newswire collections show that there are no significant effectiveness differences between them, including Lucene’s often maligned approximation of document length. As an added benefit, our empirical approach takes advantage of databases for rapid IR prototyping, which validates both the feasibility and methodological advantages claimed in previous work.

Chris Kamphuis, Arjen P. de Vries, Leonid Boytsov, Jimmy Lin
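
For orientation, the following is a minimal sketch of one commonly cited BM25 formulation. It is not taken from the paper and does not cover the eight variants it compares. The parameter values `k1=1.2` and `b=0.75`, the smoothed IDF term, and the toy corpus are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a tokenized query.

    Assumed formulation: IDF smoothed as log(1 + (N - df + 0.5) / (df + 0.5)),
    which is always non-negative; other variants use different IDF terms.
    """
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc_terms)
    score = 0.0
    for term in set(query_terms):
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0 or tf[term] == 0:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # Length normalization: longer-than-average documents are penalized.
        norm = 1 - b + b * len(doc_terms) / avg_len
        score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * norm)
    return score

# Hypothetical toy corpus, tokenized by whitespace.
corpus = [
    "lucene is a search library".split(),
    "bm25 is a ranking function".split(),
    "search engines rank documents with bm25".split(),
]
scores = [bm25_score(["bm25", "search"], d, corpus) for d in corpus]
```

In this toy setup the third document matches both query terms and therefore scores highest; differences between variants typically come from the IDF term, the document length approximation, and whether the `(k1 + 1)` factor is kept.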

### The Unfairness of Popularity Bias in Music Recommendation: A Reproducibility Study

Research has shown that recommender systems are typically biased towards popular items, which leads to less popular items being underrepresented in recommendations. The recent work of Abdollahpouri et al. in the context of movie recommendations has shown that this popularity bias leads to unfair treatment of both long-tail items as well as users with little interest in popular items. In this paper, we reproduce the analyses of Abdollahpouri et al. in the context of music recommendation. Specifically, we investigate three user groups from the Last.fm music platform that are categorized based on how much their listening preferences deviate from the most popular music among all Last.fm users in the dataset: (i) low-mainstream users, (ii) medium-mainstream users, and (iii) high-mainstream users. In line with Abdollahpouri et al., we find that state-of-the-art recommendation algorithms favor popular items also in the music domain. However, their proposed Group Average Popularity metric yields different results for Last.fm than for the movie domain, presumably due to the larger number of available items (i.e., music artists) in the Last.fm dataset we use. Finally, we compare the accuracy results of the recommendation algorithms for the three user groups and find that the low-mainstreaminess group receives significantly worse recommendations than the other two groups.

Dominik Kowald, Markus Schedl, Elisabeth Lex

### Reproducibility is a Process, Not an Achievement: The Replicability of IR Reproducibility Experiments

This paper espouses a view of reproducibility in the computational sciences as a process and not just a point-in-time “achievement”. As a concrete case study, we revisit the Open-Source IR Reproducibility Challenge from 2015 and attempt to replicate those experiments: four years later, are those computational artifacts still functional? Perhaps not surprisingly, we are not able to replicate most of the retrieval runs encapsulated by those artifacts in a modern computational environment. We outline the various idiosyncratic reasons why, distilled into a series of “lessons learned” to help form an emerging set of best practices for the long-term sustainability of reproducibility efforts.

Jimmy Lin, Qian Zhang

### On the Replicability of Combining Word Embeddings and Retrieval Models

We replicate recent experiments attempting to demonstrate an attractive hypothesis about the use of the Fisher kernel framework and mixture models for aggregating word embeddings towards document representations and the use of these representations in document classification, clustering, and retrieval. Specifically, the hypothesis was that the use of a mixture model of von Mises-Fisher (VMF) distributions instead of Gaussian distributions would be beneficial because of the focus on cosine distances of both VMF and the vector space model traditionally used in information retrieval. Previous experiments had validated this hypothesis. Our replication was not able to validate it, despite a large parameter scan space.

Luca Papariello, Alexandros Bampoulidis, Mihai Lupu

### Influence of Random Walk Parametrization on Graph Embeddings

Network or graph embedding has gained increasing attention in the research community in recent years. In particular, many methods to create graph embeddings using random-walk-based approaches have been developed. node2vec [10] introduced means to control the random walk behavior, guiding the walks. We aim to reproduce parts of their work and introduce two additional modifications (jump probabilities and attention to hubs), in order to investigate how guiding and modifying the walks influences the learned embeddings. The reproduction includes the case study illustrating homophily and structural equivalence subject to the chosen strategy and a node classification task. We were not able to illustrate structural equivalence and further results show that modifications of the walks only slightly improve node classification, if at all.

Fabian Schliski, Jörg Schlötterer, Michael Granitzer
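
To make the notion of "guiding the walks" concrete, here is a minimal sketch of a node2vec-style second-order biased walk on an unweighted graph, using the return parameter `p` and in-out parameter `q`. The graph representation (a dict of neighbor sets), the function name, and the toy graph are assumptions for illustration; the two additional modifications studied in the paper (jump probabilities and attention to hubs) are not implemented here.

```python
import random

def node2vec_walk(graph, start, length, p=1.0, q=1.0, seed=0):
    """Sample one biased walk; `graph` maps each node to a set of neighbors."""
    rng = random.Random(seed)
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbors = sorted(graph[cur])
        if not neighbors:
            break  # dead end: stop early
        if len(walk) == 1:
            # First step has no previous node, so sample uniformly.
            walk.append(rng.choice(neighbors))
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbors:
            if nxt == prev:
                weights.append(1.0 / p)   # step back to the previous node
            elif nxt in graph[prev]:
                weights.append(1.0)       # stay close to the previous node
            else:
                weights.append(1.0 / q)   # move further away
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk

# Hypothetical toy graph: a triangle (a, b, c) with a tail (c - d - e).
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
walk = node2vec_walk(graph, "a", length=8, p=0.5, q=2.0)
```

A small `q` pushes the walk outward (closer to depth-first exploration), while a large `q` keeps it near the starting neighborhood; a small `p` makes immediate backtracking more likely.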

### Calling Attention to Passages for Biomedical Question Answering

Question answering can be described as retrieving relevant information for questions expressed in natural language, possibly also generating a natural language answer. This paper presents a pipeline for document and passage retrieval for biomedical question answering built around a new variant of the DeepRank network model in which the recursive layer is replaced by a self-attention layer combined with a weighting mechanism. This adaptation halves the total number of parameters and makes the network more suited for identifying the relevant passages in each document. The overall retrieval system was evaluated on the BioASQ tasks 6 and 7, achieving similar retrieval performance when compared to more complex network architectures.

Tiago Almeida, Sérgio Matos

### Neural Embedding-Based Metrics for Pre-retrieval Query Performance Prediction

Query Performance Prediction (QPP) is concerned with estimating the effectiveness of a query within the context of a retrieval model. It allows for operations such as query routing and segmentation, leading to improved retrieval performance. Pre-retrieval QPP methods are oblivious to the performance of the retrieval model as they predict query difficulty prior to observing the set of documents retrieved for the query. Since neural embedding-based models are showing wider adoption in the Information Retrieval (IR) community, we propose a set of pre-retrieval QPP metrics based on the properties of pre-trained neural embeddings and show that such metrics are more effective for query performance prediction compared to the widely known QPP metrics such as SCQ, PMI and SCS. We report our findings based on Robust04, ClueWeb09 and Gov2 corpora and their associated TREC topics.

Negar Arabzadeh, Fattane Zarrinkalam, Jelena Jovanovic, Ebrahim Bagheri

### A Latent Model for Ad Hoc Table Retrieval

The ad hoc table retrieval task is concerned with satisfying a query with a ranked list of tables. While there are strong baselines in the literature that exploit learning to rank and semantic matching techniques, there is still a set of hard queries that are difficult for these baseline methods to address. We find that such hard queries are those whose constituting tokens (i.e., terms or entities) are not fully or partially observed in the relevant tables. We focus on proposing a latent factor model to address such hard queries. Our proposed model factorizes the token-table co-occurrence matrix into two low dimensional latent factor matrices that can be used for measuring table and query similarity even if no shared tokens exist between them. We find that the variation of our proposed model that considers keywords provides statistically significant improvement over three strong baselines in terms of NDCG and ERR.

Ebrahim Bagheri, Feras Al-Obeidat

### Hybrid Semantic Recommender System for Chemical Compounds

Recommending Chemical Compounds of interest to a particular researcher is a poorly explored field. The few existing datasets with information about the preferences of the researchers use implicit feedback. The lack of Recommender Systems in this particular field presents a challenge for the development of new recommendation models. In this work, we propose a Hybrid recommender model for recommending Chemical Compounds. The model integrates collaborative-filtering algorithms for implicit feedback (Alternating Least Squares (ALS) and Bayesian Personalized Ranking (BPR)) and semantic similarity between the Chemical Compounds in the ChEBI ontology (ONTO). We evaluated the model in an implicit dataset of Chemical Compounds, CheRM. The Hybrid model was able to improve the results of state-of-the-art collaborative-filtering algorithms, especially for Mean Reciprocal Rank, with an increase of 6.7% when comparing the collaborative-filtering ALS and the Hybrid ALS_ONTO.

Márcia Barros, André Moitinho, Francisco M. Couto

### Assessing the Impact of OCR Errors in Information Retrieval

A significant amount of the textual content available on the Web is stored in PDF files. These files are typically converted into plain text before they can be processed by information retrieval or text mining systems. Automatic conversion typically introduces various errors, especially if OCR is needed. In this empirical study, we simulate OCR errors and investigate the impact that misspelled words have on retrieval accuracy. In order to quantify such impact, errors were systematically inserted at varying rates in an initially clean IR collection. Our results showed that significant impacts are noticed starting at a 5% error rate. Furthermore, stemming has proven to make systems more robust to errors.

Guilherme Torresan Bazzo, Gustavo Acauan Lorentz, Danny Suarez Vargas, Viviane P. Moreira
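
The idea of inserting errors at a controlled rate can be sketched as follows. This is only an illustration under stated assumptions: the error rate is applied per word, each corrupted word receives a single random lowercase-letter substitution, and the function name and sample text are hypothetical. The abstract does not describe the paper's actual error model, which may differ.

```python
import random
import string

def inject_errors(text, error_rate, seed=0):
    """Corrupt roughly `error_rate` of the words with one character substitution."""
    rng = random.Random(seed)
    corrupted = []
    for word in text.split():
        if rng.random() < error_rate:
            pos = rng.randrange(len(word))
            # Pick a replacement letter that differs from the original character.
            choices = [c for c in string.ascii_lowercase if c != word[pos]]
            word = word[:pos] + rng.choice(choices) + word[pos + 1:]
        corrupted.append(word)
    return " ".join(corrupted)

# Hypothetical clean text; an error rate of 0.05 corresponds to the 5% level
# mentioned in the abstract.
clean = "information retrieval systems index plain text extracted from pdf files"
noisy = inject_errors(clean, error_rate=0.05)
```

Running the same retrieval experiment on collections corrupted at increasing rates (for example 0.01, 0.05, 0.10) then gives one effectiveness measurement per error level.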

### Towards Query Logs for Privacy Studies: On Deriving Search Queries from Questions

Detailed query histories often contain a precise picture of a person’s life, including sensitive and personally identifiable information. As sanitization of such logs is an unsolved research problem, commercial Web search engines that possess large datasets of this kind at their disposal refrain from disseminating them to the wider research community. Ironically, studies examining privacy in search often require detailed search logs with user profiles. This paper builds on an observation that information needs are also expressed in the form of questions in online Community Question Answering (CQA) communities. We take a step towards understanding the process of formulating queries from questions to form a basis for automatic derivation of search logs from CQA forums. Specifically, we sample natural language (NL) questions spanning diverse themes from the StackExchange platform, and conduct a large-scale conversion experiment where crowdworkers submit search queries they would use when looking for equivalent information. We also release a dataset of 7,000 question-query pairs from our study.

Asia J. Biega, Jana Schmidt, Rishiraj Saha Roy

### Machine-Actionable Data Management Plans: A Knowledge Retrieval Approach to Automate the Assessment of Funders’ Requirements

Funding bodies and other policy-makers are increasingly concerned with Research Data Management (RDM). The Data Management Plan (DMP) is one of the tools available to perform RDM tasks; however, it is not a perfect concept. The Machine-Actionable Data Management Plan (maDMP) is a concept that aims to make the DMP interoperable, automated and increasingly standardised. In this paper we showcase that through the usage of semantic technologies, it is possible to both express and exploit the features of the maDMP. In particular, we focus on showing how a maDMP formalised as an ontology can be used to automate the assessment of a funder’s requirements for a given organisation.

João Cardoso, Diogo Proença, José Borbinha

### Session-Based Path Prediction by Combining Local and Global Content Preferences

Session-based future page prediction is important for online web experiences: it helps to understand user behavior, pre-fetch future content, and create future experiences for users. While webpages visited by the user in the current session capture the user’s local preferences, in this work we show how the global content preferences at the given instant can assist in this task. We present DRS-LaG, a Deep Reinforcement Learning System, based on Local and Global preferences. We capture these global content preferences by tracking a key analytics KPI, the number of views. The problem is formulated using an agent which predicts the next page to be visited by the user, based on the historic webpage content and analytics. In an offline setting, we show how the model can be used for predicting the next webpage that the user visits. The online evaluation shows how this framework can be deployed on a website for dynamic adaptation of web experiences, based on both local and global preferences.

Kushal Chawla, Niyati Chhaya

### Unsupervised Ensemble of Ranking Models for News Comments Using Pseudo Answers

Ranking comments on an online news service is a practically important task, and thus there have been many studies on this task. Although ensemble techniques are widely known to improve the performance of models, there is little research on ensembles of neural ranking models. In this paper, we investigate how to improve the performance on the comment-ranking task by using unsupervised ensemble methods. We propose a new hybrid method composed of an output selection method and a typical averaging method. Our method uses a pseudo answer represented by the average of multiple model outputs. The pseudo answer is used to evaluate multiple model outputs via ranking evaluation metrics, and the results are used to select and weight the models. Experimental results on the comment-ranking task show that our proposed method outperforms several ensemble baselines, including a supervised one.

Soichiro Fujita, Hayato Kobayashi, Manabu Okumura
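
The pseudo-answer idea described above can be sketched roughly as follows. This is an illustrative reading of the abstract, not the authors' implementation: the use of NDCG@k as the ranking metric, the `keep` cut-off for model selection, weighting by the metric value, and all function names are assumptions. Scores are assumed to be non-negative and not all zero.

```python
import math

def ndcg(pred_scores, gains, k):
    """NDCG@k of the ranking induced by pred_scores, judged against gains."""
    order = sorted(range(len(gains)), key=lambda i: pred_scores[i], reverse=True)
    ideal = sorted(gains, reverse=True)
    dcg = sum(gains[i] / math.log2(r + 2) for r, i in enumerate(order[:k]))
    idcg = sum(g / math.log2(r + 2) for r, g in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

def pseudo_answer_ensemble(model_outputs, k=3, keep=2):
    """Combine per-item scores from several models without any labels."""
    n_models = len(model_outputs)
    n_items = len(model_outputs[0])
    # Pseudo answer: plain average of all model outputs for each item.
    pseudo = [sum(m[i] for m in model_outputs) / n_models for i in range(n_items)]
    # Evaluate every model against the pseudo answer with a ranking metric.
    quality = [ndcg(m, pseudo, k) for m in model_outputs]
    # Select the `keep` best models and weight them by their metric value.
    best = sorted(range(n_models), key=lambda j: quality[j], reverse=True)[:keep]
    total = sum(quality[j] for j in best)
    return [
        sum(quality[j] * model_outputs[j][i] for j in best) / total
        for i in range(n_items)
    ]

# Hypothetical scores from three models for four comments; the third model
# disagrees with the other two.
outputs = [
    [0.9, 0.7, 0.2, 0.1],
    [0.8, 0.6, 0.3, 0.2],
    [0.1, 0.2, 0.9, 0.8],
]
combined = pseudo_answer_ensemble(outputs)
```

In this toy example the disagreeing model ranks poorly against the pseudo answer and is dropped, so the combined scores follow the two agreeing models.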

### Irony Detection in a Multilingual Context

This paper proposes the first multilingual (French, English and Arabic) and multicultural (Indo-European languages vs. less culturally close languages) irony detection system. We employ both feature-based models and neural architectures using monolingual word representation. We compare the performance of these systems with state-of-the-art systems to identify their capabilities. We show that these monolingual models trained separately on different languages using multilingual word representation or text-based features can open the door to irony detection in languages that lack annotated data for irony.

Bilal Ghanem, Jihen Karoui, Farah Benamara, Paolo Rosso, Véronique Moriceau

### Document Network Projection in Pretrained Word Embedding Space

We present Regularized Linear Embedding (RLE), a novel method that projects a collection of linked documents (e.g., citation network) into a pretrained word embedding space. In addition to the textual content, we leverage a matrix of pairwise similarities providing complementary information (e.g., the network proximity of two documents in a citation graph). We first build a simple word vector average for each document, and we use the similarities to alter this average representation. The document representations can help to solve many information retrieval tasks, such as recommendation, classification and clustering. We demonstrate that our approach outperforms or matches existing document network embedding methods on node classification and link prediction tasks. Furthermore, we show that it helps identify relevant keywords to describe document classes.

Antoine Gourru, Adrien Guille, Julien Velcin, Julien Jacques

### Supervised Learning Methods for Diversification of Image Search Results

We adopt a supervised learning framework, namely R-LTR [17], to diversify image search results, and extend it in various ways. Our experiments show that the adopted and proposed variants are superior to two well-known baselines, with relative gains up to 11.4%.

Burak Goynuk, Ismail Sengor Altingovde

### ANTIQUE: A Non-factoid Question Answering Benchmark

Considering the widespread use of mobile and voice search, answer passage retrieval for non-factoid questions plays a critical role in modern information retrieval systems. Despite the importance of the task, the community still feels the significant lack of large-scale non-factoid question answering collections with real questions and comprehensive relevance judgments. In this paper, we develop and release a collection of 2,626 open-domain non-factoid questions from a diverse set of categories. The dataset, called ANTIQUE, contains 34k manual relevance annotations. The questions were asked by real users in a community question answering service, i.e., Yahoo! Answers. Relevance judgments for all the answers to each question were collected through crowdsourcing. To facilitate further research, we also include a brief analysis of the data as well as baseline results on both classical and neural IR models.

### Neural Query-Biased Abstractive Summarization Using Copying Mechanism

This paper deals with the query-biased summarization task. Conventional non-neural network-based approaches have achieved better performance by primarily including the words overlapping between the source and the query in the summary. However, recurrent neural network (RNN)-based approaches do not explicitly model this phenomenon. Therefore, we model an RNN-based query-biased summarizer to primarily include the overlapping words in the summary, using a copying mechanism. Experimental results, in terms of both automatic evaluation with ROUGE and manual evaluation, show that the strategy to include the overlapping words also works well for neural query-biased summarizers.

Tatsuya Ishigaki, Hen-Hsen Huang, Hiroya Takamura, Hsin-Hsi Chen, Manabu Okumura

### Distant Supervision for Extractive Question Summarization

Questions are often lengthy and difficult to understand because they tend to contain peripheral information. Previous work relies on costly human-annotated data or question-title pairs. In this work, we propose a distant supervision framework that can train a question summarizer without annotation costs or question-title pairs, where sentences are automatically annotated by means of heuristic rules. The key idea is that a single-sentence question tends to have a summary-like property. We empirically show that our models trained on the framework perform competitively with respect to supervised models without the requirement of a costly human-annotated dataset.

Tatsuya Ishigaki, Kazuya Machida, Hayato Kobayashi, Hiroya Takamura, Manabu Okumura

### Text-Image-Video Summary Generation Using Joint Integer Linear Programming

Automatically generating a summary for asynchronous data can help users to keep up with the rapid growth of multi-modal information on the Internet. However, the current multi-modal systems usually generate summaries composed of text and images. In this paper, we propose a novel research problem of text-image-video summary generation (TIVS). We first develop a multi-modal dataset containing text documents, images and videos. We then propose a novel joint integer linear programming multi-modal summarization (JILP-MMS) framework. We report the performance of our model on the developed dataset.

### Domain Adaptation via Context Prediction for Engineering Diagram Search

Effective search for engineering diagram images in larger collections is challenging because most existing feature extraction models are pre-trained on natural image data rather than diagrams. Surprisingly, we observe through experiments that even in-domain training with standard unsupervised representation learning techniques leads to poor results. We argue that, because of their structured nature, diagram images require more specially-tailored learning objectives. We propose a new method for unsupervised adaptation of out-of-domain feature extractors that asks the model to reason about spatial context. Specifically, we fine-tune a pre-trained image encoder by requiring it to correctly predict the relative orientation between pairs of nearby image regions. Experiments on the recently released Ikea Diagram Dataset show that our proposed method leads to substantial improvements on a downstream search task, more than doubling recall for certain query categories in the dataset.

Harsh Jhamtani, Taylor Berg-Kirkpatrick

### Crowdsourcing Truthfulness: The Impact of Judgment Scale and Assessor Bias

News content can sometimes be misleading and influence users’ decision making processes (e.g., voting decisions). Quantitatively assessing the truthfulness of content becomes key, but it is often challenging and thus done by experts. In this work we look at how experts and non-experts assess the truthfulness of content by focusing on the effect of the adopted judgment scale and of assessors’ own bias on the judgments they perform. Our results indicate a clear effect of the assessors’ political background on their judgments, where they tend to trust content which is aligned to their own belief, even if experts have marked it as false. Crowd assessors also seem to have a preference towards coarse-grained scales, as they tend to use a few extreme values rather than the full breadth of fine-grained scales.

David La Barbera, Kevin Roitero, Gianluca Demartini, Stefano Mizzaro, Damiano Spina

### Novel and Diverse Recommendations by Leveraging Linear Models with User and Item Embeddings

Nowadays, item recommendation is an increasing concern for many companies. Users tend to be more reactive than proactive for solving information needs. Recommendation accuracy became the most studied aspect of the quality of the suggestions. However, novel and diverse suggestions also contribute to user satisfaction. Unfortunately, it is common to harm those two aspects when optimizing recommendation accuracy. In this paper, we present EER, a linear model for the top-N recommendation task, which takes advantage of user and item embeddings for improving novelty and diversity without harming accuracy.

Alfonso Landin, Javier Parapar, Álvaro Barreiro

### A Multi-task Approach to Open Domain Suggestion Mining Using Language Model for Text Over-Sampling

Consumer reviews online may contain suggestions useful for improving commercial products and services. Mining suggestions is challenging due to the absence of large labeled and balanced datasets. Furthermore, most prior studies attempting to mine suggestions have focused on a single domain, such as Hotel or Travel. In this work, we introduce a novel over-sampling technique to address the problem of class imbalance, and propose a multi-task deep learning approach for mining suggestions from multiple domains. Experimental results on a publicly available dataset show that our over-sampling technique, coupled with the multi-task framework, outperforms state-of-the-art open domain suggestion mining models in terms of the F-1 measure and AUC.

Maitree Leekha, Mononito Goswami, Minni Jain

### MedLinker: Medical Entity Linking with Neural Representations and Dictionary Matching

Progress in the field of Natural Language Processing (NLP) has been closely followed by applications in the medical domain. Recent advancements in Neural Language Models (NLMs) have transformed the field and are currently motivating numerous works exploring their application in different domains. In this paper, we explore how NLMs can be used for Medical Entity Linking with the recently introduced MedMentions dataset, which presents two major challenges: (1) a large target ontology of over 2M concepts, and (2) low overlap between concepts in train, validation and test sets. We introduce a solution, MedLinker, that addresses these issues by leveraging specialized NLMs with Approximate Dictionary Matching, and show that it performs competitively on semantic type linking, while improving the state-of-the-art on the more fine-grained task of concept linking (+4 F1 on MedMentions main task).

Daniel Loureiro, Alípio Mário Jorge

### Ranking Significant Discrepancies in Clinical Reports

Medical errors are a major public health concern and a leading cause of death worldwide. Many healthcare centers and hospitals use reporting systems where medical practitioners write a preliminary medical report and the report is later reviewed, revised, and finalized by a more experienced physician. The revisions range from stylistic to corrections of critical errors or misinterpretations of the case. Due to the large quantity of reports written daily, it is often difficult to manually and thoroughly review all the finalized reports to find such errors and learn from them. To address this challenge, we propose a novel ranking approach, consisting of textual and ontological overlaps between the preliminary and final versions of reports. The approach learns to rank the reports based on the degree of discrepancy between the versions. This allows medical practitioners to easily identify and learn from the reports in which their interpretation most substantially differed from that of the attending physician (who finalized the report). This is a crucial step towards uncovering potential errors and helping medical practitioners to learn from such errors, thus improving patient-care in the long run. We evaluate our model on a dataset of radiology reports and show that our approach outperforms both previously-proposed approaches and more recent language models by 4.5% to 15.4%.

Sean MacAvaney, Arman Cohan, Nazli Goharian, Ross Filice

### Teaching a New Dog Old Tricks: Resurrecting Multilingual Retrieval Using Zero-Shot Learning

While billions of non-English speaking users rely on search engines every day, the problem of ad-hoc information retrieval is rarely studied for non-English languages. This is primarily due to a lack of data sets that are suitable to train ranking algorithms. In this paper, we tackle the lack of data by leveraging pre-trained multilingual language models to transfer a retrieval system trained on English collections to non-English queries and documents. Our model is evaluated in a zero-shot setting, meaning that we use it to predict relevance scores for query-document pairs in languages never seen during training. Our results show that the proposed approach can significantly outperform unsupervised retrieval techniques for Arabic, Chinese Mandarin, and Spanish. We also show that augmenting the English training collection with some examples from the target language can sometimes improve performance.

Sean MacAvaney, Luca Soldaini, Nazli Goharian

### Semi-supervised Extractive Question Summarization Using Question-Answer Pairs

Neural extractive summarization methods often require much labeled training data, for which headlines or lead summaries of news articles can sometimes be used. Such directly useful summaries are not always available, however, especially for user-generated content, such as questions posted on community question answering services. In this paper, we address an extractive summarization (i.e., headline extraction) task for such questions as a case study and consider how to alleviate the problem by using question-answer pairs, instead of missing-headline pairs. To this end, we propose a framework to examine how to use such unlabeled paired data from the viewpoint of training methods. Experimental results show that multi-task training performs well with undersampling and distant supervision.

Kazuya Machida, Tatsuya Ishigaki, Hayato Kobayashi, Hiroya Takamura, Manabu Okumura

### Utilizing Temporal Psycholinguistic Cues for Suicidal Intent Estimation

Temporal psycholinguistics can play a crucial role in studying expressions of suicidal intent on social media. Current methods are limited in how they leverage contextual psychological cues from online user communities. This work embarks on a novel direction to explore historical activities of users and homophily networks formed between Twitter users for extracting suicidality trends. Empirical evidence proves the advantages of incorporating historical user profiling and temporal graph convolutional modeling for automated detection of suicidal connotations on Twitter.

Puneet Mathur, Ramit Sawhney, Shivang Chopra, Maitree Leekha, Rajiv Ratn Shah

### PMD: An Optimal Transportation-Based User Distance for Recommender Systems

Collaborative filtering predicts a user’s preferences by aggregating ratings from similar users and thus the user similarity (or distance) measure is key to good performance. Existing similarity measures either consider only the co-rated items for a pair of users (but co-rated items are rare in real-world sparse datasets), or try to utilize the non-co-rated items via some heuristics. We propose a novel user distance measure, called Preference Mover’s Distance (PMD), based on the optimal transportation theory. PMD exploits all ratings made by each user and works even if users do not share co-rated items at all. In addition, PMD is a metric and has favorable properties such as triangle inequality and zero self-distance. Experimental results show that PMD achieves superior recommendation accuracy compared with the state-of-the-art similarity measures, especially on highly sparse datasets.

Yitong Meng, Xinyan Dai, Xiao Yan, James Cheng, Weiwen Liu, Jun Guo, Benben Liao, Guangyong Chen

### On Biomedical Named Entity Recognition: Experiments in Interlingual Transfer for Clinical and Social Media Texts

Although deep neural networks yield state-of-the-art performance in biomedical named entity recognition (bioNER), much research shares one limitation: models are usually trained and evaluated on English texts from a single domain. In this work, we present a fine-grained evaluation intended to understand the efficiency of multilingual BERT-based models for bioNER of drug and disease mentions across two domains in two languages, namely clinical data and user-generated texts on drug therapy in English and Russian. We investigate the role of transfer learning (TL) strategies between four corpora to reduce the number of examples that have to be manually annotated. Evaluation results demonstrate that multi-BERT shows the best transfer capabilities in the zero-shot setting when training and test sets are either in the same language or in the same domain. TL reduces the amount of labeled data needed to achieve high performance on three out of four corpora: pretrained models reach 98–99% of the full dataset performance on both types of entities after training on 10–25% of sentences. We demonstrate that pretraining on data with one or both types of transfer can be effective.

Zulfat Miftahutdinov, Ilseyar Alimova, Elena Tutubalina

### SlideImages: A Dataset for Educational Image Classification

In the past few years, convolutional neural networks (CNNs) have achieved impressive results in computer vision tasks, which, however, mainly focus on photos with natural scene content. In contrast, non-sensor-derived images such as illustrations, data visualizations, figures, etc. are typically used to convey complex information or to explore large datasets. However, these kinds of images have received little attention in computer vision. CNNs and similar techniques use large volumes of training data. Currently, many document analysis systems are trained in part on scene images due to the lack of large datasets of educational image data. In this paper, we address this issue and present SlideImages, a dataset for the task of classifying educational illustrations. SlideImages contains training data collected from various sources, e.g., Wikimedia Commons and the AI2D dataset, and test data collected from educational slides. We have reserved all the actual educational images as a test dataset in order to ensure that the approaches using this dataset generalize well to new educational images, and potentially other domains. Furthermore, we present a baseline system using a standard deep neural architecture and discuss dealing with the challenge of limited training data.

David Morris, Eric Müller-Budack, Ralph Ewerth

### Rethinking Query Expansion for BERT Reranking

Recent studies have shown promising results of using BERT for Information Retrieval with its advantages in understanding the text content of documents and queries. Compared to short keyword queries, higher accuracy of BERT was observed on long, natural language queries, demonstrating BERT’s ability to extract rich information from complex queries. These results show the potential of using query expansion to generate better queries for BERT-based rankers. In this work, we explore BERT’s sensitivity to the addition of structure and concepts. We find that traditional word-based query expansion is not entirely applicable, and provide insight into methods that produce better experimental results.

Ramith Padaki, Zhuyun Dai, Jamie Callan

### Personalized Video Summarization Based Exclusively on User Preferences

We propose a recommender system to detect personalized video summaries that make visual content interesting for the subjective criteria of the user. In order to provide accurate video summarization, the video segmentation provided by the users and the features of the video segments’ duration are combined using a Synthetic Coordinate based Recommendation system.

### SentiInc: Incorporating Sentiment Information into Sentiment Transfer Without Parallel Data

Sentiment-to-sentiment transfer involves changing the sentiment of the given text while preserving the underlying information. In this work, we present a model, SentiInc, for sentiment-to-sentiment transfer using unpaired mono-sentiment data. Existing sentiment-to-sentiment transfer models ignore the valuable sentiment-specific details already present in the text. We address this issue by providing a simple framework for encoding sentiment-specific information in the target sentence while preserving the content information. This is done by incorporating a sentiment-based loss in the back-translation based style transfer. Extensive experiments over the Yelp dataset show that SentiInc outperforms state-of-the-art methods by a margin as large as $$\sim$$11% in G-score. The results also demonstrate that our model produces sentiment-accurate and information-preserved sentences.

Kartikey Pant, Yash Verma, Radhika Mamidi

### Dualism in Topical Relevance

There are several concepts whose interpretation and meaning are defined through their binary opposition with other concepts. To this end, in this paper we elaborate on the idea of leveraging the available antonyms of the original query terms for eventually producing an answer which provides a better overview of the related conceptual and information space. Specifically, we sketch a method in which antonyms are used for producing dual queries, which can in turn be exploited for defining a multi-dimensional topical relevance based on the antonyms. We motivate this direction by providing examples and by conducting a preliminary evaluation that shows its importance to specific users.

### Keyphrase Extraction as Sequence Labeling Using Contextualized Embeddings

In this paper, we formulate keyphrase extraction from scholarly articles as a sequence labeling task solved using a BiLSTM-CRF, where the words in the input text are represented using deep contextualized embeddings. We evaluate the proposed architecture using both contextualized and fixed word embedding models on three different benchmark datasets, and compare with existing popular unsupervised and supervised techniques. Our results quantify the benefits of: (a) using contextualized embeddings over fixed word embeddings; (b) using a BiLSTM-CRF architecture with contextualized word embeddings over fine-tuning the contextualized embedding model directly; and (c) using domain-specific contextualized embeddings (SciBERT). Through error analysis, we also provide some insights into why particular models work better than the others. Lastly, we present a case study where we analyze different self-attention layers of the two best models (BERT and SciBERT) to better understand their predictions.

Dhruva Sahrawat, Debanjan Mahata, Haimin Zhang, Mayank Kulkarni, Agniv Sharma, Rakesh Gosangi, Amanda Stent, Yaman Kumar, Rajiv Ratn Shah, Roger Zimmermann
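
As a rough illustration of the sequence-labeling formulation described in the abstract above, the sketch below converts a tokenized text and its gold keyphrases into one label per token, which is the kind of target a tagger such as a BiLSTM-CRF would be trained to predict. The BIO tagging scheme, the function name `bio_labels`, and the example sentence are assumptions made for illustration; the abstract does not specify them.

```python
from typing import List, Sequence


def bio_labels(tokens: Sequence[str], keyphrases: Sequence[Sequence[str]]) -> List[str]:
    """Label each token B (begins a keyphrase), I (inside one), or O (outside)."""
    lowered = [t.lower() for t in tokens]
    labels = ["O"] * len(tokens)
    for phrase in keyphrases:
        target = [w.lower() for w in phrase]
        n = len(target)
        if n == 0:
            continue
        for start in range(len(lowered) - n + 1):
            # Only label a span that matches the phrase and is not already labeled.
            span_is_free = all(lab == "O" for lab in labels[start:start + n])
            if lowered[start:start + n] == target and span_is_free:
                labels[start] = "B"
                for i in range(start + 1, start + n):
                    labels[i] = "I"
    return labels


# Hypothetical example sentence and gold keyphrases.
tokens = "We study keyphrase extraction with contextualized embeddings".split()
gold = [["keyphrase", "extraction"], ["contextualized", "embeddings"]]
labels = bio_labels(tokens, gold)
```

Each token's contextualized embedding would then be paired with its label during training; that part is omitted here because it depends on the embedding model used.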

### Easing Legal News Monitoring with Learning to Rank and BERT

While ranking approaches have made rapid advances in Web search, systems that cater to the complex information needs in professional search tasks are not widely developed. Common solutions typically rely on dedicated search strategies backed by ad-hoc retrieval models. In this paper we present a legal search problem where professionals monitor news articles with constant queries on a periodic basis. Firstly, we demonstrate the effectiveness of using traditional retrieval models against the Boolean search of documents in chronological order. In an attempt to capture the complex information needs of users, a learning to rank approach is adopted with user specified relevance criteria as features. This approach, however, only achieves mediocre results compared to the traditional models. Nevertheless, we find that by fine-tuning a contextualised language model (e.g. BERT), significantly improved retrieval performance can be achieved, providing a flexible solution to satisfying complex information needs without explicit feature engineering.

Luis Sanchez, Jiyin He, Jarana Manotumruksa, Dyaa Albakour, Miguel Martinez, Aldo Lipani

### Generating Query Suggestions for Cross-language and Cross-terminology Health Information Retrieval

Medico-scientific concepts are not easily understood by laypeople, who frequently use lay synonyms. For this reason, strategies that help users formulate health queries are essential. Health Suggestions is an existing extension for Google Chrome that provides suggestions in lay and medico-scientific terminologies, both in English and Portuguese. This work proposes, evaluates, and compares further strategies for generating suggestions based on the initial consumer query, using multi-concept recognition and the Unified Medical Language System (UMLS). The evaluation was done with an English and a Portuguese test collection, considering as baseline the suggestions initially provided by Health Suggestions. Given the importance of understandability, we used measures that combine relevance and understandability, namely, uRBP and uRBPgr. Our best method merges the Consumer Health Vocabulary (CHV)-preferred expression for each concept identified in the initial query for lay suggestions and the UMLS-preferred expressions for medico-scientific suggestions. Multi-concept recognition was critical for this improvement.

Paulo Miguel Santos, Carla Teixeira Lopes

### Identifying Notable News Stories

The volume of news content has increased significantly in recent years and systems to process and deliver this information in an automated fashion at scale are becoming increasingly prevalent. One critical component that is required in such systems is a method to automatically determine how notable a certain news story is, in order to prioritize these stories during delivery. One way to do so is to compare each story in a stream of news stories to a notable event. In other words, the problem of detecting notable news can be defined as a ranking task; given a trusted source of notable events and a stream of candidate news stories, we aim to answer the question: “Which of the candidate news stories is most similar to the notable one?”. We employ different combinations of features and learning to rank (LTR) models and gather relevance labels using crowdsourcing. In our approach, we use structured representations of candidate news stories (triples) and we link them to corresponding entities. Our evaluation shows that the features in our proposed method outperform standard ranking methods, and that the trained model generalizes well to unseen news stories.

Antonia Saravanou, Giorgio Stefanoni, Edgar Meij

### BERT for Evidence Retrieval and Claim Verification

We investigate BERT in an evidence retrieval and claim verification pipeline for the task of evidence-based claim verification. To this end, we propose to use two BERT models, one for retrieving evidence sentences supporting or rejecting claims, and another for verifying claims based on the retrieved evidence sentences. To train the BERT retrieval system, we use pointwise and pairwise loss functions and examine the effect of hard negative mining. Our system achieves a new state-of-the-art recall of 87.1 for retrieving evidence sentences from the 50K Wikipedia pages of the FEVER dataset, and ranks second on the leaderboard with a FEVER score of 69.7.

Amir Soleimani, Christof Monz, Marcel Worring
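
To make the distinction between pointwise and pairwise training concrete, here is a minimal sketch of one common form of each loss, plus a simple way to pick hard negatives. The abstract does not state the exact loss definitions, so the cross-entropy and hinge forms, the margin value, and all function names below are assumptions, not the authors' implementation.

```python
import math
from typing import List, Sequence


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pointwise_loss(score: float, label: int) -> float:
    """Binary cross-entropy on a single (claim, sentence) score.

    -log(sigmoid(s)) equals softplus(-s); -log(1 - sigmoid(s)) equals softplus(s).
    """
    return softplus(-score) if label == 1 else softplus(score)


def pairwise_hinge_loss(pos_score: float, neg_score: float, margin: float = 1.0) -> float:
    """Penalize an evidence sentence that does not outscore a negative by the margin."""
    return max(0.0, margin - (pos_score - neg_score))


def hardest_negatives(neg_scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k negatives the current model scores highest."""
    order = sorted(range(len(neg_scores)), key=lambda i: neg_scores[i], reverse=True)
    return order[:k]
```

The pointwise loss judges each sentence in isolation, whereas the pairwise loss only cares about the relative order of an evidence sentence and a non-evidence sentence for the same claim. Selecting the highest-scored negatives is one simple reading of hard negative mining: training concentrates on the non-evidence sentences the model currently confuses with evidence.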

### BiOnt: Deep Learning Using Multiple Biomedical Ontologies for Relation Extraction

Successful biomedical relation extraction can provide evidence to researchers and clinicians about possible unknown associations between biomedical entities, advancing the current knowledge we have about those entities and their inherent mechanisms. Most biomedical relation extraction systems do not resort to external sources of knowledge, such as domain-specific ontologies. However, using deep learning methods, along with biomedical ontologies, has been recently shown to effectively advance the biomedical relation extraction field. To perform relation extraction, our deep learning system, BiOnt, employs four types of biomedical ontologies, namely, the Gene Ontology, the Human Phenotype Ontology, the Human Disease Ontology, and the Chemical Entities of Biological Interest, regarding gene-products, phenotypes, diseases, and chemical compounds, respectively. We tested our system with three data sets that represent three different types of relations of biomedical entities. BiOnt achieved, in F-score, an improvement of 4.93 percentage points for drug-drug interactions (DDI corpus), 4.99 percentage points for phenotype-gene relations (PGR corpus), and 2.21 percentage points for chemical-induced disease relations (BC5CDR corpus), relative to the state of the art. The code supporting this system is available at https://github.com/lasigeBioTM/BiONT .

Diana Sousa, Francisco M. Couto

### On the Temporality of Priors in Entity Linking

Entity linking is a fundamental task in natural language processing which deals with the lexical ambiguity in texts. An important component in entity linking approaches is the mention-to-entity prior probability. Even though there is a large number of works in entity linking, the existing approaches do not explicitly consider the time aspect, specifically the temporality of an entity’s prior probability. We posit that this prior probability is temporal in nature and affects the performance of entity linking systems. In this paper we systematically study the effect of the prior on the entity linking performance over the temporal validity of both texts and KBs.

Renato Stoffalette João
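
The mention-to-entity prior discussed in the abstract above is often estimated as the fraction of times a mention string links to a given entity. The sketch below computes that estimate separately per year, so that the drift of a prior over time becomes visible. The maximum-likelihood estimator, the `(year, mention, entity)` input format, and the toy link counts are assumptions for illustration only and are not taken from the paper.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

Priors = Dict[int, Dict[str, Dict[str, float]]]


def priors_by_year(links: Iterable[Tuple[int, str, str]]) -> Priors:
    """Estimate P(entity | mention) per year from (year, mention, entity) links."""
    counts = defaultdict(lambda: defaultdict(Counter))
    for year, mention, entity in links:
        counts[year][mention][entity] += 1

    priors: Priors = {}
    for year, by_mention in counts.items():
        priors[year] = {}
        for mention, entity_counts in by_mention.items():
            total = sum(entity_counts.values())
            priors[year][mention] = {e: c / total for e, c in entity_counts.items()}
    return priors


# Made-up link counts: the dominant sense of the mention changes between years.
toy_links = (
    [(2001, "Amazon", "Amazon_River")] * 3
    + [(2001, "Amazon", "Amazon_(company)")] * 1
    + [(2016, "Amazon", "Amazon_River")] * 1
    + [(2016, "Amazon", "Amazon_(company)")] * 3
)
priors = priors_by_year(toy_links)
```

With these toy counts, a linker that uses the 2001 prior would favor a different entity for the same mention than one that uses the 2016 prior, which is the kind of temporal effect the abstract refers to.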

### Contextualized Embeddings in Named-Entity Recognition: An Empirical Study on Generalization

Contextualized embeddings use unsupervised language model pretraining to compute word representations depending on their context. This is intuitively useful for generalization, especially in Named-Entity Recognition where it is crucial to detect mentions never seen during training. However, standard English benchmarks overestimate the importance of lexical over contextual features because of an unrealistic lexical overlap between train and test mentions. In this paper, we perform an empirical analysis of the generalization capabilities of state-of-the-art contextualized embeddings by separating mentions by novelty and with out-of-domain evaluation. We show that they are particularly beneficial for unseen mentions detection, especially out-of-domain. For models trained on CoNLL03, language model contextualization leads to a +1.2% maximal relative micro-F1 score increase in-domain against +13% out-of-domain on the WNUT dataset (The code is available at https://github.com/btaille/contener ).

Bruno Taillé, Vincent Guigue, Patrick Gallinari

### DAKE: Document-Level Attention for Keyphrase Extraction

Keyphrases provide a concise representation of the topical content of a document and they are helpful in various downstream tasks. Previous approaches for keyphrase extraction model it as a sequence labelling task and use local contextual information to understand the semantics of the input text but they fail when the local context is ambiguous or unclear. We present a new framework to improve keyphrase extraction by utilizing additional supporting contextual information. We retrieve this additional information from other sentences within the same document. To this end, we propose Document-level Attention for Keyphrase Extraction (DAKE), which comprises Bidirectional Long Short-Term Memory networks that capture hidden semantics in text, a document-level attention mechanism to incorporate document level contextual information, gating mechanisms which help to determine the influence of additional contextual information on the fusion with local contextual information, and Conditional Random Fields which capture output label dependencies. Our experimental results on a dataset of research papers show that the proposed model outperforms previous state-of-the-art approaches for keyphrase extraction.

Tokala Yaswanth Sri Sai Santosh, Debarshi Kumar Sanyal, Plaban Kumar Bhowmick, Partha Pratim Das

### Understanding Depression from Psycholinguistic Patterns in Social Media Texts

The World Health Organization reports that half of all mental illnesses begin by the age of 14. Most of these cases go undetected and untreated. The expanding use of social media has the potential to support the early identification of mental health diseases. As data gathered via social media are already digital, they enable faster automatic analysis. In this article we evaluate the impact that psycholinguistic patterns can have on a standard machine learning approach for classifying depressed users based on their writings in an online public forum. We combine psycholinguistic features in a rule-based estimator and we evaluate their impact on this classification problem, along with three other standard classifiers. Our results on the Reddit Self-reported Depression Diagnosis dataset outperform some previously reported works on the same dataset. They underline the importance of extracting psychologically motivated features when processing social media texts with the purpose of studying mental health.

Alina Trifan, Rui Antunes, Sérgio Matos, Jose Luís Oliveira

### Predicting the Size of Candidate Document Set for Implicit Web Search Result Diversification

Implicit result diversification methods exploit the content of the documents in the candidate set, i.e., the initial retrieval results of a query, to obtain a relevant and diverse ranking. As our first contribution, we explore whether recently introduced word embeddings can be exploited for representing documents to improve diversification, and show a positive result. As a second improvement, we propose to automatically predict the size of the candidate set on a per-query basis. Experimental evaluations using our BM25 runs as well as the best-performing ad hoc runs submitted to TREC (2009–2012) show that our approach improves the performance of implicit diversification by up to 5.4% with respect to the initial ranking.

Yasar Baris Ulu, Ismail Sengor Altingovde

### Aspect-Based Academic Search Using Domain-Specific KB

Academic search engines allow scientists to explore related work relevant to a given query. Often, the user is also aware of the aspect for which relevant documents should be retrieved. In such cases, existing search engines can be used by expanding the query with terms describing that aspect. However, this approach does not guarantee good results since plain keyword matches do not always imply relevance. To address this issue, we define and solve a novel academic search task, called aspect-based retrieval, which allows the user to specify the aspect along with the query to retrieve a ranked list of relevant documents. The primary idea is to estimate a language model for the aspect as well as the query using a domain-specific knowledge base and use a mixture of the two to determine the relevance of the article. Our evaluation of the results over the Open Research Corpus dataset shows that our method outperforms keyword-based expansion of the query with the aspect, both with and without relevance feedback.

Prajna Upadhyay, Srikanta Bedathur, Tanmoy Chakraborty, Maya Ramanath
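
The idea of mixing a query language model with an aspect language model can be sketched as a linear interpolation of two unigram distributions, followed by a cross-entropy style document score. This is only a minimal sketch: the interpolation weight `lam`, the additive smoothing constant `eps`, the scoring function, and the hard-coded aspect terms (which stand in for terms that would be derived from a knowledge base) are all assumptions, not the estimation procedure of the paper.

```python
import math
from collections import Counter
from typing import Dict, Sequence


def unigram_lm(tokens: Sequence[str]) -> Dict[str, float]:
    """Maximum-likelihood unigram model over a token sequence."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def mix(query_lm: Dict[str, float], aspect_lm: Dict[str, float], lam: float) -> Dict[str, float]:
    """Linear interpolation: lam * P(w | query) + (1 - lam) * P(w | aspect)."""
    vocab = set(query_lm) | set(aspect_lm)
    return {w: lam * query_lm.get(w, 0.0) + (1.0 - lam) * aspect_lm.get(w, 0.0) for w in vocab}


def score(doc_tokens: Sequence[str], mixed_lm: Dict[str, float], eps: float = 1e-6) -> float:
    """Sum over w of P(w | mixture) * log P(w | document), with additive smoothing."""
    counts = Counter(doc_tokens)
    denom = len(doc_tokens) + eps * len(mixed_lm)
    return sum(p * math.log((counts[w] + eps) / denom) for w, p in mixed_lm.items())


# Hypothetical query, aspect terms, and documents.
query_lm = unigram_lm(["graph", "embedding"])
aspect_lm = unigram_lm(["application", "recommendation"])
mixed = mix(query_lm, aspect_lm, lam=0.5)

doc_a = "graph embedding for recommendation application".split()
doc_b = "graph embedding theory proof".split()
```

In this toy setup `doc_a` scores higher than `doc_b`, because it matches both the query terms and the aspect terms, while `doc_b` matches the query terms only.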

### Dynamic Heterogeneous Graph Embedding Using Hierarchical Attentions

Graph embedding has attracted much research interest. Existing works mainly focus on static homogeneous/heterogeneous networks or dynamic homogeneous networks. However, dynamic heterogeneous networks are more ubiquitous in reality, e.g. social networks, e-commerce networks, citation networks, etc. There is still a lack of research on dynamic heterogeneous graph embedding. In this paper, we propose a novel dynamic heterogeneous graph embedding method using hierarchical attentions (DyHAN) that learns node embeddings leveraging both structural heterogeneity and temporal evolution. We evaluate our method on three real-world datasets. The results show that DyHAN outperforms various state-of-the-art baselines on the link prediction task.

Luwei Yang, Zhibo Xiao, Wen Jiang, Yi Wei, Yi Hu, Hao Wang

### DSR: A Collection for the Evaluation of Graded Disease-Symptom Relations

The effective extraction of ranked disease-symptom relationships is a critical component in various medical tasks, including computer-assisted medical diagnosis or the discovery of unexpected associations between diseases. While existing disease-symptom relationship extraction methods are used as the foundation in the various medical tasks, no collection is available to systematically evaluate the performance of such methods. In this paper, we introduce the Disease-Symptom Relation Collection (dsr-collection), created by five physicians as expert annotators. We provide graded symptom judgments for diseases by differentiating between relevant symptoms and primary symptoms. Further, we provide several strong baselines, based on the methods used in previous studies. The first method is based on word embeddings, and the second on co-occurrences of MeSH-keywords of medical articles. For the co-occurrence method, we propose an adaption in which not only keywords are considered, but also the full text of medical articles. The evaluation on the dsr-collection shows the effectiveness of the proposed adaption in terms of nDCG, precision, and recall.

Markus Zlabinger, Sebastian Hofstätter, Navid Rekabsaz, Allan Hanbury
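
Because the collection described above distinguishes primary from merely relevant symptoms, a graded measure such as nDCG can reward rankings that place primary symptoms first. The sketch below assumes gain values of 2 for primary and 1 for relevant symptoms, a linear gain, and a log2 rank discount; these choices and the placeholder symptom identifiers are assumptions, not the settings reported for the dsr-collection.

```python
import math
from typing import Dict, Sequence


def dcg(gains: Sequence[float]) -> float:
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(ranked: Sequence[str], judgments: Dict[str, int], k: int) -> float:
    """nDCG@k of a ranked symptom list against graded judgments."""
    gains = [judgments.get(symptom, 0) for symptom in ranked[:k]]
    ideal = sorted(judgments.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Placeholder judgments for one disease: 2 = primary symptom, 1 = relevant symptom.
judgments = {"s1": 2, "s2": 1, "s3": 1}
system_ranking = ["s2", "s1", "s4", "s3"]
value = ndcg(system_ranking, judgments, k=4)
```

A ranking that lists `s1` first, followed by `s2` and `s3`, reaches the maximum of 1.0, while the `system_ranking` above is penalized for placing the primary symptom second and an unjudged symptom third.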

### A Web-Based Platform for Mining and Ranking Association Rules

In this demo, we introduce an interactive system, which effectively applies multiple criteria analysis to rank association rules. We first use association rules techniques to explore the correlations between variables in given data (i.e., database and linked data (LD)), and secondly apply multiple criteria analysis (MCA) to select the most relevant rules according to user preferences. The developed system is flexible and allows intuitive creation and execution of different algorithms for an extensive range of advanced data analysis topics. Furthermore, we demonstrate a case study of association rule mining and ranking on road accident data.

### Army ANT: A Workbench for Innovation in Entity-Oriented Search

As entity-oriented search takes the lead in modern search, the need for increasingly flexible tools, capable of motivating innovation in information retrieval research, also becomes more evident. Army ANT is an open source framework that takes a step forward in generalizing information retrieval research, so that modern approaches can be easily integrated in a shared evaluation environment. We present an overview of the system architecture of Army ANT, which has four main abstractions: (i) readers, to iterate over text collections, potentially containing associated entities and triples; (ii) engines, that implement indexing and searching approaches, supporting different retrieval tasks and ranking functions; (iii) databases, to store additional document metadata; and (iv) evaluators, to assess retrieval performance for specific tasks and test collections. We also introduce the command line interface and the web interface, presenting a learn mode as a way to explore, analyze and understand representation and retrieval models, through tracing, score component visualization and documentation.

José Devezas, Sérgio Nunes

### A Search Engine for Police Press Releases to Double-Check the News

Many people have doubts about the factual accuracy of online news, while still trusting the press releases of police departments. To enable an easy corroboration of online news about police-related events, we build a search engine for press releases of police departments. Addressing the German “market”, the search engine takes the URL of a German piece of online news as input and retrieves relevant press releases of the German police. Comparing different query-by-document strategies in a TREC-style evaluation on 105 topics, we show that our system is able to accurately identify relevant press releases if there are any.

Maik Fröbe, Nina Schwanke, Matthias Hagen, Martin Potthast

### Neural-IR-Explorer: A Content-Focused Tool to Explore Neural Re-ranking Results

In this paper we look beyond metrics-based evaluation of Information Retrieval systems, to explore the reasons behind ranking results. We present the content-focused Neural-IR-Explorer, which empowers users to browse through retrieval results and inspect the inner workings and fine-grained results of neural re-ranking models. The explorer includes a categorized overview of the available queries, as well as an individual query result view with various options to highlight semantic connections between query-document pairs. The Neural-IR-Explorer is available at: https://neural-ir-explorer.ec.tuwien.ac.at/ .

Sebastian Hofstätter, Markus Zlabinger, Allan Hanbury

### Revisionista.PT: Uncovering the News Cycle Using Web Archives

In this demo, we present a meta-journalistic tool that reveals post-publication changes in articles of Portuguese online news media. Revisionista.PT can uncover the news cycle of online media, offering a glimpse into an otherwise unknown dynamic edit history. We leverage article snapshots periodically collected by Web archives to reconstruct an approximate timeline of the changes: additions, edits, and corrections. Revisionista.PT is currently tracking changes in about 140,000 articles published by 12 selected news sources and has a user-friendly interface that will be familiar to users of version control systems. In addition, an open source browser extension can be installed by users so that they can be alerted of changes to articles they may be reading. Initial work on this demo was started as an entry submitted into Arquivo.PT’s 2019 Prize, where it received an award for second place.

Flávio Martins, André Mourão

### MathSeer: A Math-Aware Search Interface with Intuitive Formula Editing, Reuse, and Lookup

There has been growing interest in math-aware search engines that support retrieval using both formulas and keywords. An important unresolved issue is the design of search interfaces: for wide adoption, they must be engaging and easy-to-use, particularly for non-experts. The MathSeer interface addresses this with straightforward formula creation, editing, and lookup. Formulas are stored in ‘chips’ created using handwriting, LaTeX, and images. MathSeer sessions are also stored at automatically generated URLs that save all chips and their editing history. To avoid re-entering formulas, chips can be reused, edited, or used in creating other formulas. As users enter formulas, our novel autocompletion facility returns entity cards searchable by formula or entity name, making formulas easy to (re)locate, and descriptions of symbols and notation available before queries are issued.

Gavin Nishizawa, Jennifer Liu, Yancarlos Diaz, Abishai Dmello, Wei Zhong, Richard Zanibbi

### NLPExplorer: Exploring the Universe of NLP Papers

Understanding the current research trends, problems, and their innovative solutions remains a bottleneck due to the ever-increasing volume of scientific articles. In this paper, we propose NLPExplorer, a completely automatic portal for indexing, searching, and visualizing Natural Language Processing (NLP) research volume. NLPExplorer presents interesting insights from papers, authors, venues, and topics. In contrast to previous topic modelling based approaches, we manually curate five coarse-grained non-exclusive topical categories, namely Linguistic Target (Syntax, Discourse, etc.), Tasks (Tagging, Summarization, etc.), Approaches (unsupervised, supervised, etc.), Languages (English, Chinese, etc.) and Dataset types (news, clinical notes, etc.). Some of the novel features include a list of young popular authors, popular URLs and datasets, a list of topically diverse papers and recent popular papers. Also, it provides temporal statistics such as yearwise popularity of topics, datasets, and seminal papers. To facilitate future research and system development, we make all the processed datasets accessible through API calls. The current system is available at http://nlpexplorer.org .

Monarch Parmar, Naman Jain, Pranjali Jain, P. Jayakrishna Sahit, Soham Pachpande, Shruti Singh, Mayank Singh

### Personal Research Assistant for Online Exploration of Historical News

We present a novel environment for exploratory search in large collections of historical newspapers developed as a part of the NewsEye project. In this paper we focus on the intelligent Personal Research Assistant (PRA) component in the environment and the web interface. The PRA is an interactive exploratory engine that combines results of various text analysis tools in an unsupervised fashion to conduct autonomous investigations on the data according to users’ needs. The PRA is freely available online together with some datasets of European historical newspapers. The methods used by the assistant are of potential benefit to other exploratory search applications.

Lidia Pivovarova, Axel Jean-Caurant, Jari Avikainen, Khalid Alnajjar, Mark Granroth-Wilding, Leo Leppänen, Elaine Zosa, Hannu Toivonen

### QISS: An Open Source Image Similarity Search Engine

Qwant Image Similarity Search (QISS) is a multi-lingual image similarity search engine based on dual-path neural networks that embed texts and images into a common feature space where they are easily comparable. Our demonstrator, available at http://research.qwant.com/images , allows real-time searches in a database of approximately 100 million images.

Maxime Portaz, Adrien Nivaggioli, Hicham Randrianarivo, Ilyes Kacher, Sylvain Peyronnet

### EveSense: What Can You Sense from Twitter?

Social media has become a useful source for detecting real-life events. This paper presents an event detection application, EveSense. It detects real-life events and related trending topics from the Twitter stream and allows users to find interesting events that have recently occurred. It uses a novel Dynamic Heartbeat Graph (DHG) approach, which efficiently extracts distinguishing features and performs better than the existing event detection methods. We tested and evaluated the application on three case studies, including a sports event (FA Cup Final) and two political events (Super Tuesday and US Election).

Zafar Saeed, Rabeeh Ayaz Abbasi, Imran Razzak

### CheckThat! at CLEF 2020: Enabling the Automatic Identification and Verification of Claims in Social Media

We describe the third edition of the CheckThat! Lab, which is part of the 2020 Cross-Language Evaluation Forum (CLEF). CheckThat! proposes four complementary tasks and a related task from previous lab editions, offered in English, Arabic, and Spanish. Task 1 asks to predict which tweets in a Twitter stream are worth fact-checking. Task 2 asks to determine whether a claim posted in a tweet can be verified using a set of previously fact-checked claims. Task 3 asks to retrieve text snippets from a given set of Web pages that would be useful for verifying a target tweet’s claim. Task 4 asks to predict the veracity of a target tweet’s claim using a set of potentially-relevant Web pages. Finally, the lab offers a fifth task that asks to predict the check-worthiness of the claims made in English political debates and speeches. CheckThat! features a full evaluation framework. The evaluation is carried out using mean average precision or precision at rank k for ranking tasks, and F$$_1$$ for classification tasks.

Alberto Barrón-Cedeño, Tamer Elsayed, Preslav Nakov, Giovanni Da San Martino, Maram Hasanain, Reem Suwaileh, Fatima Haouari
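
The measures named in the CheckThat! abstract above (precision at rank k, mean average precision, and F1) can be illustrated with a minimal Python sketch. This is not the lab's official scorer: the function names and the boolean input format are assumptions made for illustration, and average precision is computed here under the assumption that every relevant item appears somewhere in the ranking.

```python
from typing import Sequence


def precision_at_k(ranked_relevance: Sequence[bool], k: int) -> float:
    """Fraction of the top-k ranked items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(ranked_relevance[:k]) / k


def average_precision(ranked_relevance: Sequence[bool]) -> float:
    """Mean of precision@i over the ranks i that hold a relevant item.

    Assumption: all relevant items are present in the ranking, so the
    number of hits equals the total number of relevant items.
    """
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(rankings: Sequence[Sequence[bool]]) -> float:
    """Average precision averaged over several queries (rankings)."""
    if not rankings:
        return 0.0
    return sum(average_precision(r) for r in rankings) / len(rankings)


def f1_score(gold: Sequence[bool], predicted: Sequence[bool]) -> float:
    """Harmonic mean of precision and recall for binary labels."""
    tp = sum(g and p for g, p in zip(gold, predicted))
    fp = sum((not g) and p for g, p in zip(gold, predicted))
    fn = sum(g and (not p) for g, p in zip(gold, predicted))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Each ranking is a list of relevance judgments in ranked order, for example `[True, False, True]` for a run whose first and third results are relevant. A real evaluation would typically normalize average precision by the total number of relevant items in the judgments, which can differ from the number retrieved.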

### Shared Tasks on Authorship Analysis at PAN 2020

The paper gives a brief overview of the four shared tasks that are to be organized at the PAN 2020 lab on digital text forensics and stylometry, hosted at the CLEF conference. The tasks include author profiling, celebrity profiling, cross-domain author verification, and style change detection, seeking to advance the state of the art and to evaluate it on new benchmark datasets.

Janek Bevendorff, Bilal Ghanem, Anastasia Giachanou, Mike Kestemont, Enrique Manjavacas, Martin Potthast, Francisco Rangel, Paolo Rosso, Günther Specht, Efstathios Stamatatos, Benno Stein, Matti Wiegmann, Eva Zangerle

### Touché: First Shared Task on Argument Retrieval

Technologies for argument mining and argumentation processing are maturing continuously, giving rise to the idea of retrieving arguments in search scenarios. We introduce Touché, the first lab on Argument Retrieval featuring two subtasks: (1) the retrieval of arguments from a focused debate collection to support argumentative conversations, and (2) the retrieval of arguments from a generic web crawl to answer comparative questions with argumentative results. The goal of this lab is to perform an evaluation of various strategies to retrieve argumentative information from the web content. In this paper, we describe the setting of each subtask: the motivation, the data, and the evaluation methodology.

Alexander Bondarenko, Matthias Hagen, Martin Potthast, Henning Wachsmuth, Meriem Beloucif, Chris Biemann, Alexander Panchenko, Benno Stein

### Introducing the CLEF 2020 HIPE Shared Task: Named Entity Recognition and Linking on Historical Newspapers

Since its introduction some twenty years ago, named entity (NE) processing has become an essential component of virtually any text mining application and has undergone major changes. Recently, two main trends characterise its developments: the adoption of deep learning architectures and the consideration of textual material originating from historical and cultural heritage collections. While the former opens up new opportunities, the latter introduces new challenges with heterogeneous, historical and noisy inputs. Although NE processing tools are increasingly being used in the context of historical documents, their performance is below that on contemporary data and is hardly comparable across studies. In this context, this paper introduces the CLEF 2020 Evaluation Lab HIPE (Identifying Historical People, Places and other Entities) on named entity recognition and linking on diachronic historical newspaper material in French, German and English. Our objective is threefold: strengthening the robustness of existing approaches on non-standard inputs, enabling performance comparison of NE processing on historical texts, and, in the long run, fostering efficient semantic indexing of historical documents in order to support scholarship on digital cultural heritage collections.

Maud Ehrmann, Matteo Romanello, Stefan Bircher, Simon Clematide

### ImageCLEF 2020: Multimedia Retrieval in Lifelogging, Medical, Nature, and Internet Applications

Bogdan Ionescu, Henning Müller, Renaud Péteri, Duc-Tien Dang-Nguyen, Liting Zhou, Luca Piras, Michael Riegler, Pål Halvorsen, Minh-Triet Tran, Mathias Lux, Cathal Gurrin, Jon Chamberlain, Adrian Clark, Antonio Campello, Alba G. Seco de Herrera, Asma Ben Abacha, Vivek Datla, Sadid A. Hasan, Joey Liu, Dina Demner-Fushman, Obioma Pelka, Christoph M. Friedrich, Yashin Dicente Cid, Serge Kozlovski, Vitali Liauchuk, Vassili Kovalev, Raul Berari, Paul Brie, Dimitri Fichou, Mihai Dogariu, Liviu Daniel Stefan, Mihai Gabriel Constantin

### LifeCLEF 2020 Teaser: Biodiversity Identification and Prediction Challenges

Building accurate knowledge of the identity, the geographic distribution and the evolution of species is essential for the sustainable development of humanity, as well as for biodiversity conservation. However, the difficulty of identifying plants and animals in the field is hindering the aggregation of new data and knowledge. Identifying and naming living plants or animals is almost impossible for the general public and is often difficult even for professionals and naturalists. Bridging this gap is a key step towards enabling effective biodiversity monitoring systems. The LifeCLEF campaign, presented in this paper, has been promoting and evaluating advances in this domain since 2011. The 2020 edition proposes four data-oriented challenges related to the identification and prediction of biodiversity: (i) PlantCLEF: cross-domain plant identification based on herbarium sheets, (ii) BirdCLEF: bird species recognition in audio soundscapes, (iii) GeoLifeCLEF: location-based prediction of species based on environmental and occurrence data, and (iv) SnakeCLEF: image-based snake identification.

Alexis Joly, Hervé Goëau, Stefan Kahl, Christophe Botella, Rafael Ruiz De Castaneda, Hervé Glotin, Elijah Cole, Julien Champ, Benjamin Deneu, Maximillien Servajean, Titouan Lorieul, Willem-Pier Vellinga, Fabian-Robert Stöter, Andrew Durso, Pierre Bonnet, Henning Müller

### BioASQ at CLEF2020: Large-Scale Biomedical Semantic Indexing and Question Answering

This paper describes the eighth edition of the BioASQ Challenge, which will run as an evaluation Lab in the context of CLEF2020. The aim of BioASQ is the promotion of systems and methods for highly precise biomedical information access. This is done through the organization of a series of challenges (shared tasks) on large-scale biomedical semantic indexing and question answering, where different teams develop systems that compete on the same demanding benchmark datasets that represent the real information needs of biomedical experts. In order to facilitate this information finding process, the BioASQ challenge introduced two complementary tasks: (a) the automated indexing of large volumes of unlabelled data, primarily scientific articles, with biomedical concepts, (b) the processing of biomedical questions and the generation of comprehensible answers. Rewarding the most competitive systems that outperform the state of the art, BioASQ manages to push the research frontier towards ensuring that the biomedical experts will have direct access to valuable knowledge.

Martin Krallinger, Anastasia Krithara, Anastasios Nentidis, Georgios Paliouras, Marta Villegas

### eRisk 2020: Self-harm and Depression Challenges

This paper describes eRisk, the CLEF lab on early risk prediction on the Internet. eRisk started in 2017 as an attempt to set the experimental foundations of early risk detection. Over the last three editions of eRisk (2017, 2018 and 2019), the lab organized a number of early risk detection challenges oriented to the problems of detecting depression, anorexia and self-harm. We review in this paper the main lessons learned from the past and we discuss our future plans for the 2020 edition.

David E. Losada, Fabio Crestani, Javier Parapar

### Finding Old Answers to New Math Questions: The ARQMath Lab at CLEF 2020

Behrooz Mansouri, Anurag Agarwal, Douglas Oard, Richard Zanibbi

### ChEMU: Named Entity Recognition and Event Extraction of Chemical Reactions from Patents

We introduce a new evaluation lab named ChEMU (Cheminformatics Elsevier Melbourne University), part of the 11th Conference and Labs of the Evaluation Forum (CLEF-2020). ChEMU involves two key information extraction tasks over chemical reactions from patents. Task 1—Named entity recognition—involves identifying chemical compounds as well as their types in context, i.e., to assign the label of a chemical compound according to the role which the compound plays within a chemical reaction. Task 2—Event extraction over chemical reactions—involves event trigger detection and argument recognition. We briefly present the motivations and goals of the ChEMU tasks, as well as resources and evaluation methodology.

Dat Quoc Nguyen, Zenan Zhai, Hiyori Yoshikawa, Biaoyan Fang, Christian Druckenbrodt, Camilo Thorne, Ralph Hoessel, Saber A. Akhondi, Trevor Cohn, Timothy Baldwin, Karin Verspoor

### Living Labs for Academic Search at CLEF 2020

The need for innovation in the field of academic search, and in IR in general, is shown by the stagnating system performance in controlled evaluation campaigns, as demonstrated in TREC and CLEF meta-evaluation studies, as well as in user studies of real scientific information systems and digital libraries. The question of what constitutes relevance in academic search is multi-layered and has driven research communities for years. The Living Labs for Academic Search (LiLAS) workshop aims to inspire discussion on the research and evaluation of academic search systems by strengthening the concept of living labs in the domain of academic search. We want to bring together IR researchers interested in online evaluations of academic search systems and foster knowledge on improving the search for academic resources like literature, research data, and the interlinking between these resources. The employed online evaluation approach, based on a living lab infrastructure, allows a direct connection to real-world academic search systems from the life sciences and the social sciences.

Philipp Schaer, Johann Schaible, Bernd Müller

### CLEF eHealth Evaluation Lab 2020

Hanna Suominen, Liadh Kelly, Lorraine Goeuriot, Martin Krallinger

### Reproducible Online Search Experiments

In the empirical sciences, evidence is commonly manifested by experimental results. However, very often, these findings are not reproducible, hindering scientific progress. Innovations in the field of information retrieval (IR) are mainly driven by experimental results as well. While there are several attempts to assure the reproducibility of offline experiments with standardized test collections, reproducible outcomes of online experiments remain an open issue. This research project will be concerned with the reproducibility of online experiments, including real-world user feedback. In contrast to previous living lab attempts by the IR community, this project has a stronger focus on making IR systems and corresponding results reproducible. The project aims to provide insights concerning key components that affect reproducibility in online search experiments. The outcomes will help to improve the design of reproducible IR online experiments in the future.

Timo Breuer

### Graph-Based Entity-Oriented Search: A Unified Framework in Information Retrieval

Modern search engines have evolved beyond document retrieval. Nowadays, the information needs of the users can be directly satisfied through entity-oriented search, by taking into account the entities that better relate to the query, as opposed to relying exclusively on the best matching terms. Evolving from keyword-based to entity-oriented search poses several challenges, not only regarding the understanding of natural language queries, which are more familiar to the end-user, but also regarding the integration of unstructured documents and structured information sources such as knowledge bases. One opportunity that remains open is the research of unified frameworks for the representation and retrieval of heterogeneous information sources. The doctoral work we present here proposes graph-based models to promote the cooperation between different units of information, in order to maximize the amount of available leads that help the user satisfy an information need.

José Devezas

### Graph Databases for Information Retrieval

Graph models have been deployed in the context of information retrieval for many years. Computations involving the graph structure are often separated from computations related to the base ranking. In recent years, graph data management has been a topic of interest in database research. We propose to deploy graph database management systems to implement existing and novel graph-based models for information retrieval. For this, a unifying mapping from a graph query language to graph-based retrieval models needs to be developed, extending standard graph database operations with functionality for keyword search. We also investigate how data structures and algorithms for ranking should change in the presence of continuous database updates, and how temporal decay can affect ranking when data is continuously updated. Finally, can databases be deployed for efficient two-stage retrieval approaches?

Chris Kamphuis

### Towards a Better Contextualization of Web Contents via Entity-Level Analytics

With the abundance of data and wide access to the internet, a user can be overwhelmed with information. For an average Web user, it is very difficult to identify which information is relevant or irrelevant. Hence, in the era of a continuously evolving Web, the organization and interpretation of Web contents are very important for easy access to relevant information. Many recent advancements in the area of Web content management, such as classification of Web contents, information diffusion, and credibility of information, have been explored based on the text and semantics of the document. In this paper, we propose a purely semantic contextualization of Web contents. We hypothesize that named entities and their types present in a Web document convey substantial semantic information. By extracting this information, we aim to study the reasoning and explanation behind the Web contents or patterns. Furthermore, we also plan to exploit LOD (Linked Open Data) to gain deeper insight into Web contents.

Amit Kumar

### Incremental Approach for Automatic Generation of Domain-Specific Sentiment Lexicon

A sentiment lexicon plays a vital role in lexicon-based sentiment analysis. The lexicon-based method is often preferred because it leads to more explainable answers in comparison with many machine learning-based methods. However, the semantic orientation of a word depends on its domain. Hence, a general-purpose sentiment lexicon may give sub-optimal performance compared with a domain-specific lexicon. However, it is challenging to manually generate a domain-specific sentiment lexicon for each domain. Moreover, it is impractical to generate a complete sentiment lexicon for a domain from a single corpus. To this end, we propose an approach to automatically generate a domain-specific sentiment lexicon using a vector model enriched by weights. Importantly, we propose an incremental approach for updating an existing lexicon for either the same domain or a different domain (domain adaptation). Finally, we discuss how to incorporate sentiment lexicon information in neural models (word embeddings) for better performance.

Shamsuddeen Hassan Muhammad, Pavel Brazdil, Alípio Jorge

### Time-Critical Geolocation for Social Good

Twitter has become an instrumental source of news in emergencies where efficient access, dissemination of information, and immediate reactions are critical. Nevertheless, due to several challenges, the current fully-automated processing methods are not yet mature enough for deployment in real scenarios. In this dissertation, I focus on tackling the lack of context problem by studying automatic geo-location techniques. I specifically aim to study the Location Mention Prediction problem in which the system has to extract location mentions in tweets and pin them on the map. To address this problem, I aim to exploit different techniques such as training neural models, enriching the tweet representation, and studying methods to mitigate the lack of labeled data. I anticipate many downstream applications for the Location Mention Prediction problem such as incident detection, real-time action management during emergencies, and fake news and rumor detection among others.

Reem Suwaileh

### Bibliometric-Enhanced Legal Information Retrieval

This research project addresses user-focused ranking in legal information retrieval (IR). It studies the perception of relevance of search results for users of Dutch legal IR systems, the employment of usage and citation variables to improve the ranking of search results (bibliometric-enhanced information retrieval), and user-centred evaluation for ranking improvements. The goal of this project is to improve the ranking in legal IR systems. Ultimately, this will help legal professionals find relevant information faster.

Gineke Wiggers

### International Workshop on Algorithmic Bias in Search and Recommendation (Bias 2020)

Both search and recommendation algorithms provide results based on their relevance for the current user. In order to do so, such a relevance is usually computed by models trained on historical data, which is biased in most cases. Hence, the results produced by these algorithms naturally propagate, and frequently reinforce, biases hidden in the data, consequently strengthening inequalities. Being able to measure, characterize, and mitigate these biases while keeping high effectiveness is a topic of central interest for the information retrieval community. In this workshop, we aim to collect novel contributions in this emerging field and to provide a common ground for interested researchers and practitioners.

Ludovico Boratto, Mirko Marras, Stefano Faralli, Giovanni Stilo

### Bibliometric-Enhanced Information Retrieval 10th Anniversary Workshop Edition

The Bibliometric-enhanced Information Retrieval workshop series (BIR) was launched at ECIR in 2014 [19] and has been held at ECIR each year since then. This year we organize the 10th iteration of BIR. The workshop series at ECIR and JCDL/SIGIR tackles issues related to academic search, at the crossroads between Information Retrieval, Natural Language Processing and Bibliometrics. In this overview paper, we summarize the past workshops, present the workshop topics for 2020 and reflect on some future steps for this workshop series.

Guillaume Cabanac, Ingo Frommholz, Philipp Mayr

### The 3rd International Workshop on Narrative Extraction from Texts: Text2Story 2020

The Third International Workshop on Narrative Extraction from Texts (Text2Story’20) [text2story20.inesctec.pt], held in conjunction with the 42nd European Conference on Information Retrieval (ECIR 2020), gives researchers in IR, NLP and other fields the opportunity to share their recent advances in the extraction and formal representation of narratives. This workshop also presents a forum to consolidate the multi-disciplinary efforts and foster discussions around the narrative extraction task, a hot topic in recent years.

Ricardo Campos, Alípio Jorge, Adam Jatowt, Sumit Bhatia

### Proposal of the First International Workshop on Semantic Indexing and Information Retrieval for Health from Heterogeneous Content Types and Languages (SIIRH)

The application of Information Retrieval (IR) and deep learning strategies to explore the vast amount of rapidly growing health-related content is of utmost importance, but is also particularly challenging, due to the very specialized domain language and implicit differences in language characteristics depending on the content type. This workshop aims at presenting and discussing current and future directions for IR and machine learning approaches devoted to the retrieval and classification of different types of health-related documents, ranging from layman or patient-generated texts to highly specialized medical literature or clinical records. It includes a session on the MESINESP shared task, supported by the Spanish National Language Technology plan (Plan TL), in order to address the importance and impact of community evaluation efforts, in particular BioASQ, BioCreative, eHealth CLEF, MEDIQA and TREC, as scenarios for exploring evaluation settings and generating data collections of key importance for promoting the development and comparison of IR resources. Additionally, an open session will address IR technologies for heterogeneous health-related content open to multiple languages, with a particular interest in the exploitation of structured controlled vocabularies and entity linking, covering the following topics: multilingual and non-English health-related IR, concept indexing, text categorization, and generation of evaluation resources; biomedical document IR strategies; scalability, robustness and reproducibility of health IR and text mining resources; use of specialized machine translation and advanced deep learning approaches for improving health-related search results; medical Question Answering search tools; retrieval of multilingual health-related web content; and other related topics.

Francisco M. Couto, Martin Krallinger

### Principle-to-Program: Neural Methods for Similar Question Retrieval in Online Communities

Similar question retrieval is challenging due to the lexical gap between the query and the candidates in the archive, and it differs substantially from traditional IR methods for duplicate detection, paraphrase identification and semantic equivalence. This tutorial covers recent deep learning techniques that overcome the feature engineering issues of existing approaches based on translation models and latent topics. The hands-on tutorial will introduce each concept from the end-user (e.g., question-answer pairs) and technique (e.g., attention) perspectives, present state-of-the-art methods, and walk through programs executed in Jupyter notebooks on real-world datasets that demonstrate the principles introduced.

Muthusamy Chelliah, Manish Shrivastava, Jaidam Ram Tej

### Text Meets Space: Geographic Content Extraction, Resolution and Information Retrieval

In this half-day tutorial, we will review the basic concepts of, methods for, and applications of geographic information retrieval, also showing some possible applications in fields such as the digital humanities. The tutorial is organized in four parts. First we introduce some basic ideas about geography, and demonstrate why text is a powerful way of exploring relevant questions. We then introduce a basic end-to-end pipeline discussing geographic information in documents, spatial and multi-dimensional indexing [19], and spatial retrieval and spatial filtering. After showing a range of possible applications, we conclude with suggestions for future work in the area.

Jochen L. Leidner, Bruno Martins, Katherine McDonough, Ross S. Purves

### Backmatter
