2019 | Book

Experimental IR Meets Multilinguality, Multimodality, and Interaction

10th International Conference of the CLEF Association, CLEF 2019, Lugano, Switzerland, September 9–12, 2019, Proceedings

Editors: Fabio Crestani, Martin Braschler, Jacques Savoy, Andreas Rauber, Henning Müller, David E. Losada, Gundula Heinatz Bürki, Linda Cappellato, Nicola Ferro

Publisher: Springer International Publishing

Book Series: Lecture Notes in Computer Science

Part of: Springer Professional "Wirtschaft+Technik", Springer Professional "Technik", Springer Professional "Wirtschaft"

About this book

This book constitutes the refereed proceedings of the 10th International Conference of the CLEF Association, CLEF 2019, held in Lugano, Switzerland, in September 2019.

The conference has a clear focus on experimental information retrieval, with special attention to the challenges of multimodality, multilinguality, and interactive search, ranging from unstructured to semi-structured and structured data. The 7 full papers and 8 short papers presented in this volume were carefully reviewed and selected from 30 submissions. This year, many contributions address social networks, with work on the detection of stances or the early identification of depression signs on Twitter in a cross-lingual context.

Further, this volume presents 7 “best of the labs” papers, which were reviewed as full paper submissions with the same review criteria. The labs represented scientific challenges based on new data sets and real-world problems in multimodal and multilingual information access. In addition, 9 benchmarking labs reported the results of their yearlong activities in overview talks and lab sessions.

Frontmatter

History

What Happened in CLEF For a While?

2019 marks the 20th birthday of CLEF, an evaluation campaign activity which has applied the Cranfield evaluation paradigm to the testing of multilingual and multimodal information access systems in Europe. This paper provides a summary of the motivations which led to the establishment of CLEF, and a description of how it has evolved over the years, the major achievements, and what we see as the next challenges.

Nicola Ferro

Full Papers

Crosslingual Depression Detection in Twitter Using Bilingual Word Alignments

Depression is a mental disorder with strong social and economic implications. Due to its relevance, several recent studies have explored the analysis of social media content to identify and track depressed users. Most approaches follow a supervised learning strategy that relies on the availability of labeled training data. Unfortunately, acquiring such data is very complex and costly. To handle this problem, in this paper we propose a crosslingual approach based on the idea that data already labeled in a specific language can be leveraged to classify depression in other languages. The proposed method is based on a word-level alignment process. In particular, we propose two representations for the alignment: one takes advantage of the psycholinguistic resource LIWC and the other uses bilingual word embeddings. To evaluate the proposed approach, we addressed the detection of depression using English and Spanish tweets as the source and target data, respectively. The results outperformed solutions based on automatic translation of texts, confirming the usefulness of the proposed approach.

Laritza Coello-Guilarte, Rosa María Ortega-Mendoza, Luis Villaseñor-Pineda, Manuel Montes-y-Gómez

Studying the Variability of System Setting Effectiveness by Data Analytics and Visualization

Search engines differ in their modules and parameters; defining the optimal system setting is all the more challenging because of the complexity of a retrieval stream. The main goal of this study is to determine which are the most important system components and parameters in a system setting, and thus which ones should be tuned first. We carry out an extensive analysis of 20,000 different system settings applied to three TREC ad-hoc collections. Our analysis includes zooming in and out of the data using various data analysis methods such as ANOVA, CART, and data visualization. We found that the query expansion model is the component that most significantly changes system effectiveness, consistently across collections. Zooming in on the queries, we show that the most significant component changes to the retrieval model when considering easy queries only. The results of our study are directly reusable by system designers and for system tuning.

Sébastien Déjean, Josiane Mothe, Md. Zia Ullah

Stance Detection in Web and Social Media: A Comparative Study

Online forums and social media platforms are increasingly being used to discuss topics of varying polarities where different people take different stances. Several methodologies for automatic stance detection from text have been proposed in the literature. To our knowledge, there has not been any systematic investigation of their reproducibility and comparative performance. In this work, we explore the reproducibility of several existing stance detection models, including both neural models and classical classifier-based models. Through experiments on two datasets – (i) the popular SemEval microblog dataset, and (ii) a set of health-related online news articles – we also perform a detailed comparative analysis of various methods and explore their shortcomings.

Shalmoli Ghosh, Prajwal Singhania, Siddharth Singh, Koustav Rudra, Saptarshi Ghosh

TwitCID: A Collection of Data Sets for Studies on Information Diffusion on Social Networks

Online social networks play a crucial role in spreading information at a very large scale. Modeling information propagation on social networks has been attracting a lot of attention from researchers. However, none of the data sets used in past works are made available to the research community, although they would be very useful for comparative studies. In this paper, we detail a collection of tweets that we release, composed of five data sets for a total of 18 million tweets and designed to evaluate methods for modeling information spread, in the case of general information and brand marketing information. In addition to tweet IDs and a script to retrieve the whole tweet in JSON from the Twitter API, we release the values of the 29 extracted features for these data sets. These features consist of user-based, content-based and temporal features. Finally, we provide the results of information diffusion prediction models (80% accuracy) which could serve as strong baselines for this research topic.

Thi Bich Ngoc Hoang, Josiane Mothe, Manon Baillon

Sonny, Cerca! Evaluating the Impact of Using a Vocal Assistant to Search at School

Children struggle with translating their information needs into effective queries to initiate the search process. In this paper, we explore the degree to which the use of a Vocal Assistant (VA) as an intermediary between a child and a search engine can ease query formulation and foster completion of successful searches. We also examine the potential influence VA can have on the search process when compared to a traditional keyboard-driven approach. This comparison motivates the second contribution of our work, an evaluation framework that covers 4 dimensions: (1) a new search strategy (VA) for (2) a specific user group (children) given (3) a particular task (answering questions) in (4) a defined environment (school). The proposed framework can be adopted by the research community to conduct comprehensive assessments of search systems given new interaction methods, user groups, contexts, and tasks.

Monica Landoni, Davide Matteri, Emiliana Murgia, Theo Huibers, Maria Soledad Pera

Generating Cross-Domain Text Classification Corpora from Social Media Comments

In natural language processing (NLP), cross-domain text classification problems like cross-topic, cross-genre or cross-language authorship attribution are characterized by having different contexts for training and testing data. That is, learning algorithms which are trained on the specific properties of the training data have to make predictions on test data which comprises substantially different properties. To this end, the corpora that are used for analyses in cross-domain problems are limited in size and variation, decreasing the expressive power and generalizability of the proposed solutions. In this paper, we present a methodological framework and toolset for dynamically creating cross-domain datasets by utilizing millions of Reddit comments. We show that different types of cross-domain datasets such as cross-topic or cross-lingual corpora can be constructed, and demonstrate a wide variety of use cases, including previously unfeasible analyses like cross-lingual authorship attribution on original, non-translated texts. Using state-of-the-art authorship attribution methods, we show the potential of a cross-topic corpus generated by our framework when compared to the corpora that were used in related approaches, and enable the advance of research previously limited by corpora availability.

Benjamin Murauer, Günther Specht

Efficient Answer-Annotation for Frequent Questions

Ground truth is a crucial resource for the creation of effective question-answering (Q-A) systems. When no appropriate ground truth is available, as is often the case in domain-specific Q-A systems (e.g. customer-support, tourism) or in languages other than English, new ground truth can be created by human annotation. The annotation process in which a human annotator looks up the corresponding answer label for each question from an answer catalog (Sequential approach), however, is usually time-consuming and costly. In this paper, we propose a new approach, in which the annotator first manually groups questions that have the same intent as a candidate question, and then labels the entire group in one step (Group-Wise approach). To retrieve same-intent questions effectively, we evaluate various unsupervised semantic similarity methods from recent literature, and implement the most effective one in our annotation approach. Afterwards, we compare the Group-Wise approach with the Sequential approach with respect to answer look-ups, annotation time, and label-quality. We show based on 500 German customer-support questions that the Group-Wise approach requires 51% fewer answer look-ups, is 41% more time-efficient, and retains the same label-quality as the Sequential approach. Note that the described approach is limited to Q-A systems where frequently asked questions occur.

Markus Zlabinger, Navid Rekabsaz, Stefan Zlabinger, Allan Hanbury

Short Papers

Improving Ranking for Systematic Reviews Using Query Adaptation

Identifying relevant studies for inclusion in systematic reviews requires significant effort from human experts who manually screen large numbers of studies. The problem is made more difficult by the growing volume of medical literature, and Information Retrieval techniques have proved useful in reducing the workload. Reviewers are often interested in particular types of evidence such as Diagnostic Test Accuracy studies. This paper explores the use of query adaptation to identify particular types of evidence and thereby reduce the workload placed on reviewers. A simple retrieval system that ranks studies using TF.IDF weighted cosine similarity was implemented. The Log-Likelihood, Chi-Squared and Odds-Ratio lexical statistics and relevance feedback were used to generate sets of terms that indicate evidence relevant to Diagnostic Test Accuracy reviews. Experiments using a set of 80 systematic reviews from the CLEF2017 and CLEF2018 eHealth tasks demonstrate that the approach improves retrieval performance.

Amal Alharbi, Mark Stevenson
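The abstract above describes a retrieval system that ranks studies by TF.IDF weighted cosine similarity. The following is only a minimal sketch of that general technique, not the authors' implementation: the smoothed idf variant, the whitespace tokenization, the function names and the toy studies are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one {term: weight} dict per tokenized document, plus the idf table."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed idf (an assumption; the abstract does not specify the variant).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, docs):
    """Return document indices ordered by decreasing similarity to the query."""
    vectors, idf = tfidf_vectors(docs)
    # Query terms unseen in the collection are ignored.
    q = {t: tf * idf[t] for t, tf in Counter(query).items() if t in idf}
    scores = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    return [i for _, i in sorted(scores, key=lambda s: (-s[0], s[1]))]


# Toy collection (invented for this sketch).
studies = [
    "sensitivity and specificity of ultrasound for appendicitis".split(),
    "randomised trial of statin therapy".split(),
    "diagnostic accuracy of rapid antigen test sensitivity".split(),
]
query = "diagnostic test accuracy sensitivity specificity".split()
ranking = rank(query, studies)
print(ranking)
```

In this setting, query adaptation would amount to extending `query` with terms that indicate the evidence type of interest before calling `rank`; how those terms are selected is the subject of the paper and is not reproduced here.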

Analyzing the Adequacy of Readability Indicators to a Non-English Language

Readability is a linguistic feature that indicates how difficult it is to read a text. Traditional readability formulas were made for the English language. This study evaluates their adequacy for the Portuguese language. We applied the traditional formulas to 10 parallel corpora. We verified that the Portuguese language had higher grade scores (less readability) in the formulas that use the number of syllables per word or the number of complex words per sentence. Formulas that use letters per word instead of syllables per word output similar grade scores. Considering this, we evaluated the correlation of the complex words in 65 Portuguese school books covering 12 schooling years. We found that the concept of a complex word as a word with 4 or more syllables, instead of 3 or more syllables as originally used in traditional formulas applied to English texts, is more correlated with the grade of Portuguese school books. Finally, we adapted each traditional readability formula to the Portuguese language by performing a multiple linear regression on the same dataset of school books.

Hélder Antunes, Carla Teixeira Lopes
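To make the effect of the complex-word definition concrete, here is a small sketch using the Gunning Fog index, a well-known formula based on complex words. Whether this particular formula is among those studied is an assumption, as are the function name, the example sentences and the hand-supplied syllable counts; the regression-based adaptation from the paper is not reproduced.

```python
def gunning_fog(sentences, syllable_counts, min_complex_syllables=3):
    """Gunning Fog index: 0.4 * (words per sentence + 100 * complex words / words).

    `sentences` is a list of tokenized sentences and `syllable_counts` maps each
    word to its number of syllables (supplied by hand here; a real system would
    need a syllabifier for the target language).
    """
    words = [w for sentence in sentences for w in sentence]
    complex_words = [w for w in words if syllable_counts[w] >= min_complex_syllables]
    return 0.4 * (len(words) / len(sentences) + 100 * len(complex_words) / len(words))


# Two short Portuguese sentences with hand-counted syllables (illustrative only).
sentences = [
    ["a", "universidade", "é", "grande"],
    ["o", "menino", "estuda", "muito"],
]
syllable_counts = {
    "a": 1, "universidade": 6, "é": 1, "grande": 2,
    "o": 1, "menino": 3, "estuda": 3, "muito": 2,
}

english_style = gunning_fog(sentences, syllable_counts, min_complex_syllables=3)
adapted = gunning_fog(sentences, syllable_counts, min_complex_syllables=4)
print(round(english_style, 2), round(adapted, 2))
```

Raising the threshold from 3 to 4 syllables removes "menino" and "estuda" from the complex-word count, which lowers the grade score for the same text.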

How Many Labels? Determining the Number of Labels in Multi-Label Text Classification

Multi-Label Text Classification (MLTC) is a supervised machine learning task in which the goal is to learn a classifier that assigns multiple labels to text documents. When all documents have the same number of labels, this task is very close to ordinary (single label) text classification. However, when this number varies, another classifier needs to determine, for each document, how many labels to assign. The topic of this paper is exactly this additional classifier. We compare several baselines to a system which learns a dynamic threshold for a given text classifier. The thresholding classifier receives the ranked list of scores for each label for a document as input and returns a threshold score. All labels with a score higher than this threshold will then be assigned to the document. Our results show that, first, this dynamic thresholding significantly improves recall but has the same precision as a static system which assigns the same (the mean) number of classes to each document, and second, that the accuracy of predicting the number of classes is positively related to the quality (measured by MAP) of the text classifier.

Hosein Azarbonyad, Maarten Marx
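The difference between the dynamic threshold and the static baseline described in the abstract above can be sketched as follows. The learned thresholding classifier itself is not reproduced; the threshold is simply passed in, and the function names, label names and scores are invented for illustration.

```python
def labels_above_threshold(scores, threshold):
    """Dynamic assignment: keep every label whose score exceeds the threshold."""
    kept = [label for label, score in scores.items() if score > threshold]
    return sorted(kept, key=lambda label: -scores[label])


def top_k_labels(scores, k):
    """Static baseline: always assign the k highest-scoring labels."""
    return sorted(scores, key=lambda label: -scores[label])[:k]


# Per-label scores for one document, as a text classifier might produce them.
scores = {"sports": 0.91, "politics": 0.15, "health": 0.62, "finance": 0.58}

# In the paper the threshold is predicted per document; 0.5 is a stand-in value.
dynamic = labels_above_threshold(scores, threshold=0.5)
static = top_k_labels(scores, k=2)
print(dynamic, static)
```

With a per-document threshold the number of assigned labels can vary (three here), whereas the static system returns the same number of labels for every document.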

Using Audio Transformations to Improve Comprehension in Voice Question Answering

Many popular form factors of digital assistants—such as Amazon Echo or Google Home—enable users to converse with speech-based systems. The lack of screens presents unique challenges. To satisfy users’ information needs, the presentation of answers has to be optimized for voice-only interactions. We evaluate the usefulness of audio transformations (i.e., prosodic modifications) for voice-only question answering. We introduce a crowdsourcing setup evaluating the quality of our proposed modifications along multiple dimensions corresponding to the informativeness, naturalness, and ability of users to identify key parts of the answer. We offer a set of prosodic modifications that highlight potentially important parts of the answer using various acoustic cues. Our experiments show that different modifications lead to better comprehension at the expense of slightly degraded naturalness of the audio.

Aleksandr Chuklin, Aliaksei Severyn, Johanne R. Trippas, Enrique Alfonseca, Hanna Silen, Damiano Spina

A User Modeling Shared Challenge Proposal

Comparative evaluation in the areas of User Modeling, Adaptation and Personalization (UMAP) is significantly challenging. It has always been difficult to rigorously compare different approaches to personalization, as the function of the resulting systems is, by their nature, heavily influenced by the behavior of the users involved in trialing the systems. Developing comparative evaluations in this space would be a huge advancement as it would enable shared comparison across research. Here we present a proposal for a shared challenge generation in UMAP, focusing on user model generation using logged mobile phone data, with an assumed purpose of supporting mobile phone notification suggestion. The dataset, evaluation metrics, and challenge operation are described.

Owen Conlan, Kieran Fraser, Liadh Kelly, Bilal Yousuf

How Lexical Gold Standards Have Effects on the Usefulness of Text Analysis Tools for Digital Scholarship

This paper describes how the current lexical similarity and analogy gold standards are built to conform to certain ideas about what the models they are designed to evaluate are used for. Topical relevance has always been the most important target notion for information access tools and related language technologies, and while this has proven a useful starting point for much of what information technology is used for, it does not always align well with other uses to which technologies are being put, most notably use cases from digital scholarship in the humanities or social sciences. This paper argues for more systematic formulation of requirements from the digital humanities and social sciences and more explicit description of the assumptions underlying model design.

Jussi Karlgren

Personality Facets Recognition from Text

Fundamental Big Five personality traits (e.g., Extraversion) and their facets (e.g., Activity) are known to correlate with a broad range of linguistic features and, accordingly, the recognition of personality traits from text is a well-known Natural Language Processing task. Labelling text data with facets information, however, may require the use of lengthy personality inventories, and perhaps for that reason existing computational models of this kind are usually limited to the recognition of the fundamental traits. Based on these observations, this paper investigates the issue of personality facets recognition from text labelled only with information available from a shorter personality inventory. In doing so, we provide a low-cost model for the recognition of certain personality facets, and present reference results for further studies in this field.

Wesley Ramos dos Santos, Ivandré Paraboni

Unsupervised System Combination for Set-Based Retrieval with Expectation Maximization

System combination has been shown to improve overall performance on many rank-based retrieval tasks, often by combining results from multiple systems into a single ranked list. In contrast, set-based retrieval tasks call for a technique to combine results in ways that require decisions on whether each document is in or out of the result set. This paper presents a set-generating unsupervised system combination framework that draws inspiration from evaluation techniques in sparse data settings. It argues for the existence of a duality between evaluation and system combination, and then capitalizes on this duality to perform unsupervised system combination. To do this, the framework relies on the consensus of the systems to estimate latent “goodness” for each system. An implementation of this framework using data programming is compared to other unsupervised system combination approaches to demonstrate its effectiveness on CLEF and MATERIAL collections.

Han-Chin Shing, Joe Barrow, Petra Galuščáková, Douglas W. Oard, Philip Resnik

Best of CLEF 2018 Labs

An Ensemble Approach to Cross-Domain Authorship Attribution

This paper presents an ensemble approach to cross-domain authorship attribution that combines predictions made by three independent classifiers, namely, standard character n-grams, character n-grams with non-diacritic distortion and word n-grams. Our proposal relies on variable-length n-gram models and multinomial logistic regression to select the prediction of highest probability among the three models as the output for the task. The present approach is compared against a number of baseline systems, and we report results based on both the PAN-CLEF 2018 test data, and on a new corpus of song lyrics in English and Portuguese.

José Eleandro Custódio, Ivandré Paraboni
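The selection step described in the abstract above, choosing the most confident prediction among three classifiers, can be illustrated with a short sketch. The model names, candidate authors and probabilities are invented, and the n-gram feature extraction and logistic regression training are omitted; this is not the authors' code.

```python
def ensemble_predict(model_outputs):
    """Pick the single most confident prediction across several models.

    `model_outputs` maps a model name to a {author: probability} dict.
    Returns (author, model_name, probability) for the highest probability found.
    """
    best_author, best_model, best_prob = None, None, -1.0
    for model_name, probabilities in model_outputs.items():
        author, prob = max(probabilities.items(), key=lambda item: item[1])
        if prob > best_prob:
            best_author, best_model, best_prob = author, model_name, prob
    return best_author, best_model, best_prob


# Hypothetical per-model class probabilities for one unknown document.
outputs = {
    "char_ngrams": {"author_a": 0.55, "author_b": 0.45},
    "distorted_char_ngrams": {"author_a": 0.30, "author_b": 0.70},
    "word_ngrams": {"author_a": 0.48, "author_b": 0.52},
}
prediction = ensemble_predict(outputs)
print(prediction)
```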

Evaluation of Deep Species Distribution Models Using Environment and Co-occurrences

This paper presents an evaluation of several approaches to plant species distribution modeling based on spatial, environmental and co-occurrence data using machine learning methods. In particular, we re-evaluate the environmental convolutional neural network model that obtained the best performance in the GeoLifeCLEF 2018 challenge, but on a revised dataset that fixes some of the issues of the previous one. We also go deeper into the analysis of co-occurrence information by evaluating a new model that jointly takes environmental variables and co-occurrences as inputs of an end-to-end network. Results show that the environmental models are the best performing methods and that there is a significant amount of complementary information between co-occurrences and environment. Indeed, the model learned on both inputs allows a significant performance gain compared to the environmental model alone.

Benjamin Deneu, Maximilien Servajean, Christophe Botella, Alexis Joly

Interactive Learning-Based Retrieval Technique for Visual Lifelogging

Currently, there is a plethora of wearable video devices that can easily collect data from a user's daily life. This fact has promoted the development of lifelogging applications for security, healthcare, and leisure. However, the retrieval of events that are not pre-defined is still a challenge, because it is impossible to have a potentially unlimited number of fully annotated databases covering all possible events. This work proposes an interactive and weakly supervised learning approach that is capable of retrieving any kind of event using general and weakly annotated databases. The proposed system has been evaluated with the database provided by the Lifelog Moment Retrieval (LMRT) challenge of ImageCLEF (Lifelog2018), where it reached the first position in the final ranking.

Ergina Kavallieratou, Carlos R. del-Blanco, Carlos Cuevas, Narciso García

An Effective Deep Transfer Learning and Information Fusion Framework for Medical Visual Question Answering

Medical visual question answering (Med-VQA) is very important for better clinical decision support and enhanced patient engagement in patient-centered medical care. Compared with open domain VQA tasks, VQA in the medical domain is more challenging due to limited training resources as well as the unique characteristics of medical images and domain vocabularies. In this paper, we propose and develop a novel deep transfer learning model, ETM-Trans, which exploits embedding topic modeling (ETM) on textual questions to derive topic labels to pair with associated medical images for finetuning the pre-trained ImageNet model. We also explore and implement a co-attention mechanism where residual networks are used to extract visual features from the image, interacting with the long short-term memory (LSTM) based question representation to provide fine-grained contextual information for answer derivation. To efficiently integrate visual features from the image and textual features from the question, we employ Multimodal Factorized Bilinear (MFB) pooling as well as Multimodal Factorized High-order (MFH) pooling. The ETM-Trans model won the international Med-VQA 2018 challenge, achieving the best WBSS score of 0.186.

Feifan Liu, Yalei Peng, Max P. Rosen

Language Modeling in Temporal Mood Variation Models for Early Risk Detection on the Internet

Early risk detection can be useful in different areas, particularly those related to health and safety. Two tasks were proposed at CLEF eRisk-2018 for predicting mental disorders using users' posts on Reddit. Depression and anorexia disorders must be detected as early as possible. In this paper, we extend the participation of LIRMM (Laboratoire d’Informatique, de Robotique et de Microélectronique de Montpellier) in both tasks. The proposed model addresses this problem by modeling the temporal mood variation detected from user posts. The proposed architectures use only textual information without any hand-crafted features or dictionaries. The basic architecture uses two learning phases through exploration of state-of-the-art text vectorizations and deep language models. The proposed models perform comparably to other contributions, while further experiments show that attentive based deep language models outperformed the shallow learning text vectorizations.

Waleed Ragheb, Jérôme Azé, Sandra Bringay, Maximilien Servajean

Medical Image Labelling and Semantic Understanding for Clinical Applications

Semantic concept detection contributes to machine understanding and learning from medical images; it also plays an important role in image reading and image-assisted diagnosis. In this study, the problem of detecting high-frequency concepts from medical images was transformed into a multi-label classification task. The transfer learning method based on convolutional neural networks (CNNs) was used to recognize high-frequency medical concepts. The image retrieval-based topic modelling method was used to obtain the semantically related concepts from images similar to the given medical images. Our group participated in the concept detection subtasks that were launched by ImageCLEFcaption 2018 and ImageCLEFmed Caption 2019. In the 2018 task, the CNN-based transfer learning method achieved an F1 score of 0.0928, while the retrieval-based topic model achieved an F1 score of 0.0907. Although the latter method recalled some low-frequency concepts, it heavily depended on the image retrieval results. For the 2019 task, we proposed body part-based pre-classification strategies and achieved an F1 score of 0.2235. The results indicated that the transfer learning-based multi-label classification method was more robust in high-frequency concept detection across different data sets, but there is still much room for improvement in large-scale open semantic concept detection research.

Xuwen Wang, Zhen Guo, Yu Zhang, Jiao Li

To Check or Not to Check: Syntax, Semantics, and Context in the Language of Check-Worthy Claims

As the spread of information has received a compelling boost due to pervasive use of social media, so has the spread of misinformation. The sheer volume of data has rendered the traditional methods of expert-driven manual fact-checking largely infeasible. As a result, computational linguistics and data-driven algorithms have been explored in recent years. Despite this progress, identifying and prioritizing what needs to be checked has received little attention. Given that expert-driven manual intervention is likely to remain an important component of fact-checking, especially in specific domains (e.g., politics, environmental science), this identification and prioritization is critical. A successful algorithmic ranking of “check-worthy” claims can help an expert-in-the-loop fact-checking system, thereby reducing the expert’s workload while still tackling the most salient bits of misinformation. In this work, we explore how linguistic syntax, semantics, and the contextual meaning of words play a role in determining the check-worthiness of claims. Our preliminary experiments used explicit stylometric features and simple word embeddings on the English language dataset in the Check-worthiness task of the CLEF-2018 Fact-Checking Lab, where our primary solution outperformed the other systems in terms of the mean average precision, R-precision, reciprocal rank, and precision at k for multiple values of k. Here, we present an extension of this approach with more sophisticated word embeddings and report further improvements in this task.

Chaoyuan Zuo, Ayla Ida Karakas, Ritwik Banerjee

CLEF 2019 Lab Overviews

Overview of CENTRE@CLEF 2019: Sequel in the Systematic Reproducibility Realm

Reproducibility has become increasingly important in many research areas; IR is no exception and has started to be concerned with reproducibility and its impact on research results. This paper describes our second attempt to propose a lab on reproducibility named CENTRE, held during CLEF 2019. The aim of CENTRE is to run both a replicability and reproducibility challenge across all the major IR evaluation campaigns and to provide the IR community with a venue where previous research results can be explored and discussed. This paper reports the participant results and preliminary considerations on the second edition of CENTRE@CLEF 2019.

Nicola Ferro, Norbert Fuhr, Maria Maistro, Tetsuya Sakai, Ian Soboroff

Overview of the CLEF-2019 CheckThat! Lab: Automatic Identification and Verification of Claims

We present an overview of the second edition of the CheckThat! Lab at CLEF 2019. The lab featured two tasks in two different languages: English and Arabic. Task 1 (English) challenged the participating systems to predict which claims in a political debate or speech should be prioritized for fact-checking. Task 2 (Arabic) asked to (A) rank a given set of Web pages with respect to a check-worthy claim based on their usefulness for fact-checking that claim, (B) classify these same Web pages according to their degree of usefulness for fact-checking the target claim, (C) identify useful passages from these pages, and (D) use the useful pages to predict the claim’s factuality. CheckThat! provided a full evaluation framework, consisting of data in English (derived from fact-checking sources) and Arabic (gathered and annotated from scratch) and evaluation based on mean average precision (MAP) and normalized discounted cumulative gain (nDCG) for ranking, and F1 for classification. A total of 47 teams registered to participate in this lab, and fourteen of them actually submitted runs (compared to nine last year). The evaluation results show that the most successful approaches to Task 1 used various neural networks and logistic regression. As for Task 2, learning-to-rank was used by the highest scoring runs for subtask A, while different classifiers were used in the other subtasks. We release to the research community all datasets from the lab as well as the evaluation scripts, which should enable further research in the important tasks of check-worthiness estimation and automatic claim verification.

Tamer Elsayed, Preslav Nakov, Alberto Barrón-Cedeño, Maram Hasanain, Reem Suwaileh, Giovanni Da San Martino, Pepa Atanasova
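Since the lab overview above evaluates rankings with mean average precision (MAP), a compact sketch of the standard definition may help readers unfamiliar with it. This is not the lab's released evaluation script; the function names and the toy runs are assumptions for illustration.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of precision@k taken at each rank k where a relevant item appears,
    divided by the total number of relevant items."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over (ranking, relevant set) pairs, one per topic."""
    return sum(average_precision(ranking, relevant) for ranking, relevant in runs) / len(runs)


# Two toy topics: ranked claim identifiers and the check-worthy ones among them.
runs = [
    (["c1", "c2", "c3", "c4"], {"c1", "c3"}),  # AP = (1/1 + 2/3) / 2
    (["d1", "d2", "d3"], {"d2"}),              # AP = (1/2) / 1
]
map_score = mean_average_precision(runs)
print(round(map_score, 4))
```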

Overview of the CLEF eHealth Evaluation Lab 2019

In this paper, we provide an overview of the seventh annual edition of the CLEF eHealth evaluation lab. CLEF eHealth 2019 continues our evaluation resource building efforts around the easing and support of patients, their next of kin, clinical staff, and health scientists in understanding, accessing, and authoring electronic health information in a multilingual setting. This year’s lab advertised three tasks: Task 1 on indexing non-technical summaries of German animal experiments with International Classification of Diseases, Version 10 codes; Task 2 on technology assisted reviews in empirical medicine building on 2017 and 2018 tasks in English; and Task 3 on consumer health search in mono- and multilingual settings that builds on the 2013–18 Information Retrieval tasks. In total nine teams took part in these tasks (six in Task 1 and three in Task 2). Herein, we describe the resources created for these tasks and the evaluation methodology adopted. We also provide a brief summary of the participants in this year’s challenges and the results obtained. As in previous years, the organizers have made data and tools associated with the lab tasks available for future research and development.

Liadh Kelly, Hanna Suominen, Lorraine Goeuriot, Mariana Neves, Evangelos Kanoulas, Dan Li, Leif Azzopardi, Rene Spijker, Guido Zuccon, Harrisen Scells, João Palotti

Overview of eRisk 2019 Early Risk Prediction on the Internet

This paper provides an overview of eRisk 2019, the third edition of this lab under the CLEF conference. The main purpose of eRisk is to explore issues of evaluation methodology, effectiveness metrics and other processes related to early risk detection. Early detection technologies can be employed in different areas, particularly those related to health and safety. This edition of eRisk had three tasks. Two of them shared the same format and focused on the early detection of signs of depression (T1) or self-harm (T2). The third task focused on an innovative challenge related to automatically filling in a depression questionnaire based on user interactions in social media.

David E. Losada, Fabio Crestani, Javier Parapar

ImageCLEF 2019: Multimedia Retrieval in Medicine, Lifelogging, Security and Nature

This paper presents an overview of the ImageCLEF 2019 lab, organized as part of the Conference and Labs of the Evaluation Forum - CLEF Labs 2019. ImageCLEF is an ongoing evaluation initiative (started in 2003) that promotes the evaluation of technologies for annotation, indexing and retrieval of visual data with the aim of providing information access to large collections of images in various usage scenarios and domains. In 2019, the 17th edition of ImageCLEF runs four main tasks: (i) a medical task that groups three previous tasks (caption analysis, tuberculosis prediction, and medical visual question answering) with new data, (ii) a lifelog task (videos, images and other sources) about daily activities understanding, retrieval and summarization, (iii) a new security task addressing the problems of automatically identifying forged content and retrieving hidden information, and (iv) a new coral task about segmenting and labeling collections of coral images for 3D modeling. The strong participation, with 235 research groups registering, and 63 submitting over 359 runs, shows considerable interest in this benchmark campaign.

Bogdan Ionescu, Henning Müller, Renaud Péteri, Yashin Dicente Cid, Vitali Liauchuk, Vassili Kovalev, Dzmitri Klimuk, Aleh Tarasau, Asma Ben Abacha, Sadid A. Hasan, Vivek Datla, Joey Liu, Dina Demner-Fushman, Duc-Tien Dang-Nguyen, Luca Piras, Michael Riegler, Minh-Triet Tran, Mathias Lux, Cathal Gurrin, Obioma Pelka, Christoph M. Friedrich, Alba Garcìa Seco de Herrera, Narciso Garcia, Ergina Kavallieratou, Carlos Roberto del Blanco, Carlos Cuevas, Nikos Vasillopoulos, Konstantinos Karampidis, Jon Chamberlain, Adrian Clark, Antonio Campello

Overview of LifeCLEF 2019: Identification of Amazonian Plants, South & North American Birds, and Niche Prediction

Building accurate knowledge of the identity, the geographic distribution and the evolution of living species is essential for a sustainable development of humanity, as well as for biodiversity conservation. Unfortunately, such basic information is often only partially available for professional stakeholders, teachers, scientists and citizens, and often incomplete for ecosystems that possess the highest diversity. In this context, an ultimate ambition is to set up innovative information systems relying on the automated identification and understanding of living organisms as a means to engage massive crowds of observers and boost the production of biodiversity and agro-biodiversity data. The LifeCLEF 2019 initiative proposes three data-oriented challenges related to this vision, in the continuity of the previous editions but with several consistent novelties intended to push the boundaries of the state-of-the-art in several research directions. This paper describes the methodology of the conducted evaluations as well as the synthesis of the main results and lessons learned.

Alexis Joly, Hervé Goëau, Christophe Botella, Stefan Kahl, Maximillien Servajean, Hervé Glotin, Pierre Bonnet, Robert Planqué, Fabian Robert-Stöter, Willem-Pier Vellinga, Henning Müller

Overview of PAN 2019: Bots and Gender Profiling, Celebrity Profiling, Cross-Domain Authorship Attribution and Style Change Detection

We briefly report on the four shared tasks organized as part of the PAN 2019 evaluation lab on digital text forensics and authorship analysis. Each task is introduced and motivated, and the results obtained are presented. Altogether, the four tasks attracted 373 registrations, yielding 72 successful submissions. This, and the fact that we continue to invite the submission of software rather than its run output using the TIRA experimentation platform, marks a good start to the second decade of PAN evaluation labs.

Walter Daelemans, Mike Kestemont, Enrique Manjavacas, Martin Potthast, Francisco Rangel, Paolo Rosso, Günther Specht, Efstathios Stamatatos, Benno Stein, Michael Tschuggnall, Matti Wiegmann, Eva Zangerle

Overview of the CLEF 2019 Personalised Information Retrieval Lab (PIR-CLEF 2019)

The Personalised Information Retrieval Lab (PIR-CLEF 2019) is an initiative aimed at both providing and critically analysing the evaluation of Personalization in Information Retrieval (PIR) applications. PIR-CLEF 2019 is the second edition of the Lab after the successful Pilot lab organised at CLEF 2017 and the first edition of the Lab at CLEF 2018. PIR-CLEF 2019 provided registered participants with two tasks: the Web Search Task and the Medical Search Task. The Web Search Task continues the activities introduced in the previous editions of the PIR-CLEF Lab, while the Medical Search Task focuses on personalisation within an ad hoc search task introduced in previous editions of the CLEF eHealth Lab.

Gabriella Pasi, Gareth J. F. Jones, Lorraine Goeuriot, Liadh Kelly, Stefania Marrara, Camilla Sanvitto

Overview of CLEF 2019 Lab ProtestNews: Extracting Protests from News in a Cross-Context Setting

We present an overview of the CLEF-2019 Lab ProtestNews on Extracting Protests from News in the context of generalizable natural language processing. The lab consists of document, sentence, and token level information classification and extraction tasks, referred to as task 1, task 2, and task 3 respectively. The tasks required the participants to identify protest relevant information from English local news at one or more of these levels in a cross-context setting, which is cross-country in the scope of this lab. The training and development data were collected from India, and the test data were collected from India and China. The lab attracted 58 teams, of which 12 submitted results and 9 submitted working notes. We have observed that neural networks yield the best results and that performance drops significantly for the majority of the submissions in the cross-country setting, which is China.

Ali Hürriyetoğlu, Erdem Yörük, Deniz Yüret, Çağrı Yoltar, Burak Gürel, Fırat Duruşan, Osman Mutlu, Arda Akdemir

Backmatter

Title: Experimental IR Meets Multilinguality, Multimodality, and Interaction
Editors: Fabio Crestani, Martin Braschler, Jacques Savoy, Andreas Rauber, Henning Müller, David E. Losada, Gundula Heinatz Bürki, Linda Cappellato, Nicola Ferro
Publisher: Springer International Publishing
Electronic ISBN: 978-3-030-28577-7
Print ISBN: 978-3-030-28576-0
DOI: https://doi.org/10.1007/978-3-030-28577-7


