2015 | Book

Experimental IR Meets Multilinguality, Multimodality, and Interaction

6th International Conference of the CLEF Association, CLEF'15, Toulouse, France, September 8-11, 2015, Proceedings

Editors: Josiane Mothe, Jacques Savoy, Jaap Kamps, Karen Pinel-Sauvagnat, Gareth Jones, Eric San Juan, Linda Cappellato, Nicola Ferro

Publisher: Springer International Publishing

Book Series: Lecture Notes in Computer Science

About this book

This book constitutes the refereed proceedings of the 6th International Conference of the CLEF Initiative, CLEF 2015, held in Toulouse, France, in September 2015.

The 31 full papers and 20 short papers presented were carefully reviewed and selected from 68 submissions. They cover a broad range of issues in the fields of multilingual and multimodal information access evaluation. Also included is a set of labs and workshops designed to test different aspects of mono- and cross-language information retrieval systems.

Table of Contents

Frontmatter

Experimental IR

Frontmatter
Experimental Study on Semi-structured Peer-to-Peer Information Retrieval Network

In recent decades, retrieval systems deployed over peer-to-peer (P2P) overlay networks have been investigated as an alternative to centralised search engines. Although modern search engines provide efficient document retrieval, they possess several drawbacks. In order to alleviate their problems, P2P Information Retrieval (P2PIR) systems provide an alternative architecture to the traditional centralised search engine. Users and creators of web content in such networks have full control over what information they wish to share as well as how they share it. The semi-structured P2P architecture has been proposed, in which the underlying approach organises similar documents in a peer, often using clustering techniques, and promotes willing peers as super peers (or hubs) to route queries to appropriate peers with relevant content. However, no systematic evaluation study has been performed on such architectures. In this paper, we study the performance of three cluster-based semi-structured P2PIR models and explain the effect of several important design considerations and parameters on retrieval performance, as well as the robustness of these types of network.

Rami S. Alkhawaldeh, Joemon M. Jose
Evaluating Stacked Marginalised Denoising Autoencoders Within Domain Adaptation Methods

In this paper we address the problem of domain adaptation using multiple source domains. We extend the XRCE contribution to the CLEF’14 Domain Adaptation challenge [6] with new methods and new datasets. We describe a new class of domain adaptation technique based on stacked marginalized denoising autoencoders (sMDA). It aims at extracting and denoising features common to both source and target domains in the unsupervised mode. Noise marginalization makes it possible to obtain a closed-form solution and to considerably reduce the training time. We build a classification system which compares sMDA combined with SVM or with Domain Specific Class Mean classifiers to the state of the art in both unsupervised and semi-supervised settings. We report the evaluation results for a number of image and text datasets.

Boris Chidlovskii, Gabriela Csurka, Stephane Clinchant
Language Variety Identification Using Distributed Representations of Words and Documents

Language variety identification is an author profiling subtask which aims to detect lexical and semantic variations in order to classify different varieties of the same language. In this work we focus on the use of distributed representations of words and documents using the continuous Skip-gram model. We compare this model with three recent approaches: Information Gain Word-Patterns, TF-IDF graphs and Emotion-labeled Graphs, in addition to several baselines. We evaluate the models introducing the Hispablogs dataset, a new collection of Spanish blogs from five different countries: Argentina, Chile, Mexico, Peru and Spain. Experimental results show state-of-the-art performance in language variety identification. In addition, our empirical analysis provides interesting insights on the use of the evaluated approaches.

Marc Franco-Salvador, Francisco Rangel, Paolo Rosso, Mariona Taulé, M. Antònia Martí
Evaluating User Image Tagging Credibility

When looking for information on the Web, the credibility of the source plays an important role in the information seeking experience. While data source credibility has been thoroughly studied for Web pages or blogs, the investigation of source credibility in image retrieval tasks is an emerging topic. In this paper, we first propose a novel dataset for evaluating the tagging credibility of Flickr users built with the aim of covering a large variety of topics. We present the motivation behind the need for such a dataset, the methodology used for its creation and detail important statistics on the number of users, images and rater agreement scores. Next, we define both a supervised learning task in which we group the users in 5 credibility classes and a credible user retrieval problem. Besides a couple of credibility features described in previous work, we propose a novel set of credibility estimators, with an emphasis on text based descriptors. Finally, we prove the usefulness of our evaluation dataset and justify the performances of the proposed credibility descriptors by showing promising results for both of the proposed tasks.

Alexandru Lucian Ginsca, Adrian Popescu, Mihai Lupu, Adrian Iftene, Ioannis Kanellos

Web and Social Media

Frontmatter
Tweet Expansion Method for Filtering Task in Twitter

In this article we propose a supervised method for expanding tweet contents to improve the recall of the tweet filtering task in online reputation management systems. Our method does not use any external resources. It consists of creating a K-NN classifier in three steps. In these steps the tweets labeled related and unrelated in the training set are expanded by extracting and adding the most discriminative terms, calculating and adding the most frequent terms, and re-weighting the original tweet terms from the training set. Our experiments on the RepLab 2013 data set show that our method improves the performance of the filtering task, in terms of the F criterion, by up to 13% over state-of-the-art classifiers such as SVM. This data set consists of 61 entities from the automotive, banking, universities, and music domains.

Payam Karisani, Farhad Oroumchian, Maseud Rahgozar
Real-Time Entity-Based Event Detection for Twitter

In recent years there has been a surge of interest in using Twitter to detect real-world events. However, many state-of-the-art event detection approaches are either too slow for real-time application, or can detect only specific types of events effectively. We examine the role of named entities and use them to enhance event detection. Specifically, we use a clustering technique which partitions documents based upon the entities they contain, and burst detection and cluster selection techniques to extract clusters related to on-going real-world events. We evaluate our approach on a large-scale corpus of 120 million tweets covering more than 500 events, and show that it is able to detect significantly more events than current state-of-the-art approaches whilst also improving precision and retaining low computational complexity. We find that nouns and verbs play different roles in event detection and that the use of hashtags and retweets leads to a decrease in effectiveness when using our entity-based approach.

Andrew J. McMinn, Joemon M. Jose
A Comparative Study of Click Models for Web Search

Click models have become an essential tool for understanding user behavior on a search engine result page, running simulated experiments and predicting relevance. Dozens of click models have been proposed, all aiming to tackle problems stemming from the complexity of user behavior or of contemporary result pages. Many models have been evaluated using proprietary data, hence the results are hard to reproduce. The choice of baseline models is not always motivated and the fairness of such comparisons may be questioned. In this study, we perform a detailed analysis of all major click models for web search ranging from very simplistic to very complex. We employ a publicly available dataset, open-source software and a range of evaluation techniques, which makes our results both representative and reproducible. We also analyze the query space to show what type of queries each model can handle best.

Artem Grotov, Aleksandr Chuklin, Ilya Markov, Luka Stout, Finde Xumara, Maarten de Rijke
Evaluation of Pseudo Relevance Feedback Techniques for Cross Vertical Aggregated Search

Cross vertical aggregated search is a special form of meta search, where multiple search engines from different domains and with varying behaviour are combined to produce a single search result for each query. Such a setting poses a number of challenges, among them the question of how best to evaluate the quality of the aggregated search results. We devised an evaluation strategy together with an evaluation platform in order to conduct a series of experiments. In particular, we are interested in whether pseudo relevance feedback helps in such a scenario. Therefore we implemented a number of pseudo relevance feedback techniques based on knowledge bases, where the knowledge base is either Wikipedia or a combination of the underlying search engines themselves. While conducting the evaluations we gathered a number of qualitative and quantitative results and gained insights on how different users compare the quality of search result lists. With regard to pseudo relevance feedback, we found that using Wikipedia as the knowledge base generally provides a benefit, except for entity centric queries, which target single persons or organisations. Our results will help to steer the development of cross vertical aggregated search engines and will also help to guide large scale evaluation strategies, for example using crowd sourcing techniques.

Hermann Ziak, Roman Kern

Long Papers with Short Presentation

Frontmatter
Analysing the Role of Representation Choices in Portuguese Relation Extraction

Relation Extraction is the task of identifying and classifying the semantic relations between entities in text. This task is one of the main challenges in Natural Language Processing. In this work, the relation extraction task is treated as a sequence labelling problem. We analysed the impact of different representation schemes for the relation descriptors. In particular, we analysed the performance of the BIO and IO schemes with a Conditional Random Fields classifier for the extraction of any relation descriptor occurring between named entities in the Organisation domain (Person, Organisation, Place). Overall, the classifier proposed here presents the best results using the IO notation.

Sandra Collovini, Marcelo de Bairros P. Filho, Renata Vieira
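The difference between the two representation schemes named in the abstract above can be illustrated with a minimal sketch. It only shows how the same relation descriptor is labelled under BIO and IO; the function name `tag_descriptor`, the label `REL`, and the English example sentence are hypothetical illustrations, not taken from the paper, and no CRF training is shown.

```python
from typing import List, Tuple


def tag_descriptor(tokens: List[str], span: Tuple[int, int], scheme: str = "BIO") -> List[str]:
    """Label each token for a relation descriptor covering tokens[start:end].

    BIO marks the first descriptor token with B- and the rest with I-;
    IO uses I- for every descriptor token. Tokens outside the span get O.
    The label name REL is an assumption made for this illustration.
    """
    start, end = span
    tags = []
    for i in range(len(tokens)):
        if start <= i < end:
            if scheme == "BIO" and i == start:
                tags.append("B-REL")
            else:
                tags.append("I-REL")
        else:
            tags.append("O")
    return tags


# Hypothetical sentence: a Person and an Organisation linked by a descriptor.
tokens = ["Maria", "is", "the", "director", "of", "Acme"]
bio = tag_descriptor(tokens, (1, 5), "BIO")
io = tag_descriptor(tokens, (1, 5), "IO")
```

Under IO the classifier has one fewer label to distinguish, at the cost of not being able to separate two adjacent descriptors; whether that trade-off explains the reported result is not stated in the abstract.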
An Investigation of Cross-Language Information Retrieval for User-Generated Internet Video

Increasing amounts of user-generated video content are being uploaded to online repositories. This content is often very uneven in quality and topical coverage in different languages. The lack of material in individual languages means that cross-language information retrieval (CLIR) within these collections is required to satisfy the user’s information need. Search over this content is dependent on available metadata, which includes user-generated annotations and often noisy transcripts of spoken audio. The effectiveness of CLIR depends on translation quality between query and content languages. We investigate CLIR effectiveness for the blip10000 archive of user-generated Internet video content. We examine the retrieval effectiveness using the title and free-text metadata provided by the uploader and automatic speech recognition (ASR) generated transcripts. Retrieval is carried out using the Divergence From Randomness models, and automatic translation using Google Translate. Our experimental investigation indicates that different sources of evidence have different retrieval effectiveness and in particular differing levels of performance in CLIR. Specifically, we find that the retrieval effectiveness of the ASR source is significantly degraded in CLIR. Our investigation also indicates that for this task the Title source provides the most robust source of evidence for CLIR, and performs best when used in combination with other sources of evidence. We suggest areas for investigation to give most effective and robust CLIR performance for user-generated content.

Ahmad Khwileh, Debasis Ganguly, Gareth J. F. Jones
Benchmark of Rule-Based Classifiers in the News Recommendation Task

In this paper, we present experiments evaluating Association Rule Classification algorithms on on-line and off-line recommender tasks of the CLEF NewsReel 2014 Challenge. The second focus of the experimental evaluation is to investigate possible performance optimizations of the Classification Based on Associations algorithm. Our findings indicate that pruning steps in CBA reduce the number of association rules substantially while not affecting accuracy. Using only part of the data employed for the rule learning phase in the pruning phase may also reduce training time while not affecting accuracy significantly.

Tomáš Kliegr, Jaroslav Kuchař
Enhancing Medical Information Retrieval by Exploiting a Content-Based Recommender Method

Information Retrieval (IR) systems seek to find information which is relevant to a searcher’s information needs. Improving IR effectiveness using personalization has been a significant focus of research attention in recent years. However, in some situations there may be no opportunity to learn about the interests of a specific user on a certain topic. This is a particular problem for medical IR where individuals find themselves needing information on topics for which they have never previously searched. However, in all likelihood other users will have searched with the same information need previously. This presents an opportunity to IR researchers attempting to improve search effectiveness by exploiting previous user search behaviour. We describe a method to enhance IR in the medical domain based on recommender systems (RSs) by using a content-based recommender model in combination with a standard IR model. We use search behaviour data from previous users with similar interests to aid the current user to discover better search results. We demonstrate the effectiveness of this method using a test dataset collected as part of the EU FP7 Khresmoi project.

Wei Li, Gareth J. F. Jones
Summarizing Citation Contexts of Scientific Publications

As the number of publications is increasing rapidly, it becomes increasingly difficult for researchers to find the existing scientific papers most relevant to their work, even when the domain is limited. To overcome this, it is common to use paper summarization techniques in specific domains. In contrast to approaches that exploit the paper content itself, in this paper we perform summarization of the citation context of a paper. For this, we adjust and apply existing summarization techniques and we come up with a hybrid method, based on clustering and latent semantic analysis. We apply this to medical informatics publications and compare the performance of methods that outscore other techniques on a standard database. Summarization of the citation context can be complementary to full text summarization, particularly to find candidate papers. The performance reached seems good for routine use even though it was only tested on a small database.

Sandra Mitrović, Henning Müller
A Multiple-Stage Approach to Re-ranking Medical Documents

The widespread use of the Web has radically changed the way people acquire medical information. Every day, patients, their caregivers, and doctors themselves search for medical information to resolve their medical information needs. However, search results provided by existing medical search engines often contain irrelevant or uninformative documents that are not appropriate for the purposes of the users. As a solution, this paper presents a method of reranking medical documents. The key concept of our method is to compute accurate similarity scores through multiple stages of re-ranking documents from the initial documents retrieved by a search engine. Specifically, our method combines query expansion with abbreviations, query expansion with discharge summary, clustering-based document scoring, centrality-based document scoring, and pseudo relevance feedback with relevance model. The experimental results from participating in Task 3a of the CLEF 2014 eHealth show the performance of our method.

Heung-Seon Oh, Yuchul Jung, Kwang-Young Kim
Exploring Behavioral Dimensions in Session Effectiveness

Studies in interactive information retrieval (IIR) indicate that expert searchers differ from novices in many ways. In the present paper, we identify a number of behavioral dimensions along which searchers differ (e.g. cost, gain and the accuracy of relevance assessment). We quantify these differences using simulated, multi-query search sessions. We then explore each dimension in turn to determine what differences are most effective in yielding superior retrieval performance. The more precise action probabilities in assessing snippets and documents contribute less to the overall cumulative gain during a session than gain and cost structures.

Teemu Pääkkönen, Kalervo Järvelin, Jaana Kekäläinen, Heikki Keskustalo, Feza Baskaya, David Maxwell, Leif Azzopardi

Short Papers

Frontmatter
Meta Text Aligner: Text Alignment Based on Predicted Plagiarism Relation

Text alignment is one of the main steps of plagiarism detection in textual environments. Considering the pattern in distribution of the common semantic elements of the two given documents, different strategies may be suitable for this task. In this paper we assume that the obfuscation level, i.e. the plagiarism type, is a function of the distribution of the common elements in the two documents. Based on this assumption, we propose Meta Text Aligner, which predicts the plagiarism relation of two given documents and employs the prediction results to select the best text alignment strategy. Thus, it will potentially perform better than the existing methods, which use the same strategy for all cases. As indicated by the experiments, we have been able to classify document pairs based on plagiarism type with a precision of 89%. Furthermore, by exploiting the predictions of the classifier to choose the proper method or the optimal configuration for each type, we have been able to improve the Plagdet score of the existing methods.

Samira Abnar, Mostafa Dehghani, Azadeh Shakery
Automatic Indexing of Journal Abstracts with Latent Semantic Analysis

The BioASQ “Task on Large-Scale Online Biomedical Semantic Indexing” charges participants with assigning semantic tags to biomedical journal abstracts. We present a system that takes as input a biomedical abstract and uses latent semantic analysis to identify similar documents in the MEDLINE database. The system then uses a novel ranking scheme to select a list of MeSH tags from candidates drawn from the most similar documents. Our approach achieved better than baseline performance in both precision and recall. We suggest several possible strategies to improve the system’s performance.

Joel Robert Adams, Steven Bedrick
Shadow Answers as an Intermediary in Email Answer Retrieval

A set of standard answers facilitates answering emails at customer care centers. Matching the text of user emails to the standard answers may not be productive because they do not necessarily have the same wording. Therefore we examine archived email-answer pairs and establish query-answer term co-occurrences. When a new user email arrives, we replace query words with the most strongly co-occurring answer words and obtain a “shadow answer”, which is a new query to retrieve standard answers. As a measure of term co-occurrence strength we test raw term co-occurrences and Pointwise Mutual Information.

Alyaa Alfalahi, Gunnar Eriksson, Eriks Sneiders
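The shadow-answer idea in the abstract above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: co-occurrence is counted once per email-answer pair, PMI is computed from document-level probabilities, each query word is replaced by its single highest-scoring answer word, and unseen query words are dropped. The names `build_pmi` and `shadow_answer` and the toy archive are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def build_pmi(pairs):
    """Compute PMI between email terms and answer terms.

    pairs: list of (email_tokens, answer_tokens) from an archive.
    Returns a dict mapping (email_term, answer_term) to its PMI score.
    """
    n = len(pairs)
    q_df, a_df, qa_df = Counter(), Counter(), Counter()
    for email, answer in pairs:
        q_terms, a_terms = set(email), set(answer)
        q_df.update(q_terms)
        a_df.update(a_terms)
        qa_df.update(product(q_terms, a_terms))
    return {
        (q, a): math.log2((c / n) / ((q_df[q] / n) * (a_df[a] / n)))
        for (q, a), c in qa_df.items()
    }


def shadow_answer(email, scores):
    """Replace each known email term by its best-scoring answer term."""
    shadow = []
    for q in email:
        candidates = {a: s for (qq, a), s in scores.items() if qq == q}
        if candidates:
            # sorted() makes tie-breaking deterministic (alphabetical).
            shadow.append(max(sorted(candidates), key=candidates.get))
    return shadow


# Toy archive of email-answer pairs (hypothetical data).
archive = [
    (["login", "failed"], ["please", "reset"]),
    (["login", "blocked"], ["please", "reset"]),
    (["invoice", "wrong"], ["please", "refund"]),
    (["invoice", "late"], ["please", "refund"]),
]
pmi = build_pmi(archive)
shadow = shadow_answer(["my", "invoice", "login"], pmi)
```

In this toy archive the word "please" occurs in every answer, so its PMI with any email term is zero and it is never chosen. Ranking by the raw pair counts instead of PMI, the other measure named in the abstract, would not penalise such ubiquitous words.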
Are Topically Diverse Documents Also Interesting?

Text interestingness is a measure for assessing the quality of documents from the users’ perspective, reflecting their willingness to read a document. Different approaches have been proposed for measuring the interestingness of texts. Most of these approaches suppose that interesting texts are also topically diverse and estimate interestingness using topical diversity. In this paper, we investigate the relation between interestingness and topical diversity. We do this on the Dutch and Canadian parliamentary proceedings. We apply an existing measure of interestingness, which is based on structural properties of the proceedings (e.g., how much interaction there is between speakers in a debate). We then compute the correlation between this measure of interestingness and topical diversity.

Our main findings are that in general there is a relatively low correlation between interestingness and topical diversity; that there are two extreme categories of documents: highly interesting, but hardly diverse (focused interesting documents) and highly diverse but not interesting documents. When we remove these two extreme types of documents there is a positive correlation between interestingness and diversity.

Hosein Azarbonyad, Ferron Saan, Mostafa Dehghani, Maarten Marx, Jaap Kamps
Modeling of the Question Answering Task in the YodaQA System

We briefly survey the current state of the art in the field of Question Answering and present the YodaQA system, an open source framework for this task and a baseline pipeline with reasonable performance. We take a holistic approach, reviewing and aiming to integrate many different question answering task definitions and approaches concerning classes of knowledge bases, question representation and answer generation. To ease performance comparisons of general-purpose QA systems, we also propose an effort in building a new reference QA testing corpus which is a curated and extended version of the TREC corpus.

Petr Baudiš, Jan Šedivý
Unfair Means: Use Cases Beyond Plagiarism

The study of plagiarism and its detection is a highly popular field of research that has witnessed increased attention over recent years. In this paper we describe the range of problems that exist within academe in the area of ‘unfair means’, which encompasses a wider range of issues of attribution, ownership and originality. Unfair means offers a variety of problems that may benefit from the development of computational methods, thereby requiring appropriate evaluation resources. This may provide further areas of focus for large-scale evaluation activities, such as PAN, and researchers in the field more generally.

Paul Clough, Peter Willett, Jessie Lim
Instance-Based Learning for Tweet Monitoring and Categorization

The CLEF RepLab 2014 Track was the occasion to investigate the robustness of instance-based learning in a complete system for tweet monitoring and categorization. The algorithm we implemented was k-Nearest Neighbors. Dealing with the domain (automotive or banking) and the language (English or Spanish), the experiments showed that the categorizer was not affected by the choice of representation: even with all learning tweets merged into one single Knowledge Base (KB), the observed performances were close to those with dedicated KBs. Interestingly, English training data in addition to the sparse Spanish data were useful for Spanish categorization (+14% in accuracy for automotive, +26% for banking). Yet, performances suffered from an overprediction of the most prevalent category. The algorithm showed the defects of its virtues: it was very robust, but not easy to improve. BiTeM/SIBtex tools for tweet monitoring are available within the DrugsListener Project page of the BiTeM website (http://bitem.hesge.ch/).

Julien Gobeill, Arnaud Gaudinat, Patrick Ruch
Are Test Collections “Real”? Mirroring Real-World Complexity in IR Test Collections

Objective evaluation of effectiveness is a major topic in the field of information retrieval (IR), as emphasized by the numerous evaluation campaigns in this area. The increasing pervasiveness of information has led to a large variety of IR application scenarios that involve different information types (modalities), heterogeneous documents and context-enriched queries. In this paper, we argue that even though the complexity of academic test collections has increased over the years, they are still too structurally simple in comparison to operational collections in real-world applications. Furthermore, research has brought up retrieval methods for very specific modalities, such as ratings, geographical coordinates and timestamps. However, it is still unclear how to systematically incorporate new modalities in IR systems. We therefore propose a categorization of modalities that not only allows analyzing the complexity of a collection but also helps to generalize methods to entire modality categories instead of being specific to a single modality. Moreover, we discuss how such a complex collection can methodically be built for use in an evaluation campaign.

Melanie Imhof, Martin Braschler
Evaluation of Manual Query Expansion Rules on a Domain Specific FAQ Collection

Frequently asked question (FAQ) knowledge bases are a convenient way to organize domain specific information. However, FAQ retrieval is challenging because the documents are short and the vocabulary is domain specific, giving rise to the lexical gap problem. To address this problem, in this paper we consider rule-based query expansion (QE) for domain specific FAQ retrieval. We build a small test collection and evaluate the potential of QE rules. While we observe some improvement for difficult queries, our results suggest that the potential of manual rule compilation is limited.

Mladen Karan, Jan Šnajder
Evaluating Learning Language Representations

Machine learning offers significant benefits for systems that process and understand natural language: (a) lower maintenance and upkeep costs than when using manually-constructed resources, (b) easier portability to new domains, tasks, or languages, and (c) robust and timely adaptation to situation-specific settings. However, the behaviour of an adaptive system is less predictable than when using an edited, stable resource, which makes quality control a continuous issue. This paper proposes an evaluation benchmark for measuring the quality, coverage, and stability of a natural language system as it learns word meaning. Inspired by existing tests for human vocabulary learning, we outline measures for the quality of semantic word representations, such as when learning word embeddings or other distributed representations. These measures highlight differences between the types of underlying learning processes as systems ingest progressively more data.

Jussi Karlgren, Jimmy Callin, Kevyn Collins-Thompson, Amaru Cuba Gyllensten, Ariel Ekgren, David Jurgens, Anna Korhonen, Fredrik Olsson, Magnus Sahlgren, Hinrich Schütze
Automatic Segmentation and Deep Learning of Bird Sounds

We present a study on automatic birdsong recognition with deep neural networks using the birdclef2014 dataset. Through deep learning, feature hierarchies are learned that represent the data on several levels of abstraction. Deep learning has been applied with success to problems in fields such as music information retrieval and image recognition, but its use in bioacoustics is rare. Therefore, we investigate the application of a common deep learning technique (deep neural networks) in a classification task using songs from Amazonian birds. We show that various deep neural networks are capable of outperforming other classification methods. Furthermore, we present an automatic segmentation algorithm that is capable of separating bird sounds from non-bird sounds.

Hendrik Vincent Koops, Jan van Balen, Frans Wiering
The Impact of Noise in Web Genre Identification

Genre detection of web documents fits an open-set classification task. Web documents that do not belong to any predefined genre, or in which multiple genres co-exist, are considered noise. In this work we study the impact of noise on automated genre identification within an open-set classification framework. We examine alternative classification models and document representation schemes based on two corpora, one without noise and one with noise, showing that the recently proposed RFSE model can remain robust with noise. Moreover, we show that the identification of certain genres is not practically affected by the presence of noise.

Dimitrios Pritsos, Efstathios Stamatatos
On the Multilingual and Genre Robustness of EmoGraphs for Author Profiling in Social Media

Author profiling aims at identifying different traits such as age and gender of an author on the basis of her writings. We propose the novel EmoGraph graph-based approach where morphosyntactic categories are enriched with semantic and affective information. In this work we focus on testing the robustness of EmoGraphs when applied to age and gender identification. Results with PAN-AP-14 corpus show the competitiveness of the representation over genres and languages. Finally, some interesting insights are shown, for example with topic and emotion bounded genres such as hotel reviews.

Francisco Rangel, Paolo Rosso
Is Concept Mapping Useful for Biomedical Information Retrieval?

Concepts have been extensively used in biomedical information retrieval (BIR), but the experimental results have often shown limited or no improvement compared to a traditional bag-of-words method. In this paper, we analyze the problems in concept mapping, and show how they can affect the results of BIR. This suggests a flexible utilization of the identified concepts.

Wei Shen, Jian-Yun Nie
Using Health Statistics to Improve Medical and Health Search

We present a probabilistic information retrieval (IR) model that incorporates epidemiological data and simple patient profiles that are composed of a patient’s sex and age. This approach is intended to improve retrieval effectiveness in the health and medical domain. We evaluated our approach on the TREC Clinical Decision Support Track 2014. The new approach performed better than a baseline run; however, at this time we cannot report any statistically significant improvements.

Tawan Sierek, Allan Hanbury
Determining Window Size from Plagiarism Corpus for Stylometric Features

The sliding window concept is a common method for computing a profile of a document with unknown structure. This paper outlines an experiment with a stylometric word-based feature in order to determine an optimal size of the sliding window. It was conducted for a vocabulary richness method called ‘average word frequency class’ using the PAN 2015 source retrieval training corpus for plagiarism detection. The paper shows the pros and cons of stop word removal for sliding window document profiling and discusses the utilization of the selected feature for intrinsic plagiarism detection. The experiment resulted in the recommendation of setting the sliding window to around 100 words in length for computing the text profile using the average word frequency class stylometric feature.

Šimon Suchomel, Michal Brandejs
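The sliding-window profile described in the abstract above can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes the commonly used definition of word frequency class, floor(log2(f(w*) / f(w))), where w* is the most frequent word in a reference corpus. The function names, the step size, and the use of a tiny toy text as its own reference corpus are assumptions made for the example.

```python
import math
from collections import Counter


def word_frequency_classes(reference_tokens):
    """Map each word to floor(log2(f_max / f_word)) over a reference corpus."""
    freq = Counter(reference_tokens)
    top = max(freq.values())
    return {w: math.floor(math.log2(top / f)) for w, f in freq.items()}


def sliding_profile(tokens, classes, size=100, step=50):
    """Average word frequency class for each window of `size` tokens.

    Words missing from `classes` are ignored; a window with no known
    words gets 0.0. The default size follows the roughly 100-word
    recommendation in the abstract; the step is an arbitrary choice.
    """
    profile = []
    last_start = max(len(tokens) - size, 0)
    for start in range(0, last_start + 1, step):
        window = tokens[start:start + size]
        known = [classes[w] for w in window if w in classes]
        profile.append(sum(known) / len(known) if known else 0.0)
    return profile


# Toy text used as its own reference corpus (an assumption for brevity).
text = "the cat sat on the mat the end".split()
classes = word_frequency_classes(text)
profile = sliding_profile(text, classes, size=4, step=2)
```

In an intrinsic plagiarism setting, windows whose average deviates strongly from the rest of the profile would be the candidates for closer inspection; the threshold for "strongly" is not given in the abstract.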
Effect of Log-Based Query Term Expansion on Retrieval Effectiveness in Patent Searching

In this paper we study the impact of query term expansion (QTE) using synonyms on patent document retrieval. We use an automatically generated lexical database from USPTO query logs, called PatNet, which provides synonyms and equivalents for a query term. Our experiments on the CLEF-IP 2010 benchmark dataset show that automatic query expansion using PatNet tends to decrease or only slightly improve the retrieval effectiveness, with no significant improvement. An analysis of the retrieval results shows that PatNet does not generally have a negative effect on the retrieval effectiveness. Recall is drastically improved for query topics where the baseline queries achieve, on average, only low recall values. But we have not detected any commonality that allows us to characterize these queries. So we recommend using PatNet for semi-automatic QTE in Boolean retrieval, where expanding query terms with synonyms and equivalents with the aim of expanding the query scope is a common practice.

Wolfgang Tannebaum, Parvaz Mahdabi, Andreas Rauber
Integrating Mixed-Methods for Evaluating Information Access Systems

The evaluation of information access systems is increasingly making use of multiple evaluation methods. While such studies represent forms of mixed-methods research, they are rarely acknowledged as such. This means that researchers are potentially failing to recognise the challenges and opportunities offered by multi-phase research, particularly in terms of data integration. This paper provides a brief case study of how one framework, Bazeley & Kemp’s metaphors for integrated analysis, was employed to formalise data integration for a large exploratory evaluation study.

Simon Wakeling, Paul Clough
Teaching the IR Process Using Real Experiments Supported by Game Mechanics

We present a web-based tool for teaching and learning the information retrieval process. An interactive approach helps students gain practical knowledge. Our focus is the arrangement and configuration of IR components and their evaluation. The incorporation of game mechanics counteracts information overload and motivates progression.

Thomas Wilhelm-Stein, Maximilian Eibl
Tweet Contextualization Using Association Rules Mining and DBpedia

Tweets are short messages, limited to 140 characters, that do not always conform to proper spelling rules. This spelling variation makes them hard to understand without some kind of context. For these reasons, the tweet contextualization task was introduced, aiming to provide automatic contexts to explain the tweets. In this paper, we present two tweet contextualization approaches. The first is a method based on mining inter-term association rules; the second makes use of the DBpedia ontology. These approaches allow us to augment the vocabulary of a given tweet with a set of thematically related words. We conducted an experimental study on the INEX2014 collection to assess the effectiveness of our approaches, and the obtained results are very promising.

Meriem Amina Zingla, Chiraz Latiri, Yahya Slimani

Best of the Labs

Frontmatter
Search-Based Image Annotation: Extracting Semantics from Similar Images

The importance of automatic image annotation as a tool for handling large amounts of image data has been recognized for several decades. However, working tools have long been limited to narrow-domain problems with a few target classes for which precise models could be trained. With the advance of similarity searching, it now becomes possible to employ a different approach: extracting information from large amounts of noisy web data. However, several issues need to be resolved, including acquiring a suitable knowledge base, choosing a suitable visual content descriptor, implementing an effective and efficient similarity search engine, and extracting semantics from similar images. In this paper, we address these challenges and present a working annotation system based on the search-based paradigm, which achieved good results in the 2014 ImageCLEF Scalable Concept Image Annotation challenge.

Petra Budikova, Michal Batko, Jan Botorek, Pavel Zezula
NLP-Based Classifiers to Generalize Expert Assessments in E-Reputation

Online Reputation Management (ORM) is currently dominated by expert abilities. One of the great challenges is to effectively collect annotated training samples, especially to be able to generalize a small pool of expert feedback from area scale to a more global scale. One possible solution is to use advanced Machine Learning (ML) techniques to select annotations from training samples and propagate them effectively and concisely. We focus on the critical issue of understanding the different levels of annotations. Using the framework proposed by the RepLab contest, we present a considerable number of experiments in Reputation Monitoring and Author Profiling. The proposed methods rely on a large variety of Natural Language Processing (NLP) methods exploiting tweet contents and some background contextual information. We show that simple algorithms considering only tweet content are effective compared with state-of-the-art techniques.

Jean-Valère Cossu, Emmanuel Ferreira, Killian Janod, Julien Gaillard, Marc El-Bèze
A Method for Short Message Contextualization: Experiments at CLEF/INEX

This paper presents the approach we developed for automatic multi-document summarization applied to short message contextualization, in particular to tweet contextualization. The proposed method is based on named entity recognition, part-of-speech weighting and sentence quality measuring. In contrast to previous research, we introduced an algorithm for smoothing from the local context. Our approach exploits the topic-comment structure of a text. Moreover, we developed a graph-based algorithm for sentence reordering. The method has been evaluated at the INEX/CLEF tweet contextualization track. We provide the evaluation results over the 4 years of the track. The method was also adapted to snippet retrieval and query expansion. The evaluation results indicate good performance of the approach.

Liana Ermakova
Towards Automatic Large-Scale Identification of Birds in Audio Recordings

This paper presents a computer-based technique for bird species identification at large scale. It automatically identifies multiple species simultaneously in a large number of audio recordings and provides the basis for the best scoring submission to the LifeCLEF 2014 Bird Identification Task. The method achieves a Mean Average Precision of 51.1% on the test set and 53.9% on the training set with an Area Under the Curve of 91.5% during cross-validation. Besides a general description of the underlying classification approach, a number of additional research questions are addressed regarding the choice of features, selection of classifier hyperparameters and method of classification.

Mario Lasseck
Optimizing and Evaluating Stream-Based News Recommendation Algorithms

Recommender algorithms are powerful tools helping users to find interesting items in the overwhelming amount of available data. Classic recommender algorithms are trained based on a huge set of user-item interactions collected in the past. Since the learning of models is computationally expensive, it is difficult to integrate new knowledge into the recommender models. With the growing importance of social networks, the huge amount of data generated by the real-time web (e.g. news portals, micro-blogging services), and the ubiquity of personalized web portals, stream-based recommender systems have become a focus of research.

In this paper we develop algorithms tailored to the requirements of a web-based news recommendation scenario. The algorithms address the specific challenges of news recommendations, such as the context-dependent relevance of news items and the short item lifecycle, which forces the recommender algorithms to continuously adapt to the set of news articles. In addition, the scenario is characterized by a huge amount of messages (that must be processed per second) and by tight time constraints resulting from the fact that news recommendations should be embedded into webpages without a delay. For evaluating and optimizing the recommender algorithms we implement an evaluation framework, allowing us to analyze and compare different recommender algorithms in different contexts. We discuss the strengths and weaknesses according to both recommendation precision and technical complexity. We show how the evaluation framework enables us to find the optimal recommender algorithm for specific scenarios and contexts.

Andreas Lommatzsch, Sebastian Werner
Information Extraction from Clinical Documents: Towards Disease/Disorder Template Filling

In recent years there has been an increase in the generation of electronic health records (EHRs), which has led to increased scope for research on biomedical literature. Many research works have used various NLP, information retrieval and machine learning techniques to extract information from these records. In this paper, we provide a methodology to extract information for understanding the status of the disease/disorder. The status of a disease/disorder is based on different attributes like temporal information, severity and progression of the disease. Here, we consider ten attributes that allow us to understand the majority of details regarding the status of the disease/disorder. They are Negation Indicator, Subject Class, Uncertainty Indicator, Course Class, Severity Class, Conditional Class, Generic Class, Body Location, DocTime Class, and Temporal Expression. We present rule-based and machine learning approaches to identify each of these attributes and evaluate our system on attribute-level and system-level accuracies. This project was done as a part of the ShARe/CLEF eHealth Evaluation Lab 2014. We were able to achieve state-of-the-art accuracy (0.868) in identifying normalized values of the attributes.

Veera Raghavendra Chikka, Nestor Mariyasagayam, Yoshiki Niwa, Kamalakar Karlapalem
Adaptive Algorithm for Plagiarism Detection: The Best-Performing Approach at PAN 2014 Text Alignment Competition

The task of (monolingual) text alignment consists in finding similar text fragments between two given documents. It has applications in plagiarism detection, detection of text reuse, author identification, authoring aid, and information retrieval, to mention only a few. We describe our approach to the text alignment subtask of the plagiarism detection competition at PAN 2014, which resulted in the best-performing system at the PAN 2014 competition and outperforms the best-performing system of the PAN 2013 competition by the cumulative evaluation measure Plagdet. Our method relies on a sentence similarity measure based on a tf-idf-like weighting scheme that permits us to consider stopwords without increasing the rate of false positives. We introduce a recursive algorithm to extend the ranges of matching sentences to maximal length passages. We also introduce a novel filtering method to resolve overlapping plagiarism cases. Our system is available as open source.

Miguel A. Sanchez-Perez, Alexander Gelbukh, Grigori Sidorov
Question Answering via Phrasal Semantic Parsing

Understanding natural language questions and converting them into structured queries has been considered a crucial way to help users access large-scale structured knowledge bases. However, the task usually involves two main challenges: recognizing users’ query intention and mapping the involved semantic items against a given knowledge base (KB). In this paper, we propose an efficient pipeline framework to model a user’s query intention as a phrase-level dependency DAG, which is then instantiated regarding a specific KB to construct the final structured query. Our model benefits from the efficiency of linear structured prediction models and the separation of KB-independent and KB-related modelings. We evaluate our model on two datasets, and the experimental results show that our method outperforms the state-of-the-art methods on the Free917 dataset and that, with limited training data from Free917, our model can smoothly adapt to a new, challenging dataset, WebQuestion, without extra training effort while maintaining promising performance.

Kun Xu, Yansong Feng, Songfang Huang, Dongyan Zhao

Labs Overviews

Frontmatter
Overview of the CLEF eHealth Evaluation Lab 2015

This paper reports on the 3rd CLEFeHealth evaluation lab, which continues our evaluation resource building activities for the medical domain. In this edition of the lab, we focus on making it easier for patients and nurses to author, understand, and access eHealth information. The 2015 CLEFeHealth evaluation lab was structured into two tasks, focusing on evaluating methods for information extraction (IE) and information retrieval (IR). The IE task introduced two new challenges. Task 1a focused on clinical speech recognition of nursing handover notes; Task 1b focused on clinical named entity recognition in languages other than English, specifically French. Task 2 focused on the retrieval of health information to answer queries issued by general consumers seeking information to understand their health symptoms or conditions.

The number of teams registering their interest was 47 in Task 1 (2 teams in Task 1a and 7 teams in Task 1b) and 53 in Task 2 (12 teams) for a total of 20 unique teams. The best system recognized 4,984 out of 6,818 test words correctly and generated 2,626 incorrect words (i.e., 38.5% error) in Task 1a; had the F-measure of 0.756 for plain entity recognition, 0.711 for normalized entity recognition, and 0.872 for entity normalization in Task 1b; and resulted in P@10 of 0.5394 and nDCG@10 of 0.5086 in Task 2. These results demonstrate the substantial community interest and capabilities of these systems in addressing challenges faced by patients and nurses. As in previous years, the organizers have made data and tools available for future research and development.

Lorraine Goeuriot, Liadh Kelly, Hanna Suominen, Leif Hanlen, Aurélie Névéol, Cyril Grouin, João Palotti, Guido Zuccon
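
The Task 1a error figure quoted in the eHealth overview above can be checked with a short calculation. This sketch assumes the percentage is the number of incorrect words divided by the number of test words; the abstract does not spell out the formula, and the variable and function names are illustrative only.

```python
def word_rate(word_count, reference_words):
    """Return word_count as a fraction of the reference test words."""
    return word_count / reference_words


# Counts as reported in the abstract for the best Task 1a system.
reference_words = 6818
correct_words = 4984
incorrect_words = 2626

# Assumption: error = incorrect words / test words.
error = word_rate(incorrect_words, reference_words)
correct = word_rate(correct_words, reference_words)

print(f"error rate:   {error:.1%}")
print(f"correct rate: {correct:.1%}")
```

Under this assumption the calculation gives the reported 38.5%. The correct and incorrect counts add up to 7,610, which is more than the 6,818 test words, so the two rates are not complements of each other.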
General Overview of ImageCLEF at the CLEF 2015 Labs

This paper presents an overview of the ImageCLEF 2015 evaluation campaign, an event that was organized as part of the CLEF labs 2015. ImageCLEF is an ongoing initiative that promotes the evaluation of technologies for annotation, indexing and retrieval for providing information access to databases of images in various usage scenarios and domains. In 2015, the 13th edition of ImageCLEF, four main tasks were proposed: 1) automatic concept annotation, localization and sentence description generation for general images; 2) identification, multi-label classification and separation of compound figures from biomedical literature; 3) clustering of x-rays from all over the body; and 4) prediction of missing radiological annotations in reports of liver CT images. The x-ray task was the only fully novel task this year, although the other three tasks introduced modifications to maintain the relevance of the proposed challenges. Participation in this edition of the lab was strong: it received almost twice the number of submitted working notes papers compared to previous years.

Mauricio Villegas, Henning Müller, Andrew Gilbert, Luca Piras, Josiah Wang, Krystian Mikolajczyk, Alba G. Seco de Herrera, Stefano Bromuri, M. Ashraful Amin, Mahmood Kazi Mohammed, Burak Acar, Suzan Uskudarli, Neda B. Marvasti, José F. Aldana, María del Mar Roldán García
LifeCLEF 2015: Multimedia Life Species Identification Challenges

Using multimedia identification tools is considered one of the most promising solutions to help bridge the taxonomic gap and build accurate knowledge of the identity, the geographic distribution and the evolution of living species. Large and structured communities of nature observers (e.g. eBird, Xeno-canto, Tela Botanica, etc.) as well as large monitoring equipment have started to produce outstanding collections of multimedia records. Unfortunately, the performance of the state-of-the-art analysis techniques on such data is still not well understood and is far from reaching the real world’s requirements. The LifeCLEF lab proposes to evaluate these challenges around 3 tasks related to multimedia information retrieval and fine-grained classification problems in 3 living worlds. Each task is based on large and real-world data, and the measured challenges are defined in collaboration with biologists and environmental stakeholders in order to reflect realistic usage scenarios. This paper presents more particularly the 2015 edition of LifeCLEF. For each of the three tasks, we report the methodology and the data sets as well as the raw results and the main outcomes.

Alexis Joly, Hervé Goëau, Hervé Glotin, Concetto Spampinato, Pierre Bonnet, Willem-Pier Vellinga, Robert Planqué, Andreas Rauber, Simone Palazzo, Bob Fisher, Henning Müller
Overview of the Living Labs for Information Retrieval Evaluation (LL4IR) CLEF Lab 2015

In this paper we report on the first Living Labs for Information Retrieval Evaluation (LL4IR) CLEF Lab. Our main goal with the lab is to provide a benchmarking platform for researchers to evaluate their ranking systems in a live setting with real users in their natural task environments. For this first edition of the challenge we focused on two specific use-cases: product search and web search. Ranking systems submitted by participants were experimentally compared using interleaved comparisons to the production system from the corresponding use-case. In this paper we describe how these experiments were performed, what the resulting outcomes are, and conclude with some lessons learned.

Anne Schuth, Krisztian Balog, Liadh Kelly
Stream-Based Recommendations: Online and Offline Evaluation as a Service

Providing high-quality news recommendations is a challenging task because the set of potentially relevant news items changes continuously, the relevance of news highly depends on the context, and there are tight time constraints for computing recommendations. The CLEF NewsREEL challenge is a campaign-style evaluation lab allowing participants to evaluate and optimize news recommender algorithms online and offline. In this paper, we discuss the objectives and challenges of the NewsREEL lab. We motivate the metrics used for benchmarking the recommender algorithms and explain the challenge dataset. In addition, we introduce the evaluation framework that we have developed. The framework makes possible the reproducible evaluation of recommender algorithms for stream data, taking into account recommender precision as well as the technical complexity of the recommender algorithms.

Benjamin Kille, Andreas Lommatzsch, Roberto Turrin, András Serény, Martha Larson, Torben Brodt, Jonas Seiler, Frank Hopfgartner
Overview of the PAN/CLEF 2015 Evaluation Lab

This paper presents an overview of the PAN/CLEF evaluation lab. During the last decade, PAN has been established as the main forum of text mining research focusing on the identification of personal traits of authors left behind in texts unintentionally. PAN 2015 comprises three tasks: plagiarism detection, author identification and author profiling, each studying important variations of these problems. In plagiarism detection, community-driven corpus construction is introduced as a new way of developing evaluation resources with diversity. In author identification, cross-topic and cross-genre author verification (where the texts of known and unknown authorship do not match in topic and/or genre) is introduced. A new corpus was built for this challenging, yet realistic, task covering four languages. In author profiling, in addition to usual author demographics, such as gender and age, five personality traits are introduced (openness, conscientiousness, extraversion, agreeableness, and neuroticism) and a new corpus of Twitter messages covering four languages was developed. In total, 53 teams participated in all three tasks of PAN 2015 and, following the practice of previous editions, software submissions were required and evaluated within the TIRA experimentation framework.

Efstathios Stamatatos, Martin Potthast, Francisco Rangel, Paolo Rosso, Benno Stein
Overview of the CLEF Question Answering Track 2015

This paper describes the CLEF QA Track 2015. Following the scenario stated last year for the CLEF QA Track, the starting point for accessing information is always a Natural Language question. However, answering some questions may need to query Linked Data (especially if aggregations or logical inferences are required), some questions may need textual inferences and querying free-text, and finally, answering some queries may require both sources of information. In this edition, the Track was divided into four tasks: (i) QALD: focused on translating natural language questions into SPARQL; (ii) Entrance Exams: focused on answering questions to assess machine reading capabilities; (iii) BioASQ1: focused on large-scale semantic indexing; and (iv) BioASQ2: focused on Question Answering in the biomedical domain.

Anselmo Peñas, Christina Unger, Georgios Paliouras, Ioannis Kakadiaris
Overview of the CLEF 2015 Social Book Search Lab

The Social Book Search (SBS) Lab investigates book search in scenarios where users search with more than just a query, and look for more than objective metadata. Real-world information needs are generally complex, yet almost all research focuses instead on either relatively simple search based on queries or recommendation based on profiles. The goal is to research and develop techniques to support users in complex book search tasks. The SBS Lab has two tracks. The aim of the Suggestion Track is to develop test collections for evaluating ranking effectiveness of book retrieval and recommender systems. The aim of the Interactive Track is to develop user interfaces that support users through each stage during complex search tasks and to investigate how users exploit professional metadata and user-generated content.

Marijn Koolen, Toine Bogers, Maria Gäde, Mark Hall, Hugo Huurdeman, Jaap Kamps, Mette Skov, Elaine Toms, David Walsh
Backmatter
Metadata
Title
Experimental IR Meets Multilinguality, Multimodality, and Interaction
Editors
Josanne Mothe
Jacques Savoy
Jaap Kamps
Karen Pinel-Sauvagnat
Gareth Jones
Eric San Juan
Linda Capellato
Nicola Ferro
Copyright Year
2015
Electronic ISBN
978-3-319-24027-5
Print ISBN
978-3-319-24026-8
DOI
https://doi.org/10.1007/978-3-319-24027-5
