
About this book

In its first ten years of activities (2000-2009), the Cross-Language Evaluation Forum (CLEF) played a leading role in stimulating investigation and research in a wide range of key areas in the information retrieval domain, such as cross-language question answering, image and geographic information retrieval, interactive retrieval, and many more. It also promoted the study and implementation of appropriate evaluation methodologies for these diverse types of tasks and media. As a result, CLEF has been extremely successful in building a wide, strong, and multidisciplinary research community, which covers and spans the different areas of expertise needed to deal with the spread of CLEF tracks and tasks. This constantly growing and almost completely voluntary community has dedicated an incredible amount of effort to making CLEF happen and is at the core of the CLEF achievements. CLEF 2010 represented a radical innovation of the “classic CLEF” format and an experiment aimed at understanding how “next generation” evaluation campaigns might be structured. We had to face the problem of how to innovate CLEF while still preserving its traditional core business, namely the benchmarking activities carried out in the various tracks and tasks. The consensus, after lively and community-wide discussions, was to make CLEF an independent four-day event, no longer organized in conjunction with the European Conference on Research and Advanced Technology for Digital Libraries (ECDL) where CLEF has been running as a two-and-a-half-day workshop. CLEF 2010 thus consisted of two main parts: a peer-reviewed conference – the first two days – and a series of laboratories and workshops – the second two days.



Keynote Addresses

IR between Science and Engineering, and the Role of Experimentation

Evaluation has always played a major role in IR research, as a means for judging the quality of competing models. Lately, however, we have seen an over-emphasis on experimental results, thus favoring engineering approaches aiming at tuning performance and neglecting other scientific criteria. A recent study investigated the validity of experimental results published at major conferences, showing that for 95% of the papers using standard test collections, the claimed improvements were only relative, and the resulting quality was inferior to that of the top performing systems [AMWZ09].
In this talk, it is claimed that IR is still in its scientific infancy. Despite the extensive efforts in evaluation initiatives, the scientific insights gained are still very limited – partly due to shortcomings in the design of the testbeds. From a general scientific standpoint, using test collections for evaluation only is a waste of resources. Instead, experimentation should be used for hypothesis generation and testing in general, in order to accumulate a better understanding of the retrieval process and to develop a broader theoretic foundation for the field.
Norbert Fuhr

Retrieval Evaluation in Practice

Nowadays, most research on retrieval evaluation is about comparing different systems to determine which is the best one, using a standard document collection and a set of queries with relevance judgements, such as TREC. Retrieval quality baselines are usually also standard, such as BM25. However, in an industrial setting, reality is much harder. First, real Web collections are much larger – billions of documents – and the number of all relevant answers for most queries could be several million. Second, the baseline is the competition, so you cannot use a weak baseline. Third, good average quality is not enough if, for example, a significant fraction of the answers have quality well below average. On the other hand, search engines have hundreds of millions of users and hence click-through data can and should be used for evaluation.
In this invited talk we explore important problems that arise in practice. Some of them are: Which queries are already well answered and which are the difficult queries? Which queries and how many answers per query should be judged by editors? How can we use clicks for retrieval evaluation? What retrieval measure should we use? What is the impact of culture, geography or language on these questions?
None of these questions is trivial, and they depend on each other, so we only give partial solutions. Hence, the main message to take away is that more research in retrieval evaluation is certainly needed.
Ricardo Baeza-Yates

Resources, Tools, and Methods

A Dictionary- and Corpus-Independent Statistical Lemmatizer for Information Retrieval in Low Resource Languages

We present a dictionary- and corpus-independent statistical lemmatizer, StaLe, that deals with the out-of-vocabulary (OOV) problem of dictionary-based lemmatization by generating candidate lemmas for any inflected word form. StaLe can be applied with little effort to languages lacking linguistic resources. We show the performance of StaLe both in lemmatization tasks alone and as a component in an IR system using several datasets and query types in four high resource languages. StaLe is competitive, reaching 88-108% of gold standard performance of a commercial lemmatizer in IR experiments. Despite competitive performance, it is compact, efficient and fast to apply to new languages.
Aki Loponen, Kalervo Järvelin

A New Approach for Cross-Language Plagiarism Analysis

This paper presents a new method for Cross-Language Plagiarism Analysis. Our task is to detect the plagiarized passages in the suspicious documents and their corresponding fragments in the source documents. We propose a plagiarism detection method composed of five main phases: language normalization, retrieval of candidate documents, classifier training, plagiarism analysis, and post-processing. To evaluate our method, we created a corpus containing artificial plagiarism offenses. Two different experiments were conducted; the first one considers only monolingual plagiarism cases, while the second one considers only cross-language plagiarism cases. The results showed that the cross-language experiment achieved 86% of the performance of the monolingual baseline. We also analyzed how the plagiarized text length affects the overall performance of the method. This analysis showed that our method achieved better results with medium and large plagiarized passages.
Rafael Corezola Pereira, Viviane P. Moreira, Renata Galante

Creating a Persian-English Comparable Corpus

Multilingual corpora are valuable resources for cross-language information retrieval and are available in many language pairs. However, the Persian language does not have rich multilingual resources due to some of its special features and difficulties in constructing the corpora. In this study, we build a Persian-English comparable corpus from two independent news collections: BBC News in English and Hamshahri news in Persian. We use the similarity of the document topics and their publication dates to align the documents in these sets. We tried several alternatives for constructing the comparable corpora and assessed the quality of the corpora using different criteria. Evaluation results show the high quality of the aligned documents, and using the Persian-English comparable corpus for extracting translation knowledge seems promising.
Homa Baradaran Hashemi, Azadeh Shakery, Heshaam Faili
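
The alignment idea described in the abstract above (topic similarity combined with publication dates) can be sketched roughly as follows. This is only an illustration, not the authors' method: the Jaccard similarity, the `max_days` and `min_sim` thresholds, and the assumption that topic terms from both languages have already been mapped into one shared vocabulary are all hypothetical choices made for the example.

```python
from datetime import date

def jaccard(a, b):
    """Set overlap between two term collections (assumed similarity measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def align(english_docs, persian_docs, max_days=1, min_sim=0.3):
    """Pair each English document with its most similar Persian document
    published within max_days, if the similarity reaches min_sim.
    Both thresholds are hypothetical values chosen for illustration."""
    pairs = []
    for en_id, (en_date, en_terms) in english_docs.items():
        best = None
        for fa_id, (fa_date, fa_terms) in persian_docs.items():
            # Skip candidates whose publication dates are too far apart.
            if abs((en_date - fa_date).days) > max_days:
                continue
            sim = jaccard(en_terms, fa_terms)
            if sim >= min_sim and (best is None or sim > best[1]):
                best = (fa_id, sim)
        if best is not None:
            pairs.append((en_id, best[0], best[1]))
    return pairs

# Toy data: topic terms are assumed to be already translated to English.
english = {"en1": (date(2007, 3, 1), ["election", "parliament", "vote"])}
persian = {
    "fa1": (date(2007, 3, 2), ["election", "vote", "candidate"]),
    "fa2": (date(2007, 6, 1), ["election", "parliament", "vote"]),
}
aligned = align(english, persian)
```

In this toy data, `fa2` matches the topic of `en1` exactly but is rejected by the date window, which is the role the publication date plays as a filter.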

Experimental Collections and Datasets (1)

Validating Query Simulators: An Experiment Using Commercial Searches and Purchases

We design and validate simulators for generating queries and relevance judgments for retrieval system evaluation. We develop a simulation framework that incorporates existing and new simulation strategies. To validate a simulator, we assess whether evaluation using its output data ranks retrieval systems in the same way as evaluation using real-world data. The real-world data is obtained using logged commercial searches and associated purchase decisions. While no simulator reproduces an ideal ranking, there is a large variation in simulator performance that allows us to distinguish those that are better suited to creating artificial testbeds for retrieval experiments. Incorporating knowledge about document structure in the query generation process helps create more realistic simulators.
Bouke Huurnink, Katja Hofmann, Maarten de Rijke, Marc Bron
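
The validation criterion in the abstract above asks whether simulated data ranks retrieval systems in the same way as real-world data. One common way to quantify such agreement is a rank correlation; the sketch below uses Kendall's tau, which is an assumption here since the abstract does not name a measure. The system names and scores are invented for illustration.

```python
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score dictionaries over the same systems.

    Returns 1.0 when both evaluations order every pair of systems the same
    way and -1.0 when every pair is ordered the opposite way.
    """
    systems = sorted(scores_a)
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        diff_a = scores_a[s] - scores_a[t]
        diff_b = scores_b[s] - scores_b[t]
        if diff_a * diff_b > 0:
            concordant += 1
        elif diff_a * diff_b < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical effectiveness scores of four systems under each evaluation.
real = {"sysA": 0.42, "sysB": 0.35, "sysC": 0.30, "sysD": 0.12}
simulated = {"sysA": 0.61, "sysB": 0.48, "sysC": 0.52, "sysD": 0.20}
tau = kendall_tau(real, simulated)
```

Here the simulated data swaps only `sysB` and `sysC`, so five of the six system pairs agree with the real-world ordering.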

Using Parallel Corpora for Multilingual (Multi-document) Summarisation Evaluation

We present a method for the evaluation of multilingual multi-document summarisation that saves precious annotation time and makes the evaluation results across languages directly comparable. The approach is based on the manual selection of the most important sentences in a cluster of documents from a sentence-aligned parallel corpus, and on projecting the sentence selection to various target languages. We also present two ways of exploiting inter-annotator agreement levels, apply them both to a baseline sentence extraction summariser in seven languages, and discuss the result differences between the two evaluation versions, as well as a preliminary analysis between languages. The same method can in principle be used to evaluate single-document summarisers or information extraction tools.
Marco Turchi, Josef Steinberger, Mijail Kabadjov, Ralf Steinberger

Experimental Collections and Datasets (2)

MapReduce for Information Retrieval Evaluation: “Let’s Quickly Test This on 12 TB of Data”

We propose to use MapReduce to quickly test new retrieval approaches on a cluster of machines by sequentially scanning all documents. We present a small case study in which we use a cluster of 15 low cost machines to search a web crawl of 0.5 billion pages showing that sequential scanning is a viable approach to running large-scale information retrieval experiments with little effort. The code is available to other researchers at:
Djoerd Hiemstra, Claudia Hauff
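
The sequential-scanning idea in the abstract above can be illustrated with a single-process sketch of a map step and a reduce step. This is not the authors' code: the term-count scoring function, the function names, and the tiny in-memory collection are assumptions made for the example, and a real MapReduce job would distribute the map calls across the machines of a cluster.

```python
import heapq
from collections import Counter

def map_document(doc_id, text, query_terms):
    """Map step: score one document by counting query-term occurrences.

    The scoring function is a stand-in; any retrieval model that can be
    computed from a single document could be plugged in here.
    """
    counts = Counter(text.lower().split())
    score = sum(counts[term] for term in query_terms)
    if score > 0:
        yield (score, doc_id)

def reduce_top_k(scored, k):
    """Reduce step: keep the k highest-scoring documents."""
    return heapq.nlargest(k, scored)

def scan(collection, query, k=2):
    """Score every document in turn (no index), then keep the top k."""
    query_terms = query.lower().split()
    scored = []
    for doc_id, text in collection.items():
        scored.extend(map_document(doc_id, text, query_terms))
    return reduce_top_k(scored, k)

docs = {
    "d1": "cross language retrieval evaluation",
    "d2": "image retrieval and retrieval evaluation",
    "d3": "patent search",
}
results = scan(docs, "retrieval evaluation")
```

Because every document is scored independently, a new scoring idea only requires changing `map_document`; no index has to be rebuilt, which is the appeal of the approach for quick experiments.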

Which Log for Which Information? Gathering Multilingual Data from Different Log File Types

In this paper, a comparative analysis of different log file types and their potential for gathering information about user behavior in a multilingual information system is presented. It starts with a discussion of potential questions to be answered in order to form an appropriate view of user needs and requirements in a multilingual information environment and the possibilities of gaining this information from log files. Based on actual examples from the Europeana portal, we compare and contrast different types of log files and the information gleaned from them. We then present the Europeana Clickstream Logger, which logs and gathers extended information on user behavior, and show first examples of the data collection possibilities.
Maria Gäde, Vivien Petras, Juliane Stiller

Evaluation Methodologies and Metrics (1)

Examining the Robustness of Evaluation Metrics for Patent Retrieval with Incomplete Relevance Judgements

Recent years have seen a growing interest in research into patent retrieval. One of the key issues in conducting information retrieval (IR) research is meaningful evaluation of the effectiveness of the retrieval techniques applied to the task under investigation. Unlike many existing well explored IR tasks where the focus is on achieving high retrieval precision, patent retrieval is to a significant degree a recall focused task. The standard evaluation metric used for patent retrieval evaluation tasks is currently mean average precision (MAP). However, this does not reflect system recall well. Meanwhile, the alternative of using the standard recall measure does not reflect user search effort, which is a significant factor in practical patent search environments. In recent work we introduced a novel evaluation metric for patent retrieval evaluation (PRES) [13]. This is designed to reflect both system recall and user effort. Analysis of PRES demonstrated its greater effectiveness in evaluating recall-oriented applications than standard MAP and Recall. One dimension of the evaluation of patent retrieval which has not previously been studied is the effect on reliability of the evaluation metrics when relevance judgements are incomplete. We provide a study comparing the behaviour of PRES against the standard MAP and Recall metrics for varying incomplete judgements in patent retrieval. Experiments carried out using runs from the CLEF-IP 2009 datasets show that PRES and Recall are more robust than MAP for incomplete relevance sets for this task, with a small preference for PRES as the most robust evaluation metric for patent retrieval with respect to the completeness of the relevance set.
Walid Magdy, Gareth J. F. Jones
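
To make the notion of incomplete relevance judgements in the abstract above concrete, the following sketch recomputes average precision and recall for one ranked list after one judgement is removed. It covers only the two standard metrics; PRES is deliberately not implemented because its definition is not given here. The document identifiers and relevance sets are invented, and a single toy topic says nothing about the robustness results reported by the authors.

```python
def average_precision(ranking, relevant):
    """Average precision of a ranked list against a set of relevant documents."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def recall(ranking, relevant):
    """Fraction of the relevant documents that appear in the ranked list."""
    if not relevant:
        return 0.0
    return len(set(ranking) & relevant) / len(relevant)

ranking = ["p1", "p2", "p3", "p4", "p5", "p6"]
full_judgements = {"p1", "p3", "p6", "p9"}
# Simulate incompleteness: the judgement for p1 is missing.
partial_judgements = full_judgements - {"p1"}

ap_full = average_precision(ranking, full_judgements)
ap_partial = average_precision(ranking, partial_judgements)
recall_full = recall(ranking, full_judgements)
recall_partial = recall(ranking, partial_judgements)
```

A robustness study in this spirit would repeat the recomputation over many topics, systems, and judgement subsets, and then compare how much each metric's system ranking changes.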

On the Evaluation of Entity Profiles

Entity profiling is the task of identifying and ranking descriptions of a given entity. The task may be viewed as one where the descriptions being sought are terms that need to be selected from a knowledge source (such as an ontology or thesaurus). In this case, entity profiling systems can be assessed by means of precision and recall values of the descriptive terms produced. However, recent evidence suggests that more sophisticated metrics are needed that go beyond mere lexical matching of system-produced descriptors against a ground truth, allowing for graded relevance and rewarding diversity in the list of descriptors returned. In this note, we motivate and propose such a metric.
Maarten de Rijke, Krisztian Balog, Toine Bogers, Antal van den Bosch
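
The baseline assessment mentioned in the abstract above, precision and recall of the descriptive terms produced, can be sketched as a plain set comparison. The sketch shows only this exact-match baseline, not the metric the authors propose, and the example terms are invented.

```python
def precision_recall(predicted, ground_truth):
    """Exact-match precision and recall of predicted descriptive terms."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    matched = predicted & ground_truth
    precision = len(matched) / len(predicted) if predicted else 0.0
    recall = len(matched) / len(ground_truth) if ground_truth else 0.0
    return precision, recall

system_terms = ["information retrieval", "evaluation", "databases", "semantic search"]
gold_terms = ["information retrieval", "evaluation", "test collections"]
precision, recall = precision_recall(system_terms, gold_terms)
```

Under exact matching, a descriptor such as "semantic search" earns no credit however closely it relates to the ground truth, and the order of the returned terms is ignored. These are the kinds of limitations that motivate graded relevance and diversity-aware metrics.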

Evaluation Methodologies and Metrics (2)

Evaluating Information Extraction

The issue of how to experimentally evaluate information extraction (IE) systems has received hardly any satisfactory solution in the literature. In this paper we propose a novel evaluation model for IE and argue that, among other things, it allows (i) a correct appreciation of the degree of overlap between predicted and true segments, and (ii) a fair evaluation of the ability of a system to correctly identify segment boundaries. We describe the properties of this model, also by presenting the result of a re-evaluation of the results of the CoNLL’03 and CoNLL’02 Shared Tasks on Named Entity Extraction.
Andrea Esuli, Fabrizio Sebastiani

Tie-Breaking Bias: Effect of an Uncontrolled Parameter on Information Retrieval Evaluation

We consider Information Retrieval evaluation, especially at TREC with the trec_eval program. It appears that the scores systems obtain depend not only on the relevance of retrieved documents, but also on document names in case of ties (i.e., when documents are retrieved with the same score). We consider this tie-breaking strategy as an uncontrolled parameter influencing measure scores, and argue the case for fairer tie-breaking strategies. A study of 22 TREC editions reveals significant differences between the conventional, unfair TREC strategy and the fairer strategies we propose. This experimental result advocates using these fairer strategies when conducting evaluations.
Guillaume Cabanac, Gilles Hubert, Mohand Boughanem, Claude Chrisment
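
The effect described in the abstract above can be reproduced on a toy run: when several documents share the same retrieval score, the order chosen among them changes average precision. The sketch below compares two name-based orderings. It does not reproduce trec_eval's actual behaviour or the fairer strategies the authors propose; the run, the relevance set, and the two orderings are assumptions made for illustration.

```python
def average_precision(ranking, relevant):
    """Average precision of a ranked list against a set of relevant documents."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def rank(run, reverse_names):
    """Order by score (descending); break ties by document name.

    Python's sort is stable, so sorting by name first and by score second
    leaves tied documents in name order.
    """
    by_name = sorted(run, key=lambda pair: pair[0], reverse=reverse_names)
    by_score = sorted(by_name, key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in by_score]

# Three documents are tied at score 0.5; only docD among them is relevant.
run = [("docA", 0.9), ("docB", 0.5), ("docC", 0.5), ("docD", 0.5)]
relevant = {"docA", "docD"}

ap_names_ascending = average_precision(rank(run, reverse_names=False), relevant)
ap_names_descending = average_precision(rank(run, reverse_names=True), relevant)
```

The system output is identical in both cases, yet average precision moves from 0.75 to 1.0 purely because of how the tie is resolved, which is why the tie-breaking rule acts as an uncontrolled parameter.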

Automated Component–Level Evaluation: Present and Future

Automated component–level evaluation of information retrieval (IR) is the main focus of this paper. We present a review of the current state of web–based and component–level evaluation. Based on these systems, propositions are made for a comprehensive framework for web service–based component–level IR system evaluation. The advantages of such an approach are considered, as well as the requirements for implementing it. Acceptance of such systems by researchers who develop components and systems is crucial for having an impact and requires that a clear benefit is demonstrated.
Allan Hanbury, Henning Müller


The Four Ladies of Experimental Evaluation

The goal of the panel is to present some of the main lessons that we have learned in well over a decade of experimental evaluation and to promote discussion with respect to what the future objectives in this field should be.
Donna Harman, Noriko Kando, Mounia Lalmas, Carol Peters

A PROMISE for Experimental Evaluation

Participative Research labOratory for Multimedia and Multilingual Information Systems Evaluation (PROMISE) is a Network of Excellence, starting in conjunction with this first independent CLEF 2010 conference, and designed to support and develop the evaluation of multilingual and multimedia information access systems, largely through the activities taking place in the Cross-Language Evaluation Forum (CLEF) today, and taking them forward in important new ways.
PROMISE is coordinated by the University of Padua, and comprises 10 partners: the Swedish Institute for Computer Science, the University of Amsterdam, Sapienza University of Rome, University of Applied Sciences of Western Switzerland, the Information Retrieval Facility, the Zurich University of Applied Sciences, the Humboldt University of Berlin, the Evaluation and Language Resources Distribution Agency, and the Centre for the Evaluation of Language Communication Technologies.
The single most important step forward for multilingual and multimedia information access which PROMISE will work towards is to provide an open evaluation infrastructure in order to support automation and collaboration in the evaluation process.
Martin Braschler, Khalid Choukri, Nicola Ferro, Allan Hanbury, Jussi Karlgren, Henning Müller, Vivien Petras, Emanuele Pianta, Maarten de Rijke, Giuseppe Santucci

