
2008 | Book

Advances in Information Retrieval

30th European Conference on IR Research, ECIR 2008, Glasgow, UK, March 30-April 3, 2008. Proceedings

Edited by: Craig Macdonald, Iadh Ounis, Vassilis Plachouras, Ian Ruthven, Ryen W. White

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


About this book


This book constitutes the refereed proceedings of the 30th annual European Conference on Information Retrieval Research, ECIR 2008, held in Glasgow, UK, in March/April 2008.

The 33 revised full papers and 19 revised short papers presented together with the abstracts of 3 invited lectures and 32 poster papers were carefully reviewed and selected from 139 full article submissions. The papers are organized in topical sections on evaluation, Web IR, social media, cross-lingual information retrieval, theory, video, representation, Wikipedia and e-books, as well as expert search.

Table of Contents

Frontmatter

Invited Presentations

Some(What) Grand Challenges for Information Retrieval

Although we see the positive results of information retrieval research embodied throughout the Internet, on our computer desktops, and in many other aspects of daily life, at the same time we notice that people still have a wide variety of difficulties in finding information that is useful in resolving their problematic situations. This suggests that there still remain substantial challenges for research in IR. Already in 1988, on the occasion of receiving the ACM SIGIR Gerard Salton Award, Karen Spärck Jones suggested that substantial progress in information retrieval was likely only to come through addressing issues associated with users (actual or potential) of IR systems, rather than continuing IR research’s almost exclusive focus on document representation and matching and ranking techniques. In recent years it appears that her message has begun to be heard, yet we still have relatively few substantive results that respond to it. In this talk, I identify a few challenges for IR research which fall within the scope of association with users, and which I believe, if properly addressed, are likely to lead to substantial increases in the usefulness, usability and pleasurability of information retrieval.

Nicholas J. Belkin
Web Search: Challenges and Directions

These are exciting times for the field of Web search. Search engines are used by millions of people every day, and the number is growing rapidly. This growth poses unique challenges for search engines: they need to operate at unprecedented scales while satisfying an incredible diversity of information needs. Furthermore, user expectations have expanded considerably, moving from “give me what I said” to “give me what I want”. Finally, with the lure of billions of dollars of commerce guided by search engines, we have entered a new world of “Adversarial Information Retrieval”. This talk will show that the world of algorithm and system design for commercial search engines can be described by two of Murphy’s Laws: a) If anything can go wrong, it will; and b) Even if nothing can go wrong, it will anyway.

Amit Singhal
You Are a Document Too: Web Mining and IR for Next-Generation Information Literacy

Information retrieval and data mining often assume a simple world: There are people with information needs who search - and find - information in sources such as documents or databases. Hence, the user-oriented goals are (a) information literacy: the users’ ability to locate, evaluate, and use effectively the needed information, and (b) tools that obviate the need for some of the technical parts of this information literacy. Examples of such tools are search-engine interfaces that direct each user’s attention to only an individualised part of the “information overload” universe.

Bettina Berendt

Evaluation

Discounted Cumulated Gain Based Evaluation of Multiple-Query IR Sessions

IR research has a strong tradition of laboratory evaluation of systems. Such research is based on test collections, pre-defined test topics, and standard evaluation metrics. While recent research has emphasized the user viewpoint by proposing user-based metrics and non-binary relevance assessments, the methods are insufficient for truly user-based evaluation. The common assumption of a single query per topic and session poorly represents real life. On the other hand, one well-known metric for multiple queries per session, instance recall, does not capture early (within session) retrieval of (highly) relevant documents. We propose an extension to the Discounted Cumulated Gain (DCG) metric, the Session-based DCG (sDCG) metric for evaluation scenarios involving multiple query sessions, graded relevance assessments, and open-ended user effort including decisions to stop searching. The sDCG metric discounts relevant results from later queries within a session. We exemplify the sDCG metric with data from an interactive experiment, we discuss how the metric might be applied, and we present research questions for which the metric is helpful.

Kalervo Järvelin, Susan L. Price, Lois M. L. Delcambre, Marianne Lykke Nielsen
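
A rough sketch of how such a session-based discount can be computed, based on the description above (the discount bases b and bq, and the exact DCG variant, are assumptions here, not necessarily the authors' parameter choices):

import math

def dcg(gains, b=2):
    # Järvelin-Kekäläinen DCG: gains at ranks up to b are undiscounted,
    # later ranks are divided by log_b(rank).
    return sum(g / max(1.0, math.log(r, b)) for r, g in enumerate(gains, start=1))

def sdcg(session, b=2, bq=4):
    # session: one vector of graded relevance gains per query, in the order
    # the queries were issued. The q-th query's DCG is further discounted by
    # 1 / (1 + log_bq(q)), so relevant results found by later queries count less.
    return sum(dcg(gains, b) / (1.0 + math.log(q, bq))
               for q, gains in enumerate(session, start=1))

# Example: a session of three queries over graded judgments 0..3.
print(sdcg([[3, 1, 0], [0, 2], [1, 0, 0, 2]]))
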
Here or There
Preference Judgments for Relevance

Information retrieval systems have traditionally been evaluated over absolute judgments of relevance: each document is judged for relevance on its own, independent of other documents that may be on topic. We hypothesize that preference judgments of the form “document A is more relevant than document B” are easier for assessors to make than absolute judgments, and provide evidence for our hypothesis through a study with assessors. We then investigate methods to evaluate search engines using preference judgments. Furthermore, we show that by using inferences and clever selection of pairs to judge, we need not compare all pairs of documents in order to apply evaluation methods.

Ben Carterette, Paul N. Bennett, David Maxwell Chickering, Susan T. Dumais
Using Clicks as Implicit Judgments: Expectations Versus Observations

Clickthrough data has become increasingly popular as an implicit indicator of user feedback. Previous analysis has suggested that user click behaviour is subject to a quality bias—that is, users click at different rank positions when viewing effective search results than when viewing less effective search results. Based on this observation, it should be possible to use click data to infer the quality of the underlying search system. In this paper we carry out a user study to systematically investigate how click behaviour changes for different levels of search system effectiveness as measured by information retrieval performance metrics. Our results show that click behaviour does not vary systematically with the quality of search results. However, click behaviour does vary significantly between individual users, and between search topics. This suggests that using direct click behaviour—click rank and click frequency—to infer the quality of the underlying search system is problematic. Further analysis of our user click data indicates that the correspondence between clicks in a search result list and subsequent confirmation that the clicked resource is actually relevant is low. Using clicks as an implicit indication of relevance should therefore be done with caution.

Falk Scholer, Milad Shokouhi, Bodo Billerbeck, Andrew Turpin

Web IR

Clustering Template Based Web Documents

More and more documents on the World Wide Web are based on templates. On a technical level this causes those documents to have quite similar source code and DOM tree structures. Grouping together documents which are based on the same template is an important task for applications that analyse the template structure and need clean training data. This paper develops and compares several distance measures for clustering web documents according to their underlying templates. Combining those distance measures with different approaches for clustering, we show which combination of methods leads to the desired result.

Thomas Gottron
Effective Pre-retrieval Query Performance Prediction Using Similarity and Variability Evidence

Query performance prediction aims to estimate the quality of answers that a search system will return in response to a particular query. In this paper we propose a new family of pre-retrieval predictors based on information at both the collection and document level. Pre-retrieval predictors are important because they can be calculated from information that is available at indexing time; they are therefore more efficient than predictors that incorporate information obtained from actual search results. Experimental evaluation of our approach shows that the new predictors give more consistent performance than previously proposed pre-retrieval methods across a variety of data types and search tasks.

Ying Zhao, Falk Scholer, Yohannes Tsegay
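
The paper's own predictors combine collection- and document-level evidence; as a generic illustration of the pre-retrieval idea, here is a sketch of two classic statistics computable from the index alone (the dictionary-based index layout and function names are stand-ins, not the paper's predictors):

import math, statistics

def idf(term, df, num_docs):
    # Inverse document frequency from stored document-frequency counts.
    return math.log(num_docs / df[term])

def avg_idf(query, df, num_docs):
    # Mean IDF of the query terms: a cheap specificity-style predictor,
    # computable at indexing time with no retrieval run.
    vals = [idf(t, df, num_docs) for t in query if t in df]
    return statistics.mean(vals) if vals else 0.0

def idf_spread(query, df, num_docs):
    # Spread of the query terms' IDFs: a simple variability-style signal.
    vals = [idf(t, df, num_docs) for t in query if t in df]
    return statistics.pstdev(vals) if vals else 0.0
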
iCluster: A Self-organizing Overlay Network for P2P Information Retrieval

We present iCluster, a self-organizing peer-to-peer overlay network for supporting full-fledged information retrieval in a dynamic environment. iCluster works by organizing peers sharing common interests into clusters and by exploiting clustering information at query time for achieving low network traffic and high recall. We define the criteria for peer similarity and peer selection, and we present the protocols for organizing the peers into clusters and for searching within the clustered organization of peers. iCluster is evaluated on a realistic peer-to-peer environment using real-world data and queries. The results demonstrate significant performance improvements (in terms of clustering efficiency, communication load and retrieval accuracy) over a state-of-the-art peer-to-peer clustering method. Compared to exhaustive search by flooding, iCluster exchanged a small loss in retrieval accuracy for much less message flow.

Paraskevi Raftopoulou, Euripides G. M. Petrakis

Social Media

Labeling Categories and Relationships in an Evolving Social Network

Modeling and naming general entity-entity relationships is challenging in the construction of social networks. Given a seed denoting a person name, we utilize the Google search engine, an NER (Named Entity Recognizer) parser, and the CODC (Co-Occurrence Double Check) formula to construct an evolving social network. For each entity pair in the network, we try to label their categories and relationships. Firstly, we utilize the Open Directory Project (ODP) resource, which is the largest human-edited directory of the web, to build a directed graph, and then use three ranking algorithms, PageRank, HITS, and a Markov chain random process, to extract potential categories defined in the ODP. These categories capture the major contexts of the designated named entities. Finally, we combine the ranks of these categories and tf*idf scores of noun phrases to extract relationships. In our experiments, a total of 6 evolving social networks with 618 pairs of named entities demonstrate that the Markov chain random process is better than the other two algorithms.

Ming-Shun Lin, Hsin-Hsi Chen
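
Of the three ranking algorithms, PageRank is the easiest to sketch; a minimal power-iteration version over a directed category graph (the damping factor 0.85 is the conventional default, not necessarily the paper's setting):

def pagerank(graph, d=0.85, iters=50):
    # graph: node -> list of out-neighbours; every node appears as a key.
    # Dangling nodes simply leak their rank share, acceptable for a sketch.
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    for _ in range(iters):
        new = {v: (1.0 - d) / n for v in graph}
        for v, outs in graph.items():
            for u in outs:
                new[u] += d * rank[v] / len(outs)
        rank = new
    return rank
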
Automatic Construction of an Opinion-Term Vocabulary for Ad Hoc Retrieval

We present a method to automatically generate a term-opinion lexicon. We also weight these lexicon terms and use them in real time to boost the ranking of documents with opinionated content. We define very simple models both for opinion-term extraction and document ranking. Both the lexicon model and retrieval model are assessed. To evaluate the quality of the lexicon we compare performance with a well-established manually generated opinion-term dictionary. We evaluate the effectiveness of the term-opinion lexicon using the opinion task evaluation data of the TREC 2007 blog track.

Giambattista Amati, Edgardo Ambrosi, Marco Bianchi, Carlo Gaibisso, Giorgio Gambosi
A Comparison of Social Bookmarking with Traditional Search

Social bookmarking systems allow users to store links to internet resources on a web page. As social bookmarking systems are growing in popularity, search algorithms have been developed that transfer the idea of link-based rankings in the Web to a social bookmarking system’s data structure. These rankings differ from traditional search engine rankings in that they incorporate the rating of users.

In this study, we compare search in social bookmarking systems with traditional Web search. In the first part, we compare the user activity and behaviour in both kinds of systems, as well as the overlap of the underlying sets of URLs. In the second part, we compare graph-based and vector space rankings for social bookmarking systems with commercial search engine rankings.

Our experiments are performed on data of the social bookmarking system Del.icio.us and on rankings and log data from Google, MSN, and AOL. We will show that part of the difference between the systems is due to different behaviour (e.g. the concatenation of multi-word lexemes to single terms in Del.icio.us), and that real-world events may trigger similar behaviour in both kinds of systems. We will also show that a graph-based ranking approach on folksonomies yields results that are closer to the rankings of the commercial search engines than vector space retrieval, and that the correlation is high in particular for the domains that are well covered by the social bookmarking system.

Beate Krause, Andreas Hotho, Gerd Stumme

Cross-Lingual Information Retrieval

Effects of Aligned Corpus Quality and Size in Corpus-Based CLIR

Aligned corpora are often-used resources in CLIR systems. The three qualities of translation corpora that most dramatically affect the performance of a corpus-based CLIR system are: (1) topical nearness to the translated queries, (2) the quality of the alignments, and (3) the size of the corpus. In this paper, the effects of these factors are studied and evaluated. Topics of two different domains (news and genomics) are translated with corpora of varying alignment quality, ranging from a clean parallel corpus to noisier comparable corpora. Also, the sizes of the corpora are varied. The results show that of the three qualities, topical nearness is the most crucial factor, outweighing both other factors. This indicates that noisy comparable corpora should be used as complementary resources when parallel corpora are not available for the domain in question.

Tuomas Talvensaari
Exploring the Effects of Language Skills on Multilingual Web Search

Multilingual access is an important area of research, especially given the growth in multilingual users of online resources. A large body of research exists for Cross-Language Information Retrieval (CLIR); however, little of this work has considered the language skills of the end user, a critical factor in providing effective multilingual search functionality. In this paper we describe an experiment carried out to further understand the effects of language skills on multilingual search. Using the Google Translate service, we show that users have varied language skills that are non-trivial to assess and can impact their multilingual searching experience and search effectiveness.

Jennifer Marlow, Paul Clough, Juan Cigarrán Recuero, Javier Artiles
A Novel Implementation of the FITE-TRT Translation Method

Cross-language Information Retrieval requires good methods for translating cross-lingual spelling variants which are not covered by the available dictionary resources. FITE-TRT is an established method employing frequency-based identification of translation equivalents received from transformation rule based translation. This study further develops and evaluates the FITE-TRT method. The paper makes contributions in four areas. First, an efficient implementation of the FITE-TRT method is discussed. Second, a novel iterative FITE-TRT translation approach is developed in order to further improve the effectiveness of the method. Third, the effectiveness of FITE-TRT is assessed in three classes of source-target word similarity. FITE-TRT was found to be very strong in the class of the most similar source and target words and only became unsuccessful when the words were dissimilar. Fourth, in comparison to n-gram and s-gram matching methods, FITE-TRT is shown to be consistently stronger. All in all, FITE-TRT clearly outperforms the fuzzy string matching methods under comparable conditions. Therefore it is the method of choice for the identification of translation equivalents of cross-lingual spelling variants when the requirements for result quality are high.

Aki Loponen, Ari Pirkola, Kalervo Järvelin, Heikki Keskustalo

Theory I

The BNB Distribution for Text Modeling

We first review in this paper the burstiness and aftereffect of future sampling phenomena, and propose a formal, operational criterion to characterize distributions according to these phenomena. We then introduce the Beta negative binomial distribution for text modeling, and show its relations to several models (in particular to the Laplace law of succession and to the tf-itf model used in the Divergence from Randomness framework of [2]). We finally illustrate the behavior of this distribution on text categorization and information retrieval experiments.

Stéphane Clinchant, Eric Gaussier
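
For reference, the Beta negative binomial arises by letting the success probability of a negative binomial itself be Beta-distributed; its probability mass function in the standard parameterization (the paper's own notation may differ) is

$$P(X = k \mid r, \alpha, \beta) = \frac{\Gamma(r + k)}{k!\,\Gamma(r)} \cdot \frac{B(\alpha + r,\ \beta + k)}{B(\alpha, \beta)}, \qquad k = 0, 1, 2, \ldots$$
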
Utilizing Passage-Based Language Models for Document Retrieval

We show that several previously proposed passage-based document ranking principles, along with some new ones, can be derived from the same probabilistic model. We use language models to instantiate specific algorithms, and propose a passage language model that integrates information from the ambient document to an extent controlled by the estimated document homogeneity. Several document-homogeneity measures that we propose yield passage language models that are more effective than the standard passage model for basic document retrieval and for constructing and utilizing passage-based relevance models; the latter outperform a document-based relevance model. We also show that the homogeneity measures are effective means for integrating document-query and passage-query similarity information for document retrieval.

Michael Bendersky, Oren Kurland
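
Schematically, such a passage language model mixes the passage's own maximum-likelihood estimate with that of its ambient document, with the mixture weight tied to an estimated homogeneity h(d) in [0, 1] (a sketch consistent with the abstract; the concrete homogeneity measures are the paper's contribution):

$$p(w \mid g; d) = \bigl(1 - h(d)\bigr)\,\hat{p}_{\mathrm{MLE}}(w \mid g) + h(d)\,\hat{p}_{\mathrm{MLE}}(w \mid d)$$

where g is a passage of document d; the more homogeneous the document is estimated to be, the more the passage model borrows from it.
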
A Statistical View of Binned Retrieval Models

Many traditional information retrieval models, such as BM25 and language modeling, give good retrieval effectiveness, but can be difficult to implement efficiently. Recently, document-centric impact models were developed in order to overcome some of these efficiency issues. However, such models have a number of problems, including poor effectiveness and heuristic term weighting schemes. In this work, we present a statistical view of document-centric impact models. We describe how such models can be treated statistically and propose a supervised parameter estimation technique. We analyze various theoretical and practical aspects of the model and show that weights estimated using our new estimation technique are significantly better than the integer-based weights used in previous studies.

Donald Metzler, Trevor Strohman, W. Bruce Croft

Video

Video Corpus Annotation Using Active Learning

Concept indexing in multimedia libraries is very useful for users searching and browsing, but it is a very challenging research problem as well. Beyond the systems' implementation issues, semantic indexing is strongly dependent upon the size and quality of the training examples. In this paper, we describe the collaborative annotation system used to annotate the High Level Features (HLF) in the development set of TRECVID 2007. This system is web-based and takes advantage of an Active Learning approach. We show that Active Learning allows simultaneously getting the most useful information from the partial annotation and significantly reducing the annotation effort per participant relative to previous collaborative annotations.

Stéphane Ayache, Georges Quénot
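
The selection step of such an Active Learning loop can be sketched in a few lines of uncertainty sampling (all names here are placeholders; the paper's web-based collaborative machinery is beyond this sketch):

def next_batch(unlabeled, labeled, train, predict_proba, batch=100):
    # Train on the annotations collected so far, then hand annotators the
    # shots whose predicted concept probability is closest to 0.5, i.e. the
    # ones the current classifier is least certain about.
    model = train(labeled)
    return sorted(unlabeled,
                  key=lambda x: abs(predict_proba(model, x) - 0.5))[:batch]
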
Use of Implicit Graph for Recommending Relevant Videos: A Simulated Evaluation

In this paper, we propose a model for exploiting community-based usage information for video retrieval. Implicit usage information from a pool of past users could be a valuable source to address the difficulties caused by the semantic gap problem. We propose a graph-based implicit feedback model in which all the usage information can be represented. A number of recommendation algorithms were suggested and evaluated experimentally. A simulated user evaluation is conducted on the TRECVID collection and the results are presented. Analyzing the results we found some common characteristics of the best performing algorithms, which could indicate the best way of exploiting this type of usage information.

David Vallet, Frank Hopfgartner, Joemon Jose

Representation I

Using Terms from Citations for IR: Some First Results

We present the results of experiments using terms from citations for scientific literature search. To index a given document, we use terms used by citing documents to describe that document, in combination with terms from the document itself. We find that the combination of terms gives better retrieval performance than standard indexing of the document terms alone and present a brief analysis of our results. This paper marks the first experimental results from a new test collection of scientific papers, created by us in order to study citation-based methods for IR.

Anna Ritchie, Simone Teufel, Stephen Robertson
Automatic Extraction of Domain-Specific Stopwords from Labeled Documents

Automatic extraction of a domain-specific stopword list from a large labeled corpus is discussed. Most research removes stopwords using a standard stopword list, together with high and low document frequency thresholds. In this paper, a new approach for stopword extraction, based on the notion of backward filter-level performance and a sparsity measure of the training data, is proposed. First, we discuss the motivation for updating existing lists or building new ones. Second, based on the proposed backward filter-level performance, we examine the effectiveness of high document frequency filtering for stopword reduction. Finally, a new method for building general and domain-specific stopword lists is proposed. The method assumes that a set of candidate stopwords must have minimum information content and prediction capacity, which can be estimated by classifier performance. The proposed approach is extensively compared with other methods including inverse document frequency and information gain. According to the comparative study, the proposed approach offers more promising results, which guarantee minimum information loss by filtering out most stopwords.

Masoud Makrehchi, Mohamed S. Kamel

Wikipedia and E-Books

Book Search Experiments: Investigating IR Methods for the Indexing and Retrieval of Books

Through mass-digitization projects and with the use of OCR technologies, digitized books are becoming available on the Web and in digital libraries. The unprecedented scale of these efforts, the unique characteristics of the digitized material as well as the unexplored possibilities of user interactions make full-text book search an exciting area of information retrieval (IR) research. Emerging research questions include: How appropriate and effective are traditional IR models when applied to books? What book specific features (e.g., back-of-book index) should receive special attention during the indexing and retrieval processes? How can we tackle scalability? In order to answer such questions, we developed an experimental platform to facilitate rapid prototyping of a book search system as well as to support large-scale tests. Using this system, we performed experiments on a collection of 10 000 books, evaluating the efficiency of a novel multi-field inverted index and the effectiveness of the BM25F retrieval model adapted to books, using book-specific fields.

Hengzhi Wu, Gabriella Kazai, Michael Taylor
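
A compact sketch of BM25F as commonly formulated (field weights and per-field b parameters are tuning knobs; which book-specific fields to use is the paper's question, and this toy index layout is an assumption):

import math

def bm25f(query, doc, weight, b, avg_len, df, num_docs, k1=1.2):
    # doc: field -> {term: tf}; avg_len: field -> average field length.
    score = 0.0
    for term in query:
        # Pool a weighted, length-normalised tf across fields...
        pooled = 0.0
        for field, tfs in doc.items():
            length = sum(tfs.values())
            norm = 1.0 - b[field] + b[field] * length / avg_len[field]
            pooled += weight[field] * tfs.get(term, 0) / norm
        if pooled == 0.0 or term not in df:
            continue
        # ...then apply idf and the usual BM25 saturation once per term.
        idf = math.log((num_docs - df[term] + 0.5) / (df[term] + 0.5))
        score += idf * pooled / (k1 + pooled)
    return score
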
Using a Task-Based Approach in Evaluating the Usability of BoBIs in an E-book Environment

This paper reports on a usability evaluation of BoBIs (Back-of-the-book Indexes) as searching and browsing tools in an e-book environment. This study employed a task-based approach and within-subject design. The retrieval performance of a BoBI was compared with a ToC and Full-Text Search tool in terms of their respective effectiveness and efficiency for finding information in e-books. The results demonstrated that a BoBI was significantly more efficient (faster) and useful compared to a ToC or Full-Text Search tool for finding information in an e-book environment.

Noorhidawati Abdullah, Forbes Gibb
Exploiting Locality of Wikipedia Links in Entity Ranking

Information retrieval from web and XML document collections is ever more focused on returning entities instead of web pages or XML elements. There are many research fields involving named entities; one such field is known as entity ranking, where one goal is to rank entities in response to a query supported with a short list of entity examples. In this paper, we describe our approach to ranking entities from the Wikipedia XML document collection. Our approach utilises the known categories and the link structure of Wikipedia, and more importantly, exploits link co-occurrences to improve the effectiveness of entity ranking. Using the broad context of a full Wikipedia page as a baseline, we evaluate two different algorithms for identifying narrow contexts around the entity examples: one that uses predefined types of elements such as paragraphs, lists and tables; and another that dynamically identifies the contexts by utilising the underlying XML document structure. Our experiments demonstrate that the locality of Wikipedia links can be exploited to significantly improve the effectiveness of entity ranking.

Jovan Pehcevski, Anne-Marie Vercoustre, James A. Thom
The Importance of Link Evidence in Wikipedia

Wikipedia is one of the most popular information sources on the Web. The free encyclopedia is densely linked. The link structure in Wikipedia differs from the Web at large: internal links in Wikipedia are typically based on words naturally occurring in a page, and link to another semantically related entry. Our main aim is to find out if Wikipedia’s link structure can be exploited to improve ad hoc information retrieval. We first analyse the relation between Wikipedia links and the relevance of pages. We then experiment with the use of link evidence in the focused retrieval of Wikipedia content, based on the test collection of INEX 2006. Our main findings are: First, our analysis of the link structure reveals that the Wikipedia link structure is a (possibly weak) indicator of relevance. Second, our experiments on INEX ad hoc retrieval tasks reveal that if the link evidence is made sensitive to the local context we see a significant improvement of retrieval effectiveness. Hence, in contrast with earlier TREC experiments using crawled Web data, we have shown that Wikipedia’s link structure can help improve the effectiveness of ad hoc retrieval.

Jaap Kamps, Marijn Koolen

Expert Search

High Quality Expertise Evidence for Expert Search

In an Enterprise setting, an expert search system can assist users with their “expertise need” by suggesting people with relevant expertise to the topic of interest. These systems typically work by associating documentary evidence of expertise to each candidate expert, and then ranking the candidates by the extent to which the documents in their profile are about the query. There are three important factors that affect the retrieval performance of an expert search system - firstly, the selection of the candidate profiles (the documents associated with each candidate), secondly, how the topicality of the documents is measured, and thirdly how the evidence of expertise from the associated documents is combined. In this work, we investigate a new dimension to expert finding, namely whether some documents are better indicators of expertise than others in each candidate’s profile. We apply five techniques to identify the quality documents in candidate profiles, which are likely to be good indicators of expertise. The techniques applied include the identification of possible candidate homepages, and the clustering of the documents in each profile to determine the candidate’s main areas of expertise. The proposed approaches are evaluated on three expert search tasks from recent TREC Enterprise tracks, and we draw conclusions.

Craig Macdonald, David Hannah, Iadh Ounis
Associating People and Documents

Since the introduction of the Enterprise Track at TREC in 2005, the task of finding experts has generated a lot of interest within the research community. Numerous models have been proposed that rank candidates by their level of expertise with respect to some topic. Common to all approaches is a component that estimates the strength of the association between a document and a person. Forming such associations, then, is a key ingredient in expertise search models. In this paper we introduce and compare a number of methods for building document-people associations. Moreover, we make underlying assumptions explicit, and examine two in detail: (i) independence of candidates, and (ii) frequency is an indication of strength. We show that our refined ways of estimating the strength of associations between people and documents lead to significant improvements over the state-of-the-art in the end-to-end expert finding task.

Krisztian Balog, Maarten de Rijke
Modeling Documents as Mixtures of Persons for Expert Finding

In this paper we address the problem of searching for knowledgeable persons within the enterprise, known as the expert finding (or expert search) task. We present a probabilistic algorithm using the assumption that terms in documents are produced by people who are mentioned in them. We represent documents retrieved for a query as mixtures of candidate experts’ language models. Two methods for extracting personal language models are proposed, as well as a way of combining them with other evidence of expertise. Experiments conducted with the TREC Enterprise collection demonstrate the superiority of our approach in comparison with the best of the existing solutions.

Pavel Serdyukov, Djoerd Hiemstra
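
A generic member of this model family (a schematic reading of the abstract, not necessarily the authors' exact estimator) treats each retrieved document's language model as a mixture over the people mentioned in it:

$$p(t \mid d) = \sum_{ca \in d} p(ca \mid d)\; p(t \mid \theta_{ca})$$

Candidates ca are then ranked by how well their component language models θ_ca explain the query terms across the documents retrieved for the query.
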
Ranking Users for Intelligent Message Addressing

Finding persons who are knowledgeable on a given topic (i.e. Expert Search) has become an active area of recent research [1,2,3]. In this paper we investigate the related task of Intelligent Message Addressing, i.e., finding persons who are potential recipients of a message under composition given its current contents, its previously-specified recipients or a few initial letters of the intended recipient contact (intelligent auto-completion). We begin by providing quantitative evidence, from a very large corpus, of how frequently email users are subject to message addressing problems. We then propose several techniques for this task, including adaptations of well-known formal models of Expert Search. Surprisingly, a simple model based on the K-Nearest-Neighbors algorithm consistently outperformed all other methods. We also investigated combinations of the proposed methods using fusion techniques, which led to significant performance improvements over the baseline models. In auto-completion experiments, the proposed models also outperformed all standard baselines. Overall, the proposed techniques showed ranking performance of more than 0.5 in MRR over 5202 queries from 36 different email users, suggesting that intelligent message addressing can be a welcome addition to email.

Vitor R. Carvalho, William W. Cohen
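
The K-Nearest-Neighbors formulation that performed so well is easy to sketch: retrieve the k past messages most similar to the draft and rank recipients by the similarity mass of the neighbours they appear in (names, the similarity function and k are placeholders, not the paper's settings):

from collections import defaultdict

def suggest_recipients(draft, past, similarity, k=30):
    # past: list of (message_text, recipients) pairs from the user's history.
    neighbours = sorted(past, key=lambda m: similarity(draft, m[0]),
                        reverse=True)[:k]
    score = defaultdict(float)
    for text, recipients in neighbours:
        for person in recipients:
            score[person] += similarity(draft, text)
    return sorted(score, key=score.get, reverse=True)
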

Representation II

Facilitating Query Decomposition in Query Language Modeling by Association Rule Mining Using Multiple Sliding Windows

This paper presents a novel framework to further advance the recent trend of using query decomposition and high-order term relationships in query language modeling, which takes into account terms implicitly associated with different subsets of query terms. Existing approaches, most notably the language model based on the Information Flow method, are however unable to capture multiple levels of associations and also suffer from a high computational overhead. In this paper, we propose to compute association rules from pseudo feedback documents that are segmented into variable-length chunks via multiple sliding windows of different sizes. Extensive experiments have been conducted on various TREC collections and our approach significantly outperforms a baseline Query Likelihood language model, the Relevance Model and the Information Flow model.

Dawei Song, Qiang Huang, Stefan Rüger, Peter Bruza
Viewing Term Proximity from a Different Perspective

This paper extends the state-of-the-art probabilistic model BM25 to utilize term proximity from a new perspective. Most previous work only considers dependencies between pairs of terms, and regards phrases as additional independent evidence. It is difficult to estimate the importance of a phrase and its extra contribution to a relevance score, as the phrase actually overlaps with its component terms. This paper proposes a new approach. First, query terms are grouped locally into non-overlapping phrases that may contain one or more query terms. Second, these phrases are not scored independently but are instead treated as providing a context for the component query terms. The relevance contribution of a term occurrence is measured by how many query terms occur in the context phrase and how compact they are. Third, we replace term frequency by the accumulated relevance contribution. Consequently, term proximity is easily integrated into the probabilistic model. Experimental results on TREC-10 and TREC-11 collections show stable improvements in terms of average precision and significant improvements in precision at top ranks.

Ruihua Song, Michael J. Taylor, Ji-Rong Wen, Hsiao-Wuen Hon, Yong Yu
Extending Probabilistic Data Fusion Using Sliding Windows

Recent developments in the field of data fusion have seen a focus on techniques that use training queries to estimate the probability that various documents are relevant to a given query, and use that information to assign scores to those documents on which they are subsequently ranked. This paper introduces SlideFuse, which builds on these techniques by introducing a sliding window, in order to compensate for situations where little relevance information is available to aid in the estimation of probabilities.

SlideFuse is shown to perform favourably in comparison with CombMNZ, ProbFuse and SegFuse. CombMNZ is the standard baseline technique against which data fusion algorithms are compared whereas ProbFuse and SegFuse represent the state-of-the-art for probabilistic data fusion methods.

David Lillis, Fergus Toolan, Rem Collier, John Dunnion
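
For orientation, the CombMNZ baseline mentioned above sums each document's normalised scores over the input systems and multiplies by the number of systems that retrieved it (a standard formulation; score normalisation details vary between studies):

from collections import defaultdict

def comb_mnz(runs):
    # runs: one {doc: score} per input system, scores min-max normalised
    # to [0, 1] beforehand.
    total, hits = defaultdict(float), defaultdict(int)
    for run in runs:
        for doc, s in run.items():
            total[doc] += s
            hits[doc] += 1
    return sorted(total, key=lambda d: total[d] * hits[d], reverse=True)
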

Theory II

Semi-supervised Document Classification with a Mislabeling Error Model

This paper investigates a new extension of the Probabilistic Latent Semantic Analysis (PLSA) model [6] for text classification where the training set is partially labeled. The proposed approach iteratively labels the unlabeled documents and estimates the probabilities of its labeling errors. These probabilities are then taken into account in the estimation of the new model parameters before the next round. Our approach outperforms an earlier semi-supervised extension of PLSA introduced by [9] which is based on the use of fake labels. However, it maintains its simplicity and ability to solve multiclass problems. In addition, it gives valuable information about the most uncertain and difficult classes to label. We perform experiments over the 20Newsgroups, WebKB and Reuters document collections and show the effectiveness of our approach over two other semi-supervised algorithms applied to these text classification problems.

Anastasia Krithara, Massih R. Amini, Jean-Michel Renders, Cyril Goutte
Improving Term Frequency Normalization for Multi-topical Documents and Application to Language Modeling Approaches

Term frequency normalization is a serious issue since document lengths vary. Generally, documents become long for two different reasons: verbosity and multi-topicality. First, verbosity means that the same topic is repeatedly mentioned by terms related to it, so that term frequencies are higher than in a well-summarized document. Second, multi-topicality indicates that a document broadly discusses multiple topics rather than a single one. Although these document characteristics should be handled differently, all previous methods of term frequency normalization have ignored these differences and have used a simplified length-driven approach which decreases the term frequency by only the length of a document, causing an unreasonable penalization. To address this problem, we propose a novel TF normalization method which is a type of partially-axiomatic approach. We first formulate two formal constraints that the retrieval model should satisfy for verbose and for multi-topical documents, respectively. Then, we modify language modeling approaches to better satisfy these two constraints, and derive novel smoothing methods. Experimental results show that the proposed method significantly increases precision for keyword queries, and substantially improves MAP (Mean Average Precision) for verbose queries.

Seung-Hoon Na, In-Su Kang, Jong-Hyeok Lee
Probabilistic Document Length Priors for Language Models

This paper addresses the issue of devising a new document prior for the language modeling (LM) approach to Information Retrieval. The prior is based on term statistics, derived in a probabilistic fashion, and embodies a novel way of considering document length. Furthermore, we developed a new way of combining document length priors with the query likelihood estimation based on the risk of accepting the latter as a score. This prior has been combined with a document retrieval language model that uses Jelinek-Mercer (JM) smoothing, a technique which does not take into account document length. The combination with the prior boosts retrieval performance, so that it outperforms an LM with a document-length-dependent smoothing component (Dirichlet prior) and another state-of-the-art high-performing scoring function (BM25). Improvements are significant and robust across different collections and query sizes.

Roi Blanco, Alvaro Barreiro
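
Schematically, such a prior enters the standard query-likelihood ranking formula as an additive term alongside Jelinek-Mercer smoothing (a sketch of the general form; the paper's specific prior and its risk-based combination are its own contributions):

$$\log p(d \mid q) \stackrel{\mathrm{rank}}{=} \log p(d) + \sum_{t \in q} \log\bigl((1 - \lambda)\,\hat{p}(t \mid d) + \lambda\, p(t \mid C)\bigr)$$

where p(d) is the document prior, p̂(t|d) the document's maximum-likelihood term model, and p(t|C) the collection model.
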

Short Papers

Applying Maximum Entropy to Known-Item Email Retrieval

It is becoming increasingly common in information retrieval to combine evidence from multiple resources to compute the retrieval status value of documents. Although this has led to considerable improvements in several retrieval tasks, one of the outstanding issues is estimation of the respective weights that should be associated with the different sources of evidence. In this paper we propose to use maximum entropy in combination with the limited-memory L-BFGS algorithm to estimate feature weights. Examining the effectiveness of our approach with respect to the known-item finding task of the enterprise track of TREC shows that it significantly outperforms a standard retrieval baseline and leads to competitive performance.

Sirvan Yahyaei, Christof Monz
Computing Information Retrieval Performance Measures Efficiently in the Presence of Tied Scores

The Information Retrieval community uses a variety of performance measures to evaluate the effectiveness of scoring functions. In this paper, we show how to adapt six popular measures — precision, recall, F1, average precision, reciprocal rank, and normalized discounted cumulative gain — to cope with scoring functions that are likely to assign many tied scores to the results of a search. Tied scores impose only a partial ordering on the results, meaning that there are multiple possible orderings of the result set, each one performing differently. One approach to cope with ties would be to average the performance values across all possible result orderings; but unfortunately, generating result permutations requires super-exponential time. The approach presented in this paper computes precisely the same performance value as the approach of averaging over all permutations, but does so as efficiently as the original, tie-oblivious measures.

Frank McSherry, Marc Najork
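
To make the idea concrete for one measure: with results grouped by tied score, linearity of expectation gives the permutation-averaged precision@k directly, with no enumeration (a sketch in the spirit of the paper, for precision only; the paper covers six measures and its exact algorithms may differ):

def expected_precision_at_k(groups, k):
    # groups: (group_size, relevant_in_group) per tie group, in descending
    # score order. Within a group every ordering is equally likely, so the
    # expected number of relevant items among the first `take` drawn from
    # it is take * rel / size.
    expected_rel, seen = 0.0, 0
    for size, rel in groups:
        take = min(size, k - seen)
        if take <= 0:
            break
        expected_rel += take * rel / size
        seen += take
    return expected_rel / k

# Example: top tie group of 3 docs (1 relevant), next group of 2 (both relevant).
print(expected_precision_at_k([(3, 1), (2, 2)], 4))  # -> 0.5
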
Towards Characterization of Actor Evolution and Interactions in News Corpora

The natural way to model a news corpus is as a directed graph where stories are linked to one another through a variety of relationships. We formalize this notion by viewing each news story as a set of actors, and by viewing links between stories as transformations these actors go through. We propose and model a simple and comprehensive set of transformations: create, merge, split, continue, and cease. These transformations capture evolution of a single actor and interactions among multiple actors. We present algorithms to rank each transformation and show how ranking helps us to infer important relationships between actors and stories in a corpus. We demonstrate the effectiveness of our notions by experimenting on large news corpora.

Rohan Choudhary, Sameep Mehta, Amitabha Bagchi, Rahul Balakrishnan
The Impact of Semantic Class Identification and Semantic Role Labeling on Natural Language Answer Extraction

When a Question Answering (QA) system satisfies an information need, text understanding approaches can enhance the performance of final answer extraction. Exploiting the FrameNet lexical resource in this process motivates an analysis of the levels of semantic representation in the automated setting where the task of semantic class and role labeling takes place. In this paper, we analyze the impact of different levels of semantic parsing on answer extraction with respect to the individual sub-tasks of frame evocation and frame element assignment.

Bahadorreza Ofoghi, John Yearwood, Liping Ma
Improving Complex Interactive Question Answering with Wikipedia Anchor Text

When the objective of an information retrieval task is to return a nugget rather than a document, query terms that exist in a document will often not be used in the most relevant information nugget in the document. In this paper, a new method of query expansion is proposed based on the Wikipedia link structure surrounding the most relevant articles selected automatically. Evaluated with the Nuggeteer automatic scoring software, an increase in the F-scores is found from the TREC Complex Interactive Question Answering task when integrating this expansion into an already high-performing baseline system.

Ian MacKinnon, Olga Vechtomova
A Cluster-Sensitive Graph Model for Query-Oriented Multi-document Summarization

In this paper, we develop a novel cluster-sensitive graph model for query-oriented multi-document summarization. Upon it, an iterative algorithm, namely QoCsR, is built. As natural clusters exist in the graph when a document comprises a collection of sentences, we suggest distinguishing intra- and inter-document sentence relations in order to take into consideration the influence of cluster (i.e. document) global information on local sentence evaluation. In our model, five kinds of relations are involved among the three objects, i.e. document, sentence and query. Three of them are new and normally ignored in previous graph-based models. All these relations are then appropriately formulated in the QoCsR algorithm, though in different ways. ROUGE evaluations show that QoCsR can outperform the best DUC 2005 participating systems.

Furu Wei, Wenjie Li, Qin Lu, Yanxiang He
Evaluating Text Representations for Retrieval of the Best Group of Documents

Cluster retrieval assumes that the probability of relevance of a document should depend on the relevance of other similar documents to the same query. The goal is to find the best group of documents. Many studies have examined the effectiveness of this approach, by employing different retrieval methods or clustering algorithms, but few have investigated text representations. This paper revisits the problem of retrieving the best group of documents, from the language-modeling perspective. We analyze the advantages and disadvantages of a range of representation techniques, derive features that characterize the good document groups, and experiment with a new probabilistic representation as a first step toward incorporating these features. Empirical evaluation demonstrates that the relationship between documents can be leveraged in retrieval when a good representation technique is available, and that retrieving the best group of documents can be more effective than retrieving individual documents.

Xiaoyong Liu, W. Bruce Croft
Enhancing Relevance Models with Adaptive Passage Retrieval

Passage retrieval and pseudo relevance feedback/query expansion have been reported as two effective means for improving document retrieval in the literature. Relevance models, while improving retrieval in most cases, hurt performance on some heterogeneous collections. Previous research has shown that combining passage-level evidence with pseudo relevance feedback brings added benefits. In this paper, we study passage retrieval with relevance models in the language-modeling framework for document retrieval. An adaptive passage retrieval approach to document ranking is proposed, based on the best passage of a document given a query. The proposed passage ranking method is applied to two relevance-based language models: the Lavrenko-Croft relevance model and our robust relevance model. Experiments are carried out with three query sets on three different collections from TREC. Our experimental results show that combining adaptive passage retrieval with relevance models (particularly the robust relevance model) consistently outperforms solely applying relevance models to full-length document retrieval.

Xiaoyan Li, Zhigang Zhu
Ontology Matching Using Vector Space

Interoperability of heterogeneous systems on the Web will be achieved through an agreement between the underlying ontologies. Ontology matching is an operation that takes two ontologies and determines their semantic mapping. This paper presents a method of ontology matching which is based on modeling ontologies in a vector space and estimating their similarity degree by matching their concept vectors. The proposed method is successfully applied to the test suite of the Ontology Alignment Evaluation Initiative 2005 [10] and compared to the results reported by other methods. In terms of precision and recall, the results look promising.

Zahra Eidoon, Nasser Yazdani, Farhad Oroumchian
Accessibility in Information Retrieval

This paper introduces the concept of accessibility from the field of transportation planning and adopts it within the context of Information Retrieval (IR). An analogy is drawn between the fields, which motivates the development of document accessibility measures for IR systems. Considering the accessibility of documents within a collection given an IR System provides a different perspective on the analysis and evaluation of such systems which could be used to inform the design, tuning and management of current and future IR systems.

Leif Azzopardi, Vishwa Vinay
Semantic Relationships in Multi-modal Graphs for Automatic Image Annotation

It is important to integrate contextual information in order to improve the inaccurate results of current approaches for automatic image annotation. Graph-based representations allow incorporation of such information. However, their behaviour has not been studied in this context. We conduct extensive experiments to show the properties of such representations using semantic relationships as a type of contextual information. We also experiment with different similarity measures for semantic features and present the results.

Vassilios Stathopoulos, Jana Urban, Joemon Jose
Conversation Detection in Email Systems

This work explores a novel approach for conversation detection in email mailboxes. This approach clusters messages into coherent conversations by using a similarity function among messages that takes into consideration all relevant email attributes, such as message subject, participants, date of submission, and message content. The detection algorithm is evaluated against a manual partition of two email mailboxes into conversations. Experimental results demonstrate the superiority of our detection algorithm over several other alternative approaches.

Shai Erera, David Carmel
Efficient Multimedia Time Series Data Retrieval Under Uniform Scaling and Normalisation

As the world has shifted towards manipulation of information and its technology, we have been increasingly overwhelmed by the amount of available multimedia data, while having higher expectations to fully exploit the data at hand. One of the attempts to do so is to develop content-based multimedia information retrieval systems, which allow us to search intuitively by content; a classic example is a Query-by-Humming system. Nevertheless, typical content-based search for multimedia data usually requires a large amount of storage and is computationally intensive. Recently, time series representation has been successfully applied to a wide variety of research, including multimedia retrieval, due to the great reduction in time and space complexity. Moreover, an enhancement, Uniform Scaling, has been proposed and applied prior to distance calculation, and it has been demonstrated that Uniform Scaling can outperform Euclidean distance. This previous work on Uniform Scaling, nonetheless, overlooks the importance and effects of normalisation, which makes these frameworks impractical for real-world data. Therefore, in this paper, we justify the importance of normalisation for multimedia data and propose an efficient solution for searching multimedia time series data under Uniform Scaling and normalisation.

Waiyawuth Euachongprasit, Chotirat Ann Ratanamahatana
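
A brute-force sketch of Uniform Scaling matching with the normalisation the paper argues for (the paper's actual contribution is doing this efficiently, e.g. with lower bounds, which this version omits; the scaling range is an assumed example):

import math

def z_normalise(xs):
    # Subtract the mean and divide by the standard deviation.
    mu = sum(xs) / len(xs)
    sd = math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs)) or 1.0
    return [(x - mu) / sd for x in xs]

def uniform_scaling_distance(query, candidate, lo=0.8, hi=1.2):
    # query is assumed z-normalised; each uniformly rescaled candidate
    # prefix is z-normalised before comparison, so amplitude and offset
    # differences are factored out as the paper recommends.
    m, best = len(query), math.inf
    for n in range(max(1, int(m * lo)), min(int(m * hi), len(candidate)) + 1):
        prefix = [candidate[int(j * n / m)] for j in range(m)]
        best = min(best, math.dist(query, z_normalise(prefix)))
    return best
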
Integrating Structure and Meaning: A New Method for Encoding Structure for Text Classification

Current representation schemes for automatic text classification treat documents as syntactically unstructured collections of words or ‘concepts’. Past attempts to encode syntactic structure have treated part-of-speech information as another word-like feature, but have been shown to be less effective than non-structural approaches. We propose a new representation scheme using Holographic Reduced Representations (HRRs) as a technique to encode both semantic and syntactic structure. This method improves on previous attempts in the literature by encoding the structure across all features of the document vector while preserving text semantics. Our method does not increase the dimensionality of the document vectors, allowing for efficient computation and storage. We present classification results of our HRR text representations versus Bag-of-Concepts representations and show that our method of including structure improves text classification results.

Jonathan M. Fishbein, Chris Eliasmith
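
The core HRR operation is binding by circular convolution, which keeps dimensionality fixed; a minimal sketch (how roles and fillers are chosen to encode syntactic structure is the paper's contribution, not shown here):

import numpy as np

def bind(role, filler):
    # Circular convolution via FFT: binds two n-dimensional vectors into
    # one n-dimensional vector, so document vectors never grow.
    return np.real(np.fft.ifft(np.fft.fft(role) * np.fft.fft(filler)))

# Superpose several bound role-filler pairs by simple addition, e.g.:
# doc_vec = bind(noun_role, concept1) + bind(verb_role, concept2)
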
A Wikipedia-Based Multilingual Retrieval Model

This paper introduces CL-ESA, a new multilingual retrieval model for the analysis of cross-language similarity. The retrieval model exploits the multilingual alignment of Wikipedia: given a document d written in language L, we construct a concept vector for d, where each dimension i quantifies the similarity of d with respect to a document $d^*_i$ chosen from the “L-subset” of Wikipedia. Likewise, for a second document d′ written in language L′, $L\not=L'$, we construct a concept vector for d′, using from the L′-subset of Wikipedia the topic-aligned counterparts $d'^*_i$ of our previously chosen documents.

Since the two concept vectors are collection-relative representations of d and d′, they are language-independent, i.e. their similarity can be computed directly with the cosine similarity measure, for instance.

We present results of an extensive analysis that demonstrates the power of this new retrieval model: for a query document d, the topically most similar documents from a corpus in another language are properly ranked. A salient property of the new retrieval model is its robustness with respect to both the size and the quality of the index document collection.

Martin Potthast, Benno Stein, Maik Anderka
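
The resulting comparison is pleasantly simple; a sketch of the final step (the index documents and the monolingual similarity function stand for whatever CL-ESA is instantiated with; these helper names are illustrative):

import numpy as np

def concept_vector(doc, index_docs, similarity):
    # One dimension per index document from the Wikipedia subset in the
    # document's own language; topic-aligned subsets yield comparable
    # dimensions across languages.
    return np.array([similarity(doc, d) for d in index_docs])

def cross_lingual_similarity(v, w):
    # Concept vectors are collection-relative, hence language-independent:
    # plain cosine similarity suffices.
    return float(v @ w / (np.linalg.norm(v) * np.linalg.norm(w)))
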
Filaments of Meaning in Word Space

Word space models, in the sense of vector space models built on distributional data taken from texts, are used to model semantic relations between words. We argue that the high dimensionality of typical vector space models leads to unintuitive effects on modeling likeness of meaning, and that the local structure of word spaces is where interesting semantic relations reside. We show that the local structure of word spaces has substantially different dimensionality and character than the global space, and that this structure shows potential to be exploited for further semantic analysis using methods for local analysis of vector space structure rather than the globally scoped methods typically in use today, such as singular value decomposition or principal component analysis.

Jussi Karlgren, Anders Holst, Magnus Sahlgren
Finding the Best Picture: Cross-Media Retrieval of Content

We query the pictures of Yahoo! News for persons and objects by using the accompanying news captions as an indexing annotation. Our aim is to rank at the top of the answer list those pictures in which the sought persons or objects are most prominently present. We demonstrate that an appearance or content model based on syntactic, semantic and discourse analysis of the short news text is only useful for finding the best picture of a person or object if the database contains photos each picturing many entities. In other circumstances a simpler bag-of-nouns representation performs well. The appearance models are tested in a probabilistic ranking function.

Koen Deschacht, Marie-Francine Moens
Robust Query-Specific Pseudo Feedback Document Selection for Query Expansion

In document retrieval using pseudo relevance feedback, after initial ranking, a fixed number of top-ranked documents are selected as feedback to build a new expanded query model. However, very little attention has been paid to the intuitive but critical fact that the retrieval performance for different queries is sensitive to the selection of different numbers of feedback documents. In this paper, we explore two approaches to incorporate the factor of query-specific feedback document selection in an automatic way. The first is to determine the “optimal” number of feedback documents with respect to a query by adopting the clarity score and cumulative gain. The other approach is that, instead of capturing the optimal number, we aim to weaken the effect of the number of feedback documents, i.e., to improve the robustness of the pseudo relevance feedback process, by a mixture model. Our experimental results show that both approaches improve the overall retrieval performance.

Qiang Huang, Dawei Song, Stefan Rüger
Expert Search Evaluation by Supporting Documents

An expert search system assists users with their “expertise need” by suggesting people with relevant expertise to their query. Most systems work by ranking documents in response to the query, then ranking the candidates using information from this initial document ranking and known associations between documents and candidates. In this paper, we aim to determine whether we can approximate an evaluation of the expert search system using the underlying document ranking. We evaluate the accuracy of our document ranking evaluation by assessing how closely each measure correlates to the ground truth evaluation of the candidate ranking. Interestingly, we find that improving the underlying ranking of documents does not necessarily result in an improved candidate ranking.

Craig Macdonald, Iadh Ounis

Posters

Ranking Categories for Web Search

In the context of Web Search, clustering-based engines are emerging as an alternative to classical ones. In this paper we analyse different possible ranking algorithms for ordering clusters of documents within a search result. More specifically, we investigate approaches based on document rankings, on the similarities between the user query and the search results, on the quality of the produced clusters, as well as some document-independent approaches. Even though we use a topic-based hierarchy for categorizing the URLs, our metrics can be applied to other clusters as well. An empirical analysis with a group of 20 subjects showed that the average similarity between the user query and the documents within each category yields the best cluster ranking.

Gianluca Demartini, Paul-Alexandru Chirita, Ingo Brunkhorst, Wolfgang Nejdl
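
A minimal sketch of the winning criterion, ranking categories by the average similarity between the query and their documents; the cosine weighting over sparse term vectors is an assumption:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_categories(query_vec, categories):
    """Order categories by the average query-document similarity of
    the documents each category contains."""
    def avg_sim(docs):
        return sum(cosine(query_vec, d) for d in docs) / len(docs) if docs else 0.0
    return sorted(categories, key=lambda c: -avg_sim(categories[c]))
```
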
Key Design Issues with Visualising Images Using Google Earth

Using map visualisation tools and earth browsers to display images in a spatial context is integral to many photo-sharing sites and commercial image archives, yet little academic research has been conducted into the utility and functionality of such systems. In developing a prototype system to explore the use of Google Earth in the visualisation of news photos, we have elicited key design issues based on user evaluations of Panoramio and two custom-built spatio-temporal image browsing prototypes. We discuss the implications of these design issues, with particular emphasis on visualising news photos.

Paul Clough, Simon Read
Methods for Augmenting Semantic Models with Structural Information for Text Classification

Current representation schemes for automatic text classification treat documents as syntactically unstructured collections of words or ‘concepts’. Past attempts to encode syntactic structure have treated part-of-speech information as another word-like feature, but have been shown to be less effective than non-structural approaches. Here, we investigate three methods to augment semantic modelling with syntactic structure, which encode the structure across all features of the document vector while preserving text semantics. We present classification results for these methods versus the Bag-of-Concepts semantic modelling representation to determine which method best improves classification scores.

Jonathan M. Fishbein, Chris Eliasmith
Use of Temporal Expressions in Web Search

While trying to understand and characterize users' online behavior, the research community has paid little attention to the temporal dimension. This exploratory study uses two collections of web search queries to investigate temporal information needs. Using state-of-the-art information extraction techniques, we identify temporal expressions in these queries. We find that temporal expressions are rarely used (1.5% of queries) and, when used, they relate to current and past events. There are also specific topics where the use of temporal expressions is more visible.

Sérgio Nunes, Cristina Ribeiro, Gabriel David
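
A toy illustration of identifying temporal expressions in query logs; real studies use full temporal taggers (TIMEX-style annotation), and this pattern set is only indicative:

```python
import re

# Toy recogniser for a few temporal expression types. "may" is left
# out of the month list to avoid matching the modal verb.
TEMPORAL = re.compile(
    r"\b(19|20)\d{2}\b"                                    # years
    r"|\b(january|february|march|april|june|july|august"
    r"|september|october|november|december)\b"             # month names
    r"|\b(today|yesterday|tomorrow|last\s+(week|month|year))\b",
    re.IGNORECASE,
)

def temporal_share(queries):
    """Fraction of queries containing at least one temporal expression."""
    return sum(bool(TEMPORAL.search(q)) for q in queries) / len(queries)
```
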
Towards an Automatically Generated Music Information System Via Web Content Mining

This paper presents first steps towards building a music information system like last.fm, but with the major difference that the data is automatically retrieved from the WWW using web content mining techniques. We first review approaches to some major problems of music information retrieval (MIR), which are required to achieve the ultimate aim, and we illustrate how these approaches can be put together to create the automatically generated music information system (AGMIS). The problems addressed in this paper are similar and prototypical artist detection, album cover retrieval, band member and instrumentation detection, automatic tagging of artists, and browsing/exploring web pages related to a music artist. Finally, we elaborate on the currently ongoing work of evaluating the methods on a large dataset of more than 600,000 music artists and on a first prototypical implementation of AGMIS.

Markus Schedl, Peter Knees, Tim Pohle, Gerhard Widmer
Investigating the Effectiveness of Clickthrough Data for Document Reordering

User clicks—also known as clickthrough data—have been cited as an implicit form of relevance feedback. Previous work suggests that relative preferences between documents can be accurately derived from user clicks. In this paper, we analyze the impact of document reordering—based on clickthrough—on search effectiveness, measured using both TREC and user relevance judgments. We also propose new strategies for document reordering that can outperform current techniques. Preliminary results show that current reordering methods do not lead to consistent improvements of search quality, but may even lead to poorer results if not used with care.

Milad Shokouhi, Falk Scholer, Andrew Turpin
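
One classic way to derive the relative preferences this line of work builds on is the "clicked beats skipped-above" heuristic; a minimal sketch (the paper's reordering strategies are more involved):

```python
def click_preferences(ranking, clicked):
    """Derive relative relevance preferences from clickthrough data:
    a clicked document is preferred over every unclicked document
    ranked above it."""
    preferences = []
    for pos, doc in enumerate(ranking):
        if doc in clicked:
            preferences.extend((doc, skipped) for skipped in ranking[:pos]
                               if skipped not in clicked)
    return preferences
```
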
Analysis of Link Graph Compression Techniques

Links between documents have been shown to be useful in various Information Retrieval (IR) tasks; for example, Google has been telling us for many years that the PageRank authority measure is at the heart of its relevance calculations. To use such link analysis techniques in a search engine, special tools are required to store the link matrix of the document collection, due to the high number of links typically involved. This work is concerned with the application of compression to the link graph. We compare several techniques for compressing link graphs and evaluate them on speed and space metrics, using various standard IR test collections.

David Hannah, Craig Macdonald, Iadh Ounis
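
A minimal sketch of one standard link-graph compression technique of the kind compared here: sorting each adjacency list and variable-byte encoding the gaps between consecutive target ids:

```python
def vbyte_encode(numbers):
    """Variable-byte encode non-negative integers: 7 data bits per
    byte, least-significant group first, high bit set on the final byte."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)
    return bytes(out)

def compress_out_links(target_ids):
    """Compress one row of the link matrix: sort the target ids and
    encode the gaps between consecutive ids, which stay small for
    locally clustered web graphs."""
    ordered = sorted(target_ids)
    if not ordered:
        return b""
    gaps = [ordered[0]] + [b - a for a, b in zip(ordered, ordered[1:])]
    return vbyte_encode(gaps)
```
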
An Evaluation and Analysis of Incorporating Term Dependency for Ad-Hoc Retrieval

Although many retrieval models incorporating term dependency have been developed, it is still unclear whether term dependency information can consistently enhance retrieval performance for different queries. We present a novel model that captures the main components of a topic and the relationship between those components and the power of term dependency to improve retrieval performance. Experimental results demonstrate that the power of term dependency strongly depends on the relationship between these components. Without relevance information, the model is still useful by predicting the components based on global statistical information. We show the applicability of the model for adaptively incorporating term dependency for individual queries.

Hao Lang, Bin Wang, Gareth Jones, Jintao Li, Yang Xu
An Evaluation Measure for Distributed Information Retrieval Systems

This paper is concerned with the evaluation of distributed and peer-to-peer information retrieval systems. A new measure is introduced that compares results of a distributed retrieval system to those of a centralised system, fully exploiting the ranking of the latter as an indicator of gradual relevance. Problems with existing evaluation approaches are verified experimentally.

Hans Friedrich Witschel, Florian Holz, Gregor Heinrich, Sven Teresniak
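
One plausible instantiation of such a measure, treating the rank a document achieves in the centralised run as graded relevance inside a DCG-style comparison; the paper's exact formulation may differ:

```python
import math

def distributed_score(distributed_ranking, central_ranking, k=10):
    """Compare a distributed result list against a centralised
    reference run, using centralised rank as an indicator of
    gradual relevance."""
    gain = {doc: 1.0 / math.log2(rank + 2)
            for rank, doc in enumerate(central_ranking)}
    dcg = sum(gain.get(doc, 0.0) / math.log2(i + 2)
              for i, doc in enumerate(distributed_ranking[:k]))
    ideal = sum(g / math.log2(i + 2)
                for i, g in enumerate(sorted(gain.values(), reverse=True)[:k]))
    return dcg / ideal if ideal else 0.0
```
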
Optimizing Language Models for Polarity Classification

This paper investigates the use of various types of language models for polarity text classification, a subtask of opinion mining which deals with distinguishing between positive and negative opinions in natural language. We focus on the intrinsic benefit of different types of language models. That is, we try to find the optimal settings of a language model by examining different types of normalization, their interaction with smoothing, and the benefit of class-based modeling.

Michael Wiegand, Dietrich Klakow
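
A minimal baseline of the kind studied here: two class-conditional unigram language models with Laplace smoothing (the paper examines richer normalization and smoothing settings):

```python
import math
from collections import Counter

class UnigramPolarityLM:
    """Two class-conditional unigram language models; a document gets
    the label whose model assigns it the higher likelihood."""

    def __init__(self, pos_docs, neg_docs, alpha=1.0):
        self.alpha = alpha
        self.models = {"pos": Counter(), "neg": Counter()}
        for tokens in pos_docs:
            self.models["pos"].update(tokens)
        for tokens in neg_docs:
            self.models["neg"].update(tokens)
        self.vocab = set(self.models["pos"]) | set(self.models["neg"])

    def log_likelihood(self, tokens, label):
        model = self.models[label]
        denom = sum(model.values()) + self.alpha * len(self.vocab)
        return sum(math.log((model[t] + self.alpha) / denom) for t in tokens)

    def classify(self, tokens):
        return max(("pos", "neg"), key=lambda c: self.log_likelihood(tokens, c))
```
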
Improving Web Image Retrieval Using Image Annotations and Inference Network

Text-based retrieval approaches, which use web textual information to index and access images, are still widely employed by modern search engines due to their simplicity and effectiveness. However, page documents often include text irrelevant to image content, which is an obstacle to high-quality image retrieval. In this paper we propose a novel model that improves traditional text-based image retrieval by seamlessly integrating weighted image annotation keywords and web texts. Unlike traditional text-based image retrieval models, the proposed model retrieves and ranks images depending not only on the text of the web document but also on image annotations. To verify the proposed model, term-based queries are performed on three models; the results show that our model performs best.

Peng Huang, Jiajun Bu, Chun Chen, Guang Qiu
Slide-Film Interface: Overcoming Small Screen Limitations in Mobile Web Search

It is well known that, alongside search engine performance improvements and functionality enhancements, one of the determining factors in user acceptance of any search service is the interface. This factor is particularly important for mobile Web search, mostly due to the small-screen limitations of handheld devices. In this paper we propose a scroll-less mobile Web search interface to reduce the search effort that these limitations multiply, and discuss its potential advantages and drawbacks compared to the conventional interface.

Roman Y. Shtykh, Jian Chen, Qun Jin
A Document-Centered Approach to a Natural Language Music Search Engine

We propose a new approach to a music search engine that can be accessed via natural language queries. As with existing approaches, we try to gather as much contextual information as possible for individual pieces in a (possibly large) music collection by means of Web retrieval. While existing approaches use this textual information to construct representations of music pieces in a vector space model, in this paper, we propose a document-centered technique to retrieve music pieces relevant to arbitrary natural language queries. This technique improves the quality of the resulting document rankings substantially. We report on the current state of the research and discuss current limitations, as well as possible directions to overcome them.

Peter Knees, Tim Pohle, Markus Schedl, Dominik Schnitzer, Klaus Seyerlehner
Collaborative Topic Tracking in an Enterprise Environment

Business users in an enterprise need to keep track of relevant information available on the Web for strategic decisions like mergers and acquisitions. Traditionally this is done with standing queries or topic-based alert mechanisms. Much richer tracking is possible if users can initiate and share topics in a single place. In this paper we present an alternative model and prototype for tracking topics of interest based on continuous user collaboration.

Conny Franke, Omar Alonso
Graph-Based Profile Similarity Calculation Method and Evaluation

Collaborative Information Retrieval (CIR) is a new technique for addressing current problems of information retrieval systems. A CIR system registers previous user interactions in order to respond to subsequent user queries more effectively. However, the goals and characteristics of two users may differ; when they send the same query to a CIR system, they may be interested in two different lists of documents. To resolve this problem, we have developed a personalized CIR system, called PERCIRS, which is based on the similarity between two user profiles. In this paper, we propose a new method for user profile similarity calculation (UPSC). Finally, we introduce a mechanism for evaluating UPSC methods.

Hassan Naderi, Béatrice Rumpler
The Good, the Bad, the Difficult, and the Easy: Something Wrong with Information Retrieval Evaluation?

TREC-like evaluations do not consider topic ease and difficulty. However, it seems reasonable to reward good effectiveness on difficult topics more than good effectiveness on easy topics, and to penalize bad effectiveness on easy topics more than bad effectiveness on difficult topics. This paper shows how this approach leads to evaluation results that could be more reasonable, and that are different to some extent. I provide a general analysis of this issue, propose a novel framework, and experimentally validate a part of it.

Stefano Mizzaro
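
One simple instantiation of the idea, scoring each topic relative to its ease, estimated as the mean AP over all runs; this is a sketch of the general approach, not Mizzaro's exact framework:

```python
def difficulty_adjusted_score(system_ap, all_runs_ap):
    """Score a system per topic relative to topic ease: beating the
    average on a hard topic is rewarded, and underperforming on an
    easy topic is penalized."""
    deltas = []
    for topic, ap in system_ap.items():
        ease = sum(run[topic] for run in all_runs_ap) / len(all_runs_ap)
        deltas.append(ap - ease)
    return sum(deltas) / len(deltas)
```
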
Hybrid Method for Personalized Search in Digital Libraries

In this paper we present our work on personalized search in digital libraries. Search results can be re-ranked taking into account the specific information needs of different people. We study several methods for this purpose: a citation-based method, a content-based method, and a hybrid method. We conducted experiments to compare the performance of these methods. Experimental results show that our approaches are promising and applicable in digital libraries.

Thanh-Trung Van, Michel Beigbeder
Exploiting Session Context for Information Retrieval - A Comparative Study

Hard queries are known to benefit from relevance feedback provided by users. It is, however, also known that users are generally reluctant to provide feedback when searching for information. A natural resort not demanding any active user participation is to exploit implicit feedback from the previous user search behavior, i.e., from the context of the current search session. In this work, we present a comparative study on the performance of the three most prominent retrieval models, the vector-space, probabilistic, and language-model based retrieval frameworks, when additional session context is incorporated.

Gaurav Pandey, Julia Luxenburger
Structural Re-ranking with Cluster-Based Retrieval

Re-ranking (RR) and Cluster-based Retrieval (CR) have been contrasting methods for improving retrieval effectiveness using inter-document similarities: RR improves precision while CR improves recall, but not simultaneously. The improvement obtained through RR or CR may therefore differ according to whether a query is recall-deficient. Previous work overlooked this point and investigated the two approaches separately, yielding limited improvement. To capture the positive effects of both RR and CR, this paper proposes RCR, re-ranking with cluster-based retrieval, in which RR is applied to the results initially retrieved by CR. Experimental results show that RCR significantly improves the baseline, while CR or RR alone sometimes does not.

Seung-Hoon Na, In-Su Kang, Jong-Hyeok Lee
Automatic Vandalism Detection in Wikipedia

We present results of a new approach to detecting destructive article revisions, so-called vandalism, in Wikipedia. Vandalism detection is a one-class classification problem in which vandalism edits are the target to be identified among all revisions. Interestingly, vandalism detection has not been addressed in the Information Retrieval literature until now. In this paper we discuss the characteristics of vandalism as humans recognize it and develop features to cast vandalism detection as a machine learning task. We compiled a large number of vandalism edits into a corpus, which allows for the comparison of existing and new detection approaches. Using logistic regression, we achieve 83% precision at 77% recall with our model. Compared to the rule-based methods currently applied in Wikipedia, our approach increases the F-measure by 49% while being faster at the same time.

Martin Potthast, Benno Stein, Robert Gerling
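
A minimal sketch of the machine-learning formulation: hand-crafted revision features fed to a logistic regression classifier. The features below are illustrative stand-ins, not the paper's feature set:

```python
from sklearn.linear_model import LogisticRegression

def edit_features(old_text, new_text):
    """Illustrative features over a revision pair: size change, share
    of upper-case characters in the new text, and a flag for
    near-complete blanking of the article."""
    upper = sum(c.isupper() for c in new_text) / (len(new_text) or 1)
    blanked = 1.0 if len(new_text) < 0.1 * max(len(old_text), 1) else 0.0
    return [len(new_text) - len(old_text), upper, blanked]

def train_detector(revision_pairs, labels):
    """Fit a logistic regression vandalism classifier over features."""
    X = [edit_features(old, new) for old, new in revision_pairs]
    return LogisticRegression().fit(X, labels)
```
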
Evaluating Paragraph Retrieval for why-QA

We implemented a baseline approach to why-question answering based on paragraph retrieval. Our implementation incorporates the QAP ranking algorithm with the addition of a number of surface features (cue words and XML markup). With this baseline system, we obtain an accuracy-at-10 of 57.0% with an MRR of 0.31. Both the baseline and the proposed evaluation method are good starting points for the current research and for other researchers working on the problem of why-QA.

We also experimented with adding smart question analysis features to our baseline system (answer type and the informational value of the subject). This, however, did not give a significant improvement over the baseline. In the near future, we will investigate which other linguistic features can facilitate re-ranking in order to increase accuracy.

Suzan Verberne, Lou Boves, Nelleke Oostdijk, Peter-Arno Coppen
Revisit of Nearest Neighbor Test for Direct Evaluation of Inter-document Similarities

Recently, cluster-based retrieval has been successfully applied to improve retrieval effectiveness. At the core of cluster-based retrieval are inter-document similarities. Although inter-document similarities can be investigated independently of cluster-based retrieval and improved in various ways, their direct evaluation has not been seriously considered. Given the many cluster-based retrieval methods, such a direct evaluation can separate the study of inter-document similarities from that of cluster-based retrieval. To this end, this paper revisits Voorhees' nearest neighbor test as such a direct evaluation, focusing mainly on whether the test correlates with retrieval effectiveness. Experimental results consistently support the use of the nearest neighbor test. We conclude that improvements in retrieval effectiveness can be well predicted from direct evaluation, even without performing runs of cluster-based retrieval.

Seung-Hoon Na, In-Su Kang, Jong-Hyeok Lee
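
A minimal sketch of the nearest neighbor test, assuming a pluggable inter-document similarity function and a set of known co-relevant documents per topic:

```python
def nearest_neighbor_test(relevant, corpus, similarity, k=5):
    """For each relevant document, count how many of its k nearest
    neighbours (under the similarity being evaluated) are themselves
    relevant; a higher average indicates a better similarity."""
    hits = 0
    for doc in relevant:
        neighbours = sorted((other for other in corpus if other != doc),
                            key=lambda other: -similarity(doc, other))[:k]
        hits += sum(n in relevant for n in neighbours)
    return hits / (k * len(relevant))
```
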
A Comparison of Named Entity Patterns from a User Analysis and a System Analysis

This paper investigates the detection of named entity (NE) patterns by comparing the NE patterns resulting from a user analysis with those from a system analysis. Findings revealed differences between the NE patterns detected by the system and by users, something that may affect the performance of a TDT system based on NE detection.

Masnizah Mohd, Fabio Crestani, Ian Ruthven
Query-Based Inter-document Similarity Using Probabilistic Co-relevance Model

Inter-document similarity is the critical information that determines whether or not cluster-based retrieval improves the baseline. However, theoretical work on inter-document similarity has not been undertaken, even though such work could provide a principle for defining improved similarities in a well-motivated direction. This paper therefore starts by pursuing an ideal inter-document similarity that optimally satisfies the cluster hypothesis. We propose a probabilistic principle of inter-document similarity: the optimal similarity of two documents should be proportional to the probability that they are co-relevant to an arbitrary query. Based on this principle, the study of inter-document similarity is formulated as the problem of estimating the co-relevance model of documents. Furthermore, we find that the optimal inter-document similarity should be defined using queries, not terms, as its basic unit, namely a query-based similarity. We derive a novel query-based similarity from the co-relevance model without any heuristics. Experimental results show that the new query-based inter-document similarity significantly improves on the previously used term-based similarity in the context of Voorhees' evaluation measure.

Seung-Hoon Na, In-Su Kang, Jong-Hyeok Lee
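
The proposed principle translates directly into a sum over queries; a minimal sketch, where p_rel(q, d) stands for any estimate of per-query relevance (e.g., a normalised retrieval score), an assumption rather than the paper's exact estimator:

```python
def corelevance_similarity(d1, d2, queries, p_rel):
    """Query-based inter-document similarity: proportional to the
    probability that both documents are co-relevant to a random
    query, with queries (not terms) as the basic unit."""
    return sum(p_rel(q, d1) * p_rel(q, d2) for q in queries) / len(queries)
```
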
Using Coherence-Based Measures to Predict Query Difficulty

We investigate the potential of coherence-based scores to predict query difficulty. The coherence of a document set associated with each query word is used to capture the quality of a query topic aspect. A simple query coherence score, QC-1, is proposed that requires the average coherence contribution of individual query terms to be high. Two further query scores, QC-2 and QC-3, are developed by constraining QC-1 in order to capture the semantic similarity among query topic aspects. All three query coherence scores show the correlation with average precision necessary to make them good predictors of query difficulty. Simple and efficient, the measures require no training data and are competitive with language model-based clarity scores.

Jiyin He, Martha Larson, Maarten de Rijke
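
A minimal sketch of a coherence score and QC-1, assuming a pairwise similarity function; the threshold value is an assumption, not the paper's setting:

```python
from itertools import combinations

def coherence(docs, similarity, theta=0.3):
    """Coherence of a document set: the fraction of document pairs
    whose pairwise similarity exceeds a threshold theta."""
    pairs = list(combinations(docs, 2))
    if not pairs:
        return 0.0
    return sum(similarity(a, b) >= theta for a, b in pairs) / len(pairs)

def qc1(query_terms, docs_with_term, similarity):
    """QC-1: the average coherence contribution of individual query
    terms, each measured over the documents containing that term."""
    scores = [coherence(docs_with_term[t], similarity) for t in query_terms]
    return sum(scores) / len(scores)
```
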
Efficient Processing of Category-Restricted Queries for Web Directories

We show that a cluster-skipping inverted index (CS-IIS) is a practical and efficient file structure to support category-restricted queries for searching Web directories. The query processing strategy with CS-IIS improves CPU time efficiency without imposing any limitations on the directory size.

Ismail Sengor Altingovde, Fazli Can, Özgür Ulusoy
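
An in-memory analogue of the cluster-skipping idea (the real CS-IIS is an on-disk inverted file): postings are grouped by cluster so that runs outside the category restriction are never touched:

```python
def build_cs_postings(postings, doc_cluster):
    """Group one term's postings by cluster so that, for a
    category-restricted query, whole runs of postings belonging to
    other clusters can be skipped without being decoded."""
    runs = {}
    for doc_id, weight in postings:
        runs.setdefault(doc_cluster[doc_id], []).append((doc_id, weight))
    return runs  # cluster id -> contiguous run of postings

def restricted_postings(cs_postings, allowed_clusters):
    """Touch only the runs of clusters matching the restriction."""
    for cluster in allowed_clusters:
        yield from cs_postings.get(cluster, ())
```
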
Focused Browsing: Providing Topical Feedback for Link Selection in Hypertext Browsing

When deciding whether to navigate to a linked page, users of standard browsers of hypertextual documents returned by an information retrieval search engine are entirely reliant on the content of the anchortext associated with links and the surrounding text. This information is often insufficient for reliable decisions about whether to open a linked page, and users can find themselves following many links to pages which are not helpful, only to return to the previous page. We describe a prototype focused browsing application which provides feedback on the likely usefulness of each page linked from the current one, and a term cloud preview of the contents of each linked page. Results from an exploratory experiment suggest that users can find this useful in improving their search efficiency.

Gareth J. F. Jones, Quixiang Li
The Impact of Named Entity Normalization on Information Retrieval for Question Answering

In the named entity normalization task, a system identifies a canonical unambiguous referent for names like Bush or Alabama. Resolving the synonymy and ambiguity of such names can benefit end-to-end information access tasks. We evaluate two entity normalization methods based on Wikipedia in the context of both passage and document retrieval for question answering. We find that even a simple normalization method leads to improvements in early precision, both for document and passage retrieval. Moreover, better normalization results in better retrieval performance.

Mahboob Alam Khalid, Valentin Jijkoun, Maarten de Rijke
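
A minimal baseline in the spirit of Wikipedia-based normalization: follow redirects and resolve remaining ambiguity by the most frequent sense; this is an assumed simple method, not necessarily either of the two evaluated here:

```python
def normalize_entity(surface_name, redirects, disambiguation_targets):
    """Map a surface name to a canonical Wikipedia title: follow the
    redirect table, then resolve remaining ambiguity by the most
    frequent sense."""
    title = redirects.get(surface_name, surface_name)
    targets = disambiguation_targets.get(title)
    if targets:  # assumed pre-sorted by prior frequency of the sense
        title = targets[0]
    return title
```
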

Workshop Summaries

Efficiency Issues in Information Retrieval Workshop

Today’s technological advancements allow for vast amounts of information to be widely generated, disseminated, and stored. This exponentially increasing amount of information renders the retrieval of relevant information a necessary and cumbersome task. The field of Information Retrieval (IR) addresses this task by developing systems in an effective and efficient way. Specifically, IR effectiveness deals with retrieving the most relevant information to a user need, while IR efficiency deals with providing fast and ordered access to large amounts of information.

Roi Blanco, Fabrizio Silvestri
Exploiting Semantic Annotations in Information Retrieval

The goal of this workshop is to create a forum for researchers interested in the use of semantic annotations for information retrieval. By semantic annotations we refer to linguistic annotations (such as named entities, semantic classes, etc.) as well as user annotations such as microformats, RDF, tags, etc. The aim of this workshop is not semantic annotation itself, but rather the applications of semantic annotation to information retrieval tasks such as ad-hoc retrieval, classification, browsing, textual mining, summarization, question answering, etc.

Omar Alonso, Hugo Zaragoza
Workshop on Novel Methodologies for Evaluation in Information Retrieval

Objectives. Information retrieval is an empirical science; the field cannot move forward unless there are means of evaluating the innovations devised by researchers. However, the methodologies conceived in the early years of IR and still used in today's campaigns are starting to show their age, and new research is emerging to understand how to overcome the twin challenges of scale and diversity.

Mark Sanderson, Martin Braschler, Nicola Ferro, Julio Gonzalo

Tutorials

ECIR 2008 Tutorials
Advanced Language Modeling Approaches (Case Study: Expert Search)

This tutorial gives a clear and detailed overview of advanced language modeling approaches and tools, including the use of document priors, translation models, relevance models, parsimonious models and expectation maximization training. Expert search will be used as a case study to explain the consequences of modeling assumptions. For more details, you can access http://www.cs.utwente.nl/~hiemstra/ecir2008.

Djoerd Hiemstra is assistant professor at the University of Twente. He wrote a Ph.D. thesis on language models for information retrieval and contributed to over 90 research papers in the field of IR. His research interests include formal models of information retrieval, XML retrieval and multimedia retrieval.

Search and Discovery in User-Generated Text Content

Maarten de Rijke and Wouter Weerkamp

ISLA, University of Amsterdam, The Netherlands

We increasingly live our lives online: blogs, forums, commenting tools, and many other sharing sites offer users the possibility to make any information available online. For the first time in history, we are able to collect huge amounts of user-generated content (UGC) within “a blink of an eye”. The rapidly increasing amount of UGC poses challenges to the IR community, but also offers many previously unthinkable possibilities. In this tutorial we discuss different aspects of accessing (i.e., searching, tracking, and analyzing) UGC. Our focus will be on textual content, and most of the methods that we will consider for ranking UGC (by relevance, quality, opinionatedness) are based on language modeling. For more details, you can access http://ecir2008.dcs.gla.ac.uk/tutorial_sd.html.

Maarten de Rijke is professor of information processing and internet at the Intelligent Systems Lab Amsterdam (ISLA) of the University of Amsterdam. His group has been researching search and discovery tools for UGC for a number of years now, with numerous publications and various demonstrators as tangible outcomes. Wouter Weerkamp is a PhD student at ISLA, working on language modeling and intelligent access to UGC.

Researching and Building IR Applications Using Terrier

Craig Macdonald and Ben He

University of Glasgow, UK

This tutorial introduces the main design of an IR system, and uses the Terrier platform as an example of how one should be built. We detail the architecture and data structures of Terrier, as well as the weighting models included, and describe, with examples, how Terrier can be used to perform experiments and extended to facilitate new research and applications. For more details, you can access http://ecir2008.dcs.gla.ac.uk/tutorial_rb.html.

Craig Macdonald is a PhD research student at the University of Glasgow. His research interests include Information Retrieval in Enterprise, Web and Blog settings, and he has over 20 publications based on the Terrier platform. He has been a co-ordinator of the Blog track at TREC since 2006, and is a developer of the Terrier platform.

Ben He is a post-doctoral research assistant at the University of Glasgow. His research interests centre on document weighting models, particularly document length normalisation and query expansion. He has been a developer of the Terrier platform since its initial development and has more than 20 publications based on Terrier.

Djoerd Hiemstra
Backmatter
Metadata
Title
Advances in Information Retrieval
edited by
Craig Macdonald
Iadh Ounis
Vassilis Plachouras
Ian Ruthven
Ryen W. White
Copyright year
2008
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-78646-7
Print ISBN
978-3-540-78645-0
DOI
https://doi.org/10.1007/978-3-540-78646-7
