
2009 | Book

Advances in Information Retrieval Theory

Second International Conference on the Theory of Information Retrieval, ICTIR 2009, Cambridge, UK, September 10-12, 2009, Proceedings

Edited by: Leif Azzopardi, Gabriella Kazai, Stephen Robertson, Stefan Rüger, Milad Shokouhi, Dawei Song, Emine Yilmaz

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


About this book

This book constitutes the refereed proceedings of the Second International Conference on the Theory of Information Retrieval, ICTIR 2009, held in Cambridge, UK, in September 2009. The 18 revised full papers, 14 short papers, and 11 posters presented together with one invited talk were carefully reviewed and selected from 82 submissions. The papers are categorized into four main themes: novel IR models, evaluation, efficiency, and new perspectives in IR. Twenty-one papers fall into the general theme of novel IR models, ranging from various retrieval models, query and term selection models, Web IR models, developments in novelty and diversity, to the modeling of user aspects. There are four papers on new evaluation methodologies, e.g., modeling score distributions, evaluation over sessions, and an axiomatic framework for XML retrieval evaluation. Three papers focus on the issue of efficiency and offer solutions to improve the tractability of PageRank, data cleansing practices for training classifiers, and approximate search for distributed IR. Finally, four papers look into new perspectives of IR and shed light on some new emerging areas of interest, such as the application and adoption of quantum theory in IR.

Table of Contents

Frontmatter

Invited Talk

Is There Something Quantum-Like about the Human Mental Lexicon?

This talk proceeds from the premise that IR should engage in a more substantial dialogue with cognitive science. After all, how users decide relevance, or how they choose terms to modify a query, are processes rooted in human cognition. Recently, there has been a growing literature applying quantum theory (QT) to model cognitive phenomena. This talk will survey recent research, in particular, modelling interference effects in human decision making. One aspect of QT will be illustrated - how quantum entanglement can be used to model word associations in human memory. The implications of this will be briefly discussed in terms of a new approach for modelling concept combinations. Tentative links to human abductive reasoning will also be drawn. The basic theme behind this talk is that QT can potentially provide a new genre of information processing models (including search) more aligned with human cognition.

Peter Bruza

Regular Papers

Efficiency

Probably Approximately Correct Search

We consider the problem of searching a document collection using a set of independent computers. That is, the computers do not cooperate with one another either (i) to acquire their local index of documents or (ii) during the retrieval of a document. During the acquisition phase, each computer is assumed to randomly sample a subset of the entire collection. During retrieval, the query is issued to a random subset of computers, each of which returns its results to the query-issuer, who consolidates the results. We examine how the number of computers, and the fraction of the collection that each computer indexes, affects performance in comparison to a traditional deterministic configuration. We provide analytic formulae that, given the number of computers and the fraction of the collection each computer indexes, provide the probability of an approximately correct search, where a “correct search” is defined to be the result of a deterministic search on the entire collection. We show that the randomized distributed search algorithm can have acceptable performance under a range of parameter settings. Simulation results confirm our analysis.
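For the special case of the top-ranked document, the coverage probability has a simple closed form: with k queried machines, each independently indexing a fraction f of the collection, the chance that at least one queried machine holds the deterministic top result is 1 - (1 - f)^k. A Monte-Carlo sketch of this special case (an illustration under assumed parameter values, not the authors' full analysis):

```python
import random

def pac_search_prob(frac=0.1, n_queried=10, trials=5000, seed=0):
    """Monte-Carlo estimate of the chance that the top-ranked document of a
    full deterministic search is held by at least one queried machine, when
    each machine independently indexes a random fraction of the collection."""
    rng = random.Random(seed)
    hits = sum(
        any(rng.random() < frac for _ in range(n_queried))
        for _ in range(trials)
    )
    return hits / trials

p_sim = pac_search_prob()
p_analytic = 1 - (1 - 0.1) ** 10   # 1 - (1 - f)^k
```

The simulated estimate converges to the closed form as the number of trials grows.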

Ingemar J. Cox, Ruoxun Fu, Lars Kai Hansen
PageRank: Splitting Homogeneous Singular Linear Systems of Index One

The PageRank algorithm is used today within web information retrieval to provide a content-neutral ranking metric over web pages. It employs power method iterations to solve for the steady-state vector of a discrete-time Markov chain (DTMC). The defining one-step probability transition matrix of this DTMC is derived from the hyperlink structure of the web and a model of web surfing behaviour which accounts for user bookmarks and memorised URLs.

In this paper we look to provide a more accessible, more broadly applicable explanation than has been given in the literature of how to make PageRank calculation more tractable through removal of the dangling-page matrix. This allows web pages without outgoing links to be removed before we employ power method iterations. It also allows decomposition of the problem according to irreducible subcomponents of the original transition matrix. Our explanation also covers a PageRank extension to accommodate TrustRank. In setting out our alternative explanation, we introduce and apply a general linear algebraic theorem which allows us to map homogeneous singular linear systems of index one to inhomogeneous non-singular linear systems with a shared solution vector. As an aside, we show in this paper that irreducibility is not required for PageRank to be well-defined.
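As background, the power method the abstract refers to can be sketched in a few lines. Spreading the mass of dangling pages (those with no outlinks) uniformly is one common treatment of the dangling-page matrix; this is a generic illustration, not the splitting the paper derives:

```python
def pagerank(links, n, d=0.85, iters=100):
    """Power-method PageRank on a graph given as {page: [outlinks]}.
    Dangling pages (no outlinks) spread their rank mass uniformly."""
    r = [1.0 / n] * n
    for _ in range(iters):
        new = [(1 - d) / n] * n       # teleportation term
        dangling = 0.0
        for p in range(n):
            out = links.get(p, [])
            if not out:
                dangling += r[p]      # collect dangling mass
            else:
                share = r[p] / len(out)
                for q in out:
                    new[q] += d * share
        for q in range(n):
            new[q] += d * dangling / n  # redistribute dangling mass
        r = new
    return r

# toy graph: 0 -> {1, 2}, 1 -> {2}, 2 -> {0}
ranks = pagerank({0: [1, 2], 1: [2], 2: [0]}, 3)
```

The returned vector is a probability distribution; page 2, which receives links from both other pages, ends up ranked highest.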

Douglas V. de Jager, Jeremy T. Bradley
Training Data Cleaning for Text Classification

In text classification (TC) and other tasks involving supervised learning, labelled data may be scarce or expensive to obtain; strategies are thus needed for maximizing the effectiveness of the resulting classifiers while minimizing the required amount of training effort. Training data cleaning (TDC) consists of devising ranking functions that sort the original training examples in terms of how likely it is that the human annotator has misclassified them, thereby providing a convenient means for the human annotator to revise the training set so as to improve its quality. Working in the context of boosting-based learning methods, we present three different techniques for performing TDC and, on two widely used TC benchmarks, evaluate them by their capability of spotting misclassified texts purposefully inserted in the training set.

Andrea Esuli, Fabrizio Sebastiani

Retrieval Models

Semi-parametric and Non-parametric Term Weighting for Information Retrieval

Most of the previous research on term weighting for information retrieval has focused on developing specialized parametric term weighting functions. Examples include TF.IDF vector-space formulations, BM25, and language modeling weighting. Each of these term weighting functions takes on a specific parametric form. While these weighting functions have proven to be highly effective, they impose strict constraints on the functional form of the term weights. Such constraints may possibly degrade retrieval effectiveness. In this paper we propose two new classes of term weighting schemes that we call semi-parametric and non-parametric weighting. These weighting schemes make fewer assumptions about the underlying term weights and allow the data to speak for itself. We argue that these robust weighting schemes have the potential to be significantly more effective compared to existing parametric schemes, especially with the growing amount of training data becoming available.

Donald Metzler, Hugo Zaragoza
Bridging Language Modeling and Divergence from Randomness Models: A Log-Logistic Model for IR

We are interested in this paper in revisiting the Divergence from Randomness (DFR) approach to Information Retrieval (IR), so as to better understand the different contributions it relies on, and thus be able to simplify it. To do so, we first introduce an analytical characterization of heuristic retrieval constraints and review several DFR models wrt this characterization. This review shows that the first normalization principle of DFR is necessary to make the model compliant with retrieval constraints. We then show that the log-logistic distribution can be used to derive a simplified DFR model. Interestingly, this simplified model contains Language Models (LM) with Jelinek-Mercer smoothing. The relation we propose here is, to our knowledge, the first connection between the DFR and LM approaches. Lastly, we present experimental results obtained on several standard collections which validate the simplification and the model we propose.

Stéphane Clinchant, Eric Gaussier
Ordinal Regression Based Model for Personalized Information Retrieval

Retrieving relevant items as a response to a user query is the aim of each information retrieval system. But ‘without an understanding of what relevance means to users, it is difficult to imagine how a system can retrieve relevant information for users’ [1]. In this paper, we try to capture what relevance is for a particular user and model his profile implicitly, considering his non-declared preferences that are inferred from a ranking of a reduced set of retrieved documents that he produces. We propose an ordinal regression based model for interactive ranking which uses both the information given by this subjective ranking and the multicriteria evaluation of these ranked documents to optimally adjust the parameters of a ranking model. This model consists of a set of additive value functions which are built so that they are as compatible as possible with the subjective ranking. The preference information used in our model requires reasonable cognitive effort from the user.

Mohamed Farah
Navigating in the Dark: Modeling Uncertainty in Ad Hoc Retrieval Using Multiple Relevance Models

We develop a novel probabilistic approach to ad hoc retrieval that explicitly addresses the uncertainty about the information need underlying a given query. In doing so, we account for the special role of the corpus in the retrieval process. The derived retrieval method integrates multiple relevance models by using estimates of their faithfulness to the presumed information need. Empirical evaluation demonstrates the performance merits of the proposed approach.

Natali Soskin, Oren Kurland, Carmel Domshlak

Query and Term Models

A Belief Model of Query Difficulty That Uses Subjective Logic

The difficulty of a user query can affect the performance of Information Retrieval (IR) systems. This work presents a formal model for quantifying and reasoning about query difficulty as follows: Query difficulty is considered to be a subjective belief, which is formulated on the basis of various types of evidence. This allows us to define a belief model and a set of operators for combining evidence of query difficulty. The belief model uses subjective logic, a type of probabilistic logic for modeling uncertainties. An application of this model with semantic and pragmatic evidence about 150 TREC queries illustrates the potential flexibility of this framework in expressing and combining evidence. To our knowledge, this is the first application of subjective logic to IR.

Christina Lioma, Roi Blanco, Raquel Mochales Palau, Marie-Francine Moens
“A term is known by the company it keeps”: On Selecting a Good Expansion Set in Pseudo-Relevance Feedback

It is well known that pseudo-relevance feedback (PRF) improves the retrieval performance of Information Retrieval (IR) systems in general. However, a recent study by Cao et al. [3] has shown that a non-negligible fraction of expansion terms used by PRF algorithms are harmful to retrieval. In other words, a PRF algorithm would be better off if it were to use only a subset of the feedback terms. The challenge then is to find a good expansion set from the set of all candidate expansion terms. A natural approach to the problem is to make a term independence assumption and use one or more term selection criteria or a statistical classifier to identify good expansion terms independently of each other. In this work, we challenge this approach and show empirically that a feedback term is in general neither good nor bad in itself; the behavior of a term depends very much on other expansion terms. Our finding implies that a good expansion set cannot, in general, be found under a term independence assumption. As a principled solution to the problem, we propose spectral partitioning of expansion terms using a specific term-term interaction matrix. We demonstrate on several test collections that expansion terms can be partitioned into two sets, and that the better of the two sets gives substantial improvements in retrieval performance over model-based feedback.

Raghavendra Udupa, Abhijit Bhole, Pushpak Bhattacharyya
An Effective Approach to Verbose Queries Using a Limited Dependencies Language Model

Intuitively, any ‘bag of words’ approach in IR should benefit from taking term dependencies into account. Unfortunately, for years the results of exploiting such dependencies have been mixed or inconclusive. To improve the situation, this paper shows how the natural language properties of the target documents can be used to transform and enrich the term dependencies into more useful statistics. This is done in three steps. The term co-occurrence statistics of queries and documents are each represented by a Markov chain. The paper proves that such a chain is ergodic, and therefore its asymptotic behavior is unique, stationary, and independent of the initial state. Next, the stationary distribution is taken to model queries and documents, rather than their initial distributions. Finally, ranking is achieved following the customary language modeling paradigm. The main contribution of this paper is to argue why the asymptotic behavior of the document model is a better representation than just the document’s initial distribution. A secondary contribution is to investigate the practical application of this representation in case the queries become increasingly verbose. In the experiments (based on Lemur’s search engine substrate) the default query model was replaced by the stable distribution of the query. Just modeling the query this way already resulted in significant improvements over a standard language model baseline. The results were on a par with, or better than, more sophisticated algorithms that use fine-tuned parameters or extensive training. Moreover, the more verbose the query, the more effective the approach seems to become.
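The chain-plus-stationary-distribution idea can be illustrated with a toy term-term co-occurrence matrix; the smoothing constant `alpha` below is my own assumption, added only so the sketch's chain is guaranteed ergodic:

```python
def stationary_query_model(cooc, alpha=0.1, iters=200):
    """Stationary distribution of an ergodic Markov chain over query terms.
    `cooc` is a square term-term co-occurrence count matrix (list of lists);
    uniform smoothing by `alpha` makes every state reachable (ergodicity)."""
    n = len(cooc)
    # smoothed, row-stochastic transition matrix
    P = []
    for row in cooc:
        s = [x + alpha for x in row]
        tot = sum(s)
        P.append([x / tot for x in s])
    # power iteration from the uniform distribution
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

# three query terms; term 1 co-occurs most strongly with the others
pi = stationary_query_model([[4, 2, 0], [2, 6, 1], [0, 1, 3]])
```

The resulting distribution `pi` can then stand in for the raw query term distribution in a standard language-modeling ranker; in this toy matrix the best-connected term receives the most stationary mass.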

Eduard Hoenkamp, Peter Bruza, Dawei Song, Qiang Huang
Time-Sensitive Language Modelling for Online Term Recurrence Prediction

We address the problem of online term recurrence prediction: for a stream of terms, at each time point predict what term is going to recur next in the stream given the term occurrence history so far. It has many applications, for example, in Web search and social tagging. In this paper, we propose a time-sensitive language modelling approach to this problem that effectively combines term frequency and term recency information, and describe how this approach can be implemented efficiently by an online learning algorithm. Our experiments on a real-world Web query log dataset show significant improvements over standard language modelling.

Dell Zhang, Jinsong Lu, Robert Mao, Jian-Yun Nie

Evaluation

Score Distributions in Information Retrieval

We review the history of modeling score distributions, focusing on the mixture of normal-exponential by investigating the theoretical as well as the empirical evidence supporting its use. We discuss previously suggested conditions which valid binary mixture models should satisfy, such as the Recall-Fallout Convexity Hypothesis, and formulate two new hypotheses considering the component distributions under some limiting conditions of parameter values. From all the mixtures suggested in the past, the current theoretical argument points to the two-gamma as the most likely universal model, with the normal-exponential being a usable approximation. Beyond the theoretical contribution, we provide new experimental evidence showing vector space or geometric models, and BM25, as being “friendly” to the normal-exponential, and that the non-convexity problem that the mixture possesses is practically not severe.
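For concreteness, under the normal-exponential mixture discussed above (non-relevant scores exponential, relevant scores normal), the posterior probability that a document with a given score is relevant follows from Bayes' rule. The parameter values below are illustrative assumptions, not fitted values from the paper:

```python
import math

def posterior_relevant(score, lam, mu, sigma, p_rel):
    """P(relevant | score) under the normal-exponential mixture:
    non-relevant scores ~ Exponential(lam), relevant ~ Normal(mu, sigma),
    with prior probability of relevance p_rel."""
    f_rel = math.exp(-0.5 * ((score - mu) / sigma) ** 2) \
            / (sigma * math.sqrt(2 * math.pi))
    f_non = lam * math.exp(-lam * score) if score >= 0 else 0.0
    num = p_rel * f_rel
    return num / (num + (1 - p_rel) * f_non)

# under typical parameters, higher scores are far more likely relevant
lo = posterior_relevant(0.5, lam=2.0, mu=3.0, sigma=1.0, p_rel=0.1)
hi = posterior_relevant(4.0, lam=2.0, mu=3.0, sigma=1.0, p_rel=0.1)
```

Such posteriors are what make fitted score distributions useful downstream, e.g. for thresholding or fusion.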

Avi Arampatzis, Stephen Robertson, Jaap Kamps
Modeling the Score Distributions of Relevant and Non-relevant Documents

Empirical modeling of the score distributions associated with retrieved documents is an essential task for many retrieval applications. In this work, we propose modeling the relevant documents’ scores by a mixture of Gaussians and modeling the non-relevant scores by a Gamma distribution. Applying variational inference we automatically trade-off the goodness-of-fit with the complexity of the model. We test our model on traditional retrieval functions and actual search engines submitted to TREC. We demonstrate the utility of our model in inferring precision-recall curves. In all experiments our model outperforms the dominant exponential-Gaussian model.

Evangelos Kanoulas, Virgil Pavlu, Keshi Dai, Javed A. Aslam
Modeling Expected Utility of Multi-session Information Distillation

An open challenge in information distillation is the evaluation and optimization of the utility of ranked lists with respect to flexible user interactions over multiple sessions. Utility depends on both the relevance and novelty of documents, and the novelty in turn depends on the user interaction history. However, user behavior is non-deterministic. We propose a new probabilistic framework for stochastic modeling of user behavior when browsing multi-session ranked lists, and a novel approximation method for efficient computation of the expected utility over numerous user-interaction patterns. Using this framework, we present the first utility-based evaluation over multi-session search scenarios defined on the TDT4 corpus of news stories, using a state-of-the-art information distillation system. We demonstrate that the distillation system obtains a 56.6% utility enhancement by combining multi-session adaptive filtering with novelty detection and utility-based optimization of system parameters for optimal ranked list lengths.

Yiming Yang, Abhimanyu Lad
Specificity Aboutness in XML Retrieval

This paper presents a theoretical methodology to evaluate filters in XML retrieval. Theoretical evaluation is concerned with the formal investigation of qualitative properties of retrieval models. XML retrieval deals with retrieving those document components that specifically answer a query, and filters are a method of delivering the most focused answers. Our theoretical evaluation will critically analyse how filters achieve this.

Tobias Blanke, Mounia Lalmas

Novelty and Diversity

An Effectiveness Measure for Ambiguous and Underspecified Queries

Building upon simple models of user needs and behavior, we propose a new measure of novelty and diversity for information retrieval evaluation. We combine ideas from three recently proposed effectiveness measures in an attempt to achieve a balance between the complexity of genuine user needs and the simplicity required for feasible evaluation.

Charles L. A. Clarke, Maheedhar Kolla, Olga Vechtomova
An Analysis of NP-Completeness in Novelty and Diversity Ranking

A useful ability for search engines is to be able to rank objects with novelty and diversity: the top k documents retrieved should cover possible interpretations of a query with some distribution, or should contain a diverse set of subtopics related to the user’s information need, or contain nuggets of information with little redundancy. Evaluation measures have been introduced to measure the effectiveness of systems at this task, but these measures have worst-case NP-complete computation time. We use simulation to investigate the implications of this for optimization and evaluation of retrieval systems.

Ben Carterette
From “Identical” to “Similar”: Fusing Retrieved Lists Based on Inter-document Similarities

We present a novel approach to fusing document lists that are retrieved in response to a query. Our approach is based on utilizing information induced from inter-document similarities. Specifically, the key insight guiding the derivation of our methods is that similar documents from different lists can provide relevance-status support to each other. We use a graph-based method to model relevance-status propagation between documents. The propagation is governed by inter-document similarities and by retrieval scores of documents in the lists. Empirical evaluation shows the effectiveness of our methods in fusing TREC runs.

Anna Khudyak Kozorovitzky, Oren Kurland

Short Papers

New Perspectives in IR

A Quantum-Based Model for Interactive Information Retrieval

Even the best information retrieval model cannot always identify the most useful answers to a user query. This is in particular the case with web search systems, where it is known that users tend to minimise their effort to access relevant information. It is, however, believed that the interaction between users and a retrieval system, such as a web search engine, can be exploited to provide better answers to users. Interactive Information Retrieval (IR) systems, in which users access information through a series of interactions with the search system, are concerned with building models for IR, where interaction plays a central role. In this paper, we propose a general framework for interactive IR that is able to capture the full interaction process in a principled way. Our approach relies upon a generalisation of the probability framework of quantum physics.

Benjamin Piwowarski, Mounia Lalmas
The Quantum Probability Ranking Principle for Information Retrieval

While the Probability Ranking Principle for Information Retrieval provides the basis for formal models, it makes a very strong assumption regarding the dependence between documents. However, it has been observed that in real situations this assumption does not always hold. In this paper we propose a reformulation of the Probability Ranking Principle based on quantum theory. Quantum probability theory naturally includes interference effects between events. We posit that this interference captures the dependency between judgements of document relevance. The outcome is a more sophisticated principle, the Quantum Probability Ranking Principle, that provides a more sensitive ranking which caters for interference/dependence between documents’ relevance.

Guido Zuccon, Leif A. Azzopardi, Keith van Rijsbergen
Written Texts as Statistical Mechanical Problem

In this article we present a model of human written text based on statistical mechanics considerations. The empirical derivation of the potential energy for the parts of the text and the calculation of the thermodynamic parameters of the system show that the “specific heat” corresponds to the semantic classification of the words in the text, separating keywords, function words, and common words. This can give advantages when the model is used in text searching mechanisms.

Kostadin Koroutchev, Elka Korutcheva, Jian Shen
What Happened to Content-Based Information Filtering?

Personalisation can have a significant impact on the way information is disseminated on the web today. Information Filtering can be a significant ingredient towards a personalised web. Collaborative Filtering is already being applied successfully for generating personalised recommendations of music tracks, books, movies and more. The same is not true for Content-Based Filtering. In this paper, we identify some possible reasons for the notable absence of a broad range of personalised information delivery and dissemination services on the web today. We advocate that a more holistic approach to user profiling is required and we discuss the series of still open, challenging research issues raised.

Nikolaos Nanas, Anne De Roeck, Manolis Vavalis

Retrieval Models

Prior Information and the Determination of Event Spaces in Probabilistic Information Retrieval Models

A mismatch between different event spaces has been used to argue against rank equivalence of classic probabilistic models of information retrieval and language models. We question the effectiveness of this strategy and we argue that a convincing solution should be sought in a correct procedure to design adequate priors for probabilistic reasoning. Acknowledging our solution of the event-space issue invites us to rethink the relation between probabilistic models, statistics and logic in the context of IR.

Corrado Boscarino, Arjen P. de Vries
Robust Word Similarity Estimation Using Perturbation Kernels

We introduce perturbation kernels, a new class of similarity measure for information retrieval that casts word similarity in terms of multi-task learning. Perturbation kernels model uncertainty in the user’s query by choosing a small number of variations in the relative weights of the query terms to build a more complete picture of the query context, which is then used to compute a form of expected distance between words. Our approach has a principled mathematical foundation, a simple analytical form, and makes few assumptions about the underlying retrieval model, making it easy to apply in a broad family of existing query expansion and model estimation algorithms.

Kevyn Collins-Thompson
Possibilistic Similarity Estimation and Visualization

In this paper, we present a very general approach to representing and visualizing the similarity between objects with heterogeneous, imperfect, and missing attributes, in order to achieve efficient analysis and retrieval of information by organizing and gathering these objects into meaningful groups. Our method is essentially based on possibility theory to estimate the similarity, and on spatial, graphical, and clustering-based representational models to visualize and represent its structure. Our approach is applied to a real digestive image database (http://i3se009d.univ-brest.fr/ password view2006 [4]). Without any a priori medical knowledge concerning the key attributes of the pathologies, and without any complicated preprocessing of the imperfect data, results show that we are able to visualize and organize the different categories of digestive pathologies. These results were validated by a physician.

Anas Dahabiah, John Puentes, Basel Solaiman
A New Measure of the Cluster Hypothesis

We have found that the nearest neighbor (NN) test is an insufficient measure of the cluster hypothesis. The NN test is a local measure of the cluster hypothesis. Designers of new document-to-document similarity measures may incorrectly report effective clustering of relevant documents if they use the NN test alone. Utilizing a measure from network analysis, we present a new, global measure of the cluster hypothesis: normalized mean reciprocal distance. When used together with a local measure, such as the NN test, this new global measure allows researchers to better measure the cluster hypothesis.

Mark D. Smucker, James Allan

User Aspects

Explaining User Performance in Information Retrieval: Challenges to IR Evaluation

The paper makes three points of significance for IR research: (1) The Cranfield paradigm of IR evaluation seems to lose power when one looks at human instead of system performance. (2) Searchers using IR systems in real life use rather short queries, which individually often have poor performance. However, when used in sessions, they may be surprisingly effective. The searchers’ strategies have not been sufficiently described and cannot therefore be properly understood, supported, or evaluated. (3) Searchers in real life seek to optimize the entire information access process, not just result quality. Evaluation of output alone is insufficient to explain searcher behavior.

Kalervo Järvelin
A Four-Factor User Interaction Model for Content-Based Image Retrieval

In order to bridge the “semantic gap”, a number of relevance feedback (RF) mechanisms have been applied to content-based image retrieval (CBIR). However, current RF techniques in most existing CBIR systems still lack satisfactory user interaction, although some work has been done to improve the interaction as well as the search accuracy. In this paper, we propose a four-factor user interaction model and investigate its effects on CBIR by an empirical evaluation. Whilst the model was developed for our research purposes, we believe the model could be adapted to any content-based search system.

Haiming Liu, Victoria Uren, Dawei Song, Stefan Rüger
Predicting Query Performance by Query-Drift Estimation

Predicting query performance, that is, the effectiveness of a search performed in response to a query, is a highly important and challenging problem. Our novel approach to addressing this challenge is based on estimating the potential amount of query drift in the result list, i.e., the presence (and dominance) of aspects or topics not related to the query in top-retrieved documents. We argue that query drift can potentially be estimated by measuring the diversity (e.g., standard deviation) of the retrieval scores of these documents. Empirical evaluation demonstrates the prediction effectiveness of our approach for several retrieval models. Specifically, the prediction success is better, over most tested TREC corpora, than that of state-of-the-art prediction methods.
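The score-diversity idea can be sketched in a few lines; `drift_predictor` and its parameters are illustrative names for the general idea, not the paper's actual estimator:

```python
import statistics

def drift_predictor(scores, k=100):
    """Standard deviation of the top-k retrieval scores. The intuition:
    a nearly flat score curve suggests query drift (no document stands
    out for the query), predicting poor effectiveness; a wide spread
    predicts better effectiveness."""
    top = sorted(scores, reverse=True)[:k]
    return statistics.pstdev(top)

# a result list whose top scores spread widely is predicted to perform
# better than one whose scores are nearly uniform
focused = drift_predictor([9.1, 7.4, 6.0, 3.2, 3.0, 2.9])
drifted = drift_predictor([4.1, 4.0, 4.0, 3.9, 3.9, 3.8])
```

In practice such predictors are evaluated by correlating their output with a measured effectiveness metric (e.g. average precision) over a query set.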

Anna Shtok, Oren Kurland, David Carmel

Web Models

What’s in a Link? From Document Importance to Topical Relevance

Web information retrieval is best known for its use of the Web’s link structure as a source of evidence. Global link evidence is by nature query-independent, and is therefore no direct indicator of the topical relevance of a document for a given search request. As a result, link information is usually considered to be useful to identify the ‘importance’ of documents. Local link evidence, in contrast, is query-dependent and could in principle be related to the topical relevance. We analyse the link evidence in Wikipedia using a large set of ad hoc retrieval topics and relevance judgements to investigate the relation between link evidence and topical relevance.

Marijn Koolen, Jaap Kamps
Avoiding Bias in Text Clustering Using Constrained K-means and May-Not-Links

In this paper we present a new clustering algorithm which extends traditional batch k-means, enabling the introduction of domain knowledge in the form of Must, Cannot, May and May-Not rules between the data points. We have also applied the presented method to the task of avoiding bias in clustering. Evaluation carried out on standard collections showed considerable improvements in effectiveness against previous constrained and non-constrained algorithms for the given task.

M. Eduardo Ares, Javier Parapar, Álvaro Barreiro
Optimizing WebPage Interest

In the rapidly evolving and growing environment of the internet, web site owners aim to maximize interest in their web site. In this article we propose a model, which combines the static structure of the internet with activity based data, to compute an interest based ranking. This ranking can be used to gain more insight into the flow of users over the internet, optimize the position of a web site, and improve strategic decisions and investments. The model consists of a static centrality based component and a dynamic activity based component. The components are used to create a Markov Model in order to compute a ranking.

Willem Elbers, Theo van der Weide

Posters

The “Beautiful” in Information
Philosophy of Aesthetics and Information Visualization

At the intersection of retrieval and visualization are opportunities to learn more about IR’s view of knowledge and evidence by considering Kantian and post-modern philosophies of aesthetics.

Gerald Benoit
IR Evaluation without a Common Set of Topics

Usually, system effectiveness evaluation in a TREC-like environment is performed on a common set of topics. We show that even when using different topics for different systems, a reliable evaluation can be obtained, and that reliability increases by using appropriate topic selection strategies and metric normalizations.

Matteo Cattelan, Stefano Mizzaro
An Ad Hoc Information Retrieval Perspective on PLSI through Language Model Identification

This paper proposes a new document–query similarity for PLSI that allows queries to be used in PLSI without folding-in. We compare this similarity to Fisher kernels, the state-of-the-art approach for PLSI, on a corpus of 1M+ word occurrences coming from TREC–AP.

Jean-Cédric Chappelier, Emmanuel Eckard
Less Is More: Maximal Marginal Relevance as a Summarisation Feature

Summarisation approaches aim to provide the most salient concepts of a text in a condensed representation. Repetition of extracted material in the generated summary should be avoided. Carbonell and Goldstein proposed Maximal Marginal Relevance as a measure to increase the diversity of documents retrieved by an IR system, and developed a summariser based on MMR. In this paper, we look at the viability of MMR as a feature in the traditional feature-based summarisation approach proposed by Edmundson.
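Carbonell and Goldstein's MMR criterion greedily picks the next item maximising λ·Sim(d, q) − (1 − λ)·max over selected d′ of Sim(d, d′). A minimal sketch of that greedy selection, with toy similarity tables standing in for real scoring functions:

```python
def mmr_select(candidates, query_sim, pair_sim, lam=0.7, k=3):
    """Greedy Maximal Marginal Relevance (Carbonell & Goldstein) selection.

    candidates: item ids; query_sim[d]: relevance of d to the query;
    pair_sim[frozenset((d, d2))]: similarity between two items.
    """
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < k:
        def score(d):
            redundancy = max((pair_sim[frozenset((d, s))] for s in selected),
                             default=0.0)
            return lam * query_sim[d] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With two near-duplicate relevant items and one less relevant but novel item, the novel item outranks the duplicate once the first pick is in place, which is exactly the diversity effect the abstract exploits as a summarisation feature.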

Jan Frederik Forst, Anastasios Tombros, Thomas Roelleke
On the Notion of “An Information Need”

‘Information need’ is a notion in IR that is ubiquitous, important, and intuitively clear. So far, surprisingly, the term seems to have defied formal definition. Of course, IR can continue to prosper without a formalization of ‘information need’. Yet when a field gets more mature there comes a time that frequently used notions should be formalized to make them susceptible of scrutiny. For IR such formalization should (1) be independent of a particular query language or document model, (2) allow that users formulate a need for information that may be unavailable or even nonexistent, and (3) allow that users try to circumscribe the very information they do not possess. To this end, the paper uses lattice theory to define a ‘formal information need’, which, we argue, coincides with the intuitive notion precisely when a user’s need for information can actually be filled.

Eduard Hoenkamp
A Logical Inference Approach to Query Expansion with Social Tags

Query Expansion (QE) refers to the Information Retrieval (IR) technique of adding assumed relevant terms to a query in order to render it more informative, and hence more likely to retrieve relevant documents. A key problem is how to identify the terms to be added, and how to integrate them into the original query. We address this problem by using as expansion terms social tags that are freely available on the Web. We integrate these tags into the query by treating the QE process as a logical inference (initially proposed in [3]) and by considering the addition of tags as an extra deduction to this process. This work extends Nie’s logical inference formalisation of QE to process social tags, and proposes an estimation of tag salience, which is experimentally shown to yield competitive retrieval performance.

Christina Lioma, Roi Blanco, Marie-Francine Moens
Evaluating Mobile Proactive Context-Aware Retrieval: An Incremental Benchmark

We present the evaluation of a novel application for Web content perusal by means of context-aware mobile devices that proactively query an external search engine. To this aim, we develop a TREC-like benchmark and we use it to evaluate different strategies for automatic query construction on the basis of user’s current context. We discuss both the methodology and the results.

Davide Menegon, Stefano Mizzaro, Elena Nazzi, Luca Vassena
Predicting the Usefulness of Collection Enrichment for Enterprise Search

Query Expansion (QE) often improves the retrieval performance of an Information Retrieval (IR) system. However, as enterprise intranets are often sparse in nature, with limited use of alternative lexical representations between authors, it can be advantageous to use Collection Enrichment (CE) to gather higher quality pseudo-feedback documents. In this paper, we propose the use of query performance predictors to selectively apply CE on a per-query basis. We thoroughly evaluate our approach on the CERC standard test collection and its corresponding topic sets from the TREC 2007 & 2008 Enterprise track document search tasks. We experiment with 3 different external resources and 3 different query performance predictors. Our experimental results demonstrate that our proposed approach leads to a significant improvement in retrieval performance.

Jie Peng, Ben He, Iadh Ounis
Ranking List Dispersion as a Query Performance Predictor

In this paper we introduce a novel approach for query performance prediction based on ranking list scores dispersion. Starting from the hypothesis that different score distributions appear for good and poor performance queries, we introduce a set of measures that capture these differences between both types of distributions. The use of measures based on standard deviation of ranking list scores, as a prediction value, shows a significant correlation degree in terms of average precision.
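The core predictor, dispersion of the retrieved scores, reduces to a one-liner. A sketch under the assumption that the standard deviation is taken over the top-k scores of the ranking list (the cut-off is illustrative):

```python
from statistics import pstdev

def score_dispersion(scores, k=100):
    """Query performance predictor: population standard deviation of the
    top-k retrieval scores. Higher dispersion suggests a better-performing
    query; near-uniform scores suggest a poorly discriminating one."""
    top = sorted(scores, reverse=True)[:k]
    return pstdev(top)
```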

Joaquín Pérez-Iglesias, Lourdes Araujo
Semi-subsumed Events: A Probabilistic Semantics of the BM25 Term Frequency Quantification

Through BM25, the asymptotic term frequency quantification TF = tf/(tf + K), where tf is the within-document term frequency and K is a normalisation factor, became popular. This paper reports a finding regarding the meaning of the TF quantification: in the triangle of independence and subsumption, the TF quantification forms the altitude, that is, the middle between independent and subsumed events. We refer to this new assumption as semi-subsumed. While this finding of a well-defined probabilistic assumption solves the probabilistic interpretation of the BM25 TF quantification, it is also of wider impact regarding probability theory.
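The saturating quantification tf/(tf + K) is easy to compute directly; with K = k1·((1 − b) + b·dl/avgdl), the usual BM25 length normalisation. The parameter values below are the standard illustrative defaults, not values from the paper:

```python
def bm25_tf(tf, k1=1.2, b=0.75, dl=100, avgdl=100):
    """BM25 TF component: tf / (tf + K), with
    K = k1 * ((1 - b) + b * dl / avgdl).
    Monotone in tf and asymptotically bounded by 1 (saturation)."""
    K = k1 * ((1 - b) + b * dl / avgdl)
    return tf / (tf + K)
```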

Hengzhi Wu, Thomas Roelleke
Batch-Mode Computational Advertising Based on Modern Portfolio Theory

The research on computational advertising so far has focused on finding the single best ad. However, in many real situations, more than one ad can be presented. Although it is possible to address this problem myopically by using a single-ad optimisation technique in serial-mode, i.e., one at a time, this approach can be ineffective and inefficient because it ignores the correlation between ads. In this paper, we make a leap forward to address the problem of finding the best ads in batch-mode, i.e., assembling the optimal set of ads to be presented altogether. The key idea is to achieve maximum revenue while controlling the level of risk by diversifying the set of ads. We show how the Modern Portfolio Theory can be applied to this problem to provide elegant solutions and deep insights.
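The mean-variance idea behind Modern Portfolio Theory can be sketched for the batch-ad case: score a candidate set by its total expected revenue minus a risk penalty on the portfolio variance, so that correlated ads are discouraged. The brute-force enumeration and the risk-aversion weight are illustrative assumptions, not the paper's method.

```python
from itertools import combinations

def best_ad_set(expected, cov, size, risk_aversion=1.0):
    """Batch-mode ad selection via a Markowitz-style mean-variance score:
    score(S) = sum of expected revenues over S
               - risk_aversion * portfolio variance of S.

    expected: per-ad expected revenue; cov: covariance matrix.
    Returns the best index tuple of the given size (brute force)."""
    n = len(expected)
    def score(s):
        mean = sum(expected[i] for i in s)
        var = sum(cov[i][j] for i in s for j in s)
        return mean - risk_aversion * var
    return max(combinations(range(n), size), key=score)
```

With two highly correlated high-revenue ads and one independent ad, the diversified pair beats the myopic "two best ads" choice, which is the batch-mode effect the abstract argues for.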

Dell Zhang, Jinsong Lu
Backmatter
Metadata
Title
Advances in Information Retrieval Theory
Edited by
Leif Azzopardi
Gabriella Kazai
Stephen Robertson
Stefan Rüger
Milad Shokouhi
Dawei Song
Emine Yilmaz
Copyright Year
2009
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-04417-5
Print ISBN
978-3-642-04416-8
DOI
https://doi.org/10.1007/978-3-642-04417-5
