2009 | Book

Advances in Information Retrieval

31st European Conference on IR Research, ECIR 2009, Toulouse, France, April 6-9, 2009. Proceedings

Edited by: Mohand Boughanem, Catherine Berrut, Josiane Mothe, Chantal Soule-Dupuy

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science

About this book

This book constitutes the refereed proceedings of the 31st annual European Conference on Information Retrieval Research, ECIR 2009, held in Toulouse, France in April 2009. The 42 revised full papers and 18 revised short papers presented together with the abstracts of 3 invited lectures and 25 poster papers were carefully reviewed and selected from 188 submissions. The papers are organized in topical sections on retrieval model, collaborative IR / filtering, learning, multimedia - metadata, expert search - advertising, evaluation, opinion detection, web IR, representation, clustering / categorization as well as distributed IR.

Table of Contents

Frontmatter

Invited Presentations

Query Evolution

Search engine queries have evolved over the past 30 years from complex Boolean formulations to short lists of “keywords.” Despite the apparent simplicity of short queries, choosing the right keywords can be difficult, and understanding user intentions is a major challenge. Techniques such as query expansion and context-based profiles have been developed to address these problems, but with limited success. Rather than trying to infer user intentions from very short queries, another approach is to improve query processing and retrieval models for long queries. In particular, query transformation is a new approach to improving search that appears to have considerable potential. In this approach, queries are transformed into one or more new queries using probabilistic models for generation or search of query archives. I will describe various transformation models and the role of a retrieval model in using these transformations. Examples will be given from applications such as collaborative question answering and forum search.

W. Bruce Croft
Searching User Generated Content: What’s Next?

In recent years, blog search has received a lot of attention. Since the launch of dedicated evaluation efforts and the release of blog data sets, our understanding of blog search has deepened considerably. But there’s more to user generated content than blogs and there’s more to searching user generated content than looking for material that is relevant or opinionated or highly rated by readers or authors. In this talk, a number of user generated content search scenarios from the media analysis and intelligence domains will be detailed. From these, recurring themes such as credibility, people finding, impact prediction, unusual event detection, report and summary generation—all on user generated content—will be identified as highly relevant research directions.

Maarten de Rijke
Upcoming Industrial Needs for Search

Enterprise search and web searching have different goals and characteristics. Whereas internet browsing can sometimes be seen as a form of entertainment, enterprise search involves activities in which search is mainly a tool. People have work they want to get done. In this context, the idea of relevance in documents is different. Community can become as important as content in search. Work-related search engines of the future will provide much greater analysis and structuring of documents at index time, and searchers will have more powerful tools at retrieval time. We will discuss these and other trends, and show what new methods and techniques should be targeted to improve enterprise search.

Gregory Grefenstette

Retrieval Model I

Mean-Variance Analysis: A New Document Ranking Theory in Information Retrieval

This paper concerns document ranking in information retrieval. In information retrieval systems, the widely accepted probability ranking principle (PRP) suggests that, for optimal retrieval, documents should be ranked in order of decreasing probability of relevance. In this paper, we present a new document ranking paradigm, arguing that a better, more general solution is to optimize the top-n ranked documents as a whole, rather than ranking them independently. Inspired by the Modern Portfolio Theory in finance, we quantify a ranked list of documents on the basis of its expected overall relevance (mean) and its variance; the latter serves as a measure of risk, which was rarely studied for document ranking in the past. Through the analysis of the mean and variance, we show that an optimal rank order is the one that maximizes the overall relevance (mean) of the ranked list at a given risk level (variance). Based on this principle, we then derive an efficient document ranking algorithm. It extends the PRP by considering both the uncertainty of relevance predictions and correlations between retrieved documents. Furthermore, we quantify the benefits of diversification, and theoretically show that diversifying documents is an effective way to reduce the risk of document ranking. Experimental results on the collaborative filtering problem confirm the theoretical insights with improved recommendation performance, e.g., achieving over a 300% performance gain over the PRP-based ranking in the user-based recommendation setting.

Jun Wang
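The portfolio-style ranking described above lends itself to a compact illustration. The sketch below is not the authors' algorithm, only a minimal greedy ranker in which the risk-aversion parameter b, the relevance means, and the covariance matrix are invented for the example; it shows how penalizing variance and covariance with already selected documents produces diversification.

```python
# Hedged sketch of mean-variance document ranking (illustrative only).
# `means` holds estimated relevance for each candidate document and
# `cov` a covariance matrix over those estimates; `b` weighs risk.
import numpy as np

def mean_variance_rank(means, cov, b=5.0, k=10):
    """Greedily build a top-k ranking that rewards expected relevance and
    penalizes variance plus covariance with documents already selected."""
    means = np.asarray(means, dtype=float)
    remaining = list(range(len(means)))
    ranking = []
    while remaining and len(ranking) < k:
        best, best_score = None, -np.inf
        for i in remaining:
            risk = cov[i, i] + 2.0 * sum(cov[i, j] for j in ranking)
            score = means[i] - b * risk
            if score > best_score:
                best, best_score = i, score
        ranking.append(best)
        remaining.remove(best)
    return ranking

# Example: documents 0 and 1 are highly correlated, document 2 is not.
means = [0.9, 0.85, 0.6]
cov = np.array([[0.04, 0.03, 0.00],
                [0.03, 0.04, 0.00],
                [0.00, 0.00, 0.02]])
# With this risk aversion the uncorrelated document 2 is ranked second: [0, 2, 1].
print(mean_variance_rank(means, cov, b=5.0, k=3))
```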
Risk-Aware Information Retrieval

Probabilistic retrieval models usually rank documents based on a scalar quantity. However, such models lack any estimate for the uncertainty associated with a document’s rank. Further, such models seldom have an explicit utility (or cost) that is optimized when ranking documents. To address these issues, we take a Bayesian perspective that explicitly considers the uncertainty associated with the estimation of the probability of relevance, and propose an asymmetric cost function for document ranking. Our cost function has the advantage of adjusting the risk in document retrieval via a single parameter for any probabilistic retrieval model. We use the logit model to transform the document posterior distribution with probability space [0,1] into a normal distribution with variable space (−∞, +∞). We apply our risk adjustment approach to a language modelling framework for risk-adjustable document ranking. Our experimental results show that our risk-aware model can significantly improve the performance of language models, both with and without background smoothing. When our method is applied to a language model without background smoothing, it can perform as well as the Dirichlet smoothing approach.

Jianhan Zhu, Jun Wang, Michael Taylor, Ingemar J. Cox
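To make the risk-adjustment idea concrete, here is a minimal, hedged sketch (not the paper's cost function): a relevance probability is mapped to logit space, where uncertainty can be represented by a normal distribution, and the score is shifted by a multiple of the estimated standard deviation, so a single parameter b controls how strongly uncertainty is penalized. The variance estimates and b below are assumptions of the example.

```python
import math

def logit(p):
    """Map a probability in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))

def risk_adjusted_score(p_rel, variance, b=1.0):
    """Illustrative risk-adjusted ranking score: the logit-transformed
    relevance estimate shifted by b standard deviations.  b > 0 penalizes
    uncertain estimates (risk-averse), b < 0 rewards them (risk-seeking)."""
    return logit(p_rel) - b * math.sqrt(variance)

# Two documents with the same point estimate but different uncertainty.
print(risk_adjusted_score(0.7, 0.01, b=1.0))   # more certain -> higher score
print(risk_adjusted_score(0.7, 0.25, b=1.0))   # less certain -> lower score
```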
A Comparative Study of Utilizing Topic Models for Information Retrieval

We explore the utility of different types of topic models for retrieval purposes. Based on prior work, we describe several ways that topic models can be integrated into the retrieval process. We evaluate the effectiveness of different types of topic models within those retrieval approaches. We show that: (1) topic models are effective for document smoothing; (2) more rigorous topic models such as Latent Dirichlet Allocation provide gains over cluster-based models; (3) more elaborate topic models that capture topic dependencies provide no additional gains; (4) smoothing documents by using their similar documents is as effective as smoothing them by using topic models; (5) doing query expansion should utilize topics discovered in the top feedback documents instead of coarse-grained topics from the whole corpus; (6) generally, incorporating topics in the feedback documents for building relevance models can benefit the performance more for queries that have more relevant documents.

Xing Yi, James Allan

Collaborative IR/Filtering

Synchronous Collaborative Information Retrieval: Techniques and Evaluation

Synchronous Collaborative Information Retrieval refers to systems that support multiple users searching together at the same time in order to satisfy a shared information need. To date most SCIR systems have focussed on providing various awareness tools in order to enable collaborating users to coordinate the search task. However, requiring users to both search and coordinate the group activity may prove too demanding. On the other hand, without effective coordination policies the group search may not be effective. In this paper we propose and evaluate novel system-mediated techniques for coordinating a group search. These techniques allow for an effective division of labour across the group whereby each group member can explore a subset of the search space. We also propose and evaluate techniques to support automated sharing of knowledge across searchers in SCIR, through novel collaborative and complementary relevance feedback techniques. In order to evaluate these techniques, we propose a framework for SCIR evaluation based on simulations. To populate these simulations we extract data from TREC interactive search logs. This work represents the first simulations of SCIR to date and the first such use of this TREC data.

Colum Foley, Alan F. Smeaton
Movie Recommender: Semantically Enriched Unified Relevance Model for Rating Prediction in Collaborative Filtering

Collaborative recommender systems aim to recommend items to a user based on the information gathered from other users who have similar interests. The current state-of-the-art systems fail to consider the underlying semantics involved when rating an item. This in turn contributes to many false recommendations. These models hinder the possibility of explaining why a user has a particular interest or why a user likes a particular item. In this paper, we develop an approach incorporating the underlying semantics involved in the rating. Experiments on a movie database show that this improves the accuracy of the model.

Yashar Moshfeghi, Deepak Agarwal, Benjamin Piwowarski, Joemon M. Jose
Revisiting IR Techniques for Collaborative Search Strategies

This paper revisits some of the established Information Retrieval (IR) techniques to investigate effective collaborative search strategies. We devised eight search strategies that divided labour and shared knowledge in teams using relevance feedback and clustering. We evaluated the performance of the strategies with a user simulation enhanced by a query-pooling method. Our results show that relevance feedback is successful at formulating effective collaborative strategies, while further effort is needed for clustering. We also measured the extent to which additional members improved performance, and the effect of search progress on that improvement.

Hideo Joho, David Hannah, Joemon M. Jose

Learning

Active Sampling for Rank Learning via Optimizing the Area under the ROC Curve

Learning ranking functions is crucial for solving many problems, ranging from document retrieval to building recommendation systems based on an individual user’s preferences or on collaborative filtering. Learning-to-rank is particularly necessary for adaptive or personalizable tasks, including email prioritization, individualized recommendation systems, personalized news clipping services and so on. Whereas the learning-to-rank challenge has been addressed in the literature, little work has been done in an active-learning framework, where requisite user feedback is minimized by selecting only the most informative instances to train the rank learner. This paper addresses active rank-learning head on, proposing a new sampling strategy based on minimizing hinge rank loss, and demonstrating the effectiveness of the active sampling method for rankSVM on two standard rank-learning datasets. The proposed method shows convincing results in optimizing three performance metrics, as well as improvement against four baselines including entropy-based, divergence-based, uncertainty-based and random sampling methods.

Pinar Donmez, Jaime G. Carbonell
Regression Rank: Learning to Meet the Opportunity of Descriptive Queries

We present a new learning to rank framework for estimating context-sensitive term weights without use of feedback. Specifically, knowledge of effective term weights on past queries is used to estimate term weights for new queries. This generalization is achieved by introducing secondary features correlated with term weights and applying regression to predict term weights given features. To improve support for more focused retrieval like question answering, we conduct document retrieval experiments with TREC description queries on three document collections. Results show significantly improved retrieval accuracy.

Matthew Lease, James Allan, W. Bruce Croft
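A minimal sketch of the regression idea, under invented features and weights: term weights that worked well on past queries are regressed onto simple secondary features, and the fitted model predicts weights for the terms of a new description query. The feature set, target values, and example terms are all hypothetical.

```python
# Minimal sketch of "learn term weights by regression" (illustrative only).
from sklearn.linear_model import LinearRegression

# Training data: one row of secondary features per term (e.g. IDF, a
# content-word flag, position in the query) and the term weight that
# worked well on a past query.  All values are invented for the example.
X_train = [[4.2, 1, 0], [1.1, 0, 1], [6.8, 1, 2], [0.9, 0, 3]]
y_train = [0.90, 0.10, 0.80, 0.05]

model = LinearRegression().fit(X_train, y_train)

# At query time, predict a weight for each term of a new description query.
new_query_features = [[5.0, 1, 0], [1.3, 0, 1]]
predicted_weights = model.predict(new_query_features)
print(dict(zip(["archaeology", "the"], predicted_weights)))
```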
Active Learning Strategies for Multi-Label Text Classification

Active learning refers to the task of devising a ranking function that, given a classifier trained from relatively few training examples, ranks a set of additional unlabeled examples in terms of how much further information they would carry, once manually labeled, for retraining a (hopefully) better classifier. Research on active learning in text classification has so far concentrated on single-label classification; active learning for multi-label classification, instead, has either been tackled in a simulated (and, we contend, non-realistic) way, or neglected tout court. In this paper we aim to fill this gap by examining a number of realistic strategies for tackling active learning for multi-label classification. Each such strategy consists of a rule for combining the outputs returned by the individual binary classifiers as a result of classifying a given unlabeled document. We present the results of extensive experiments in which we test these strategies on two standard text classification datasets.

Andrea Esuli, Fabrizio Sebastiani
Joint Ranking for Multilingual Web Search

Ranking for multilingual information retrieval (MLIR) is the task of ranking documents of different languages solely based on their relevance to the query, regardless of the query’s language. Existing approaches focus on combining relevance scores from different retrieval settings, but do not learn the ranking function directly. We approach Web MLIR ranking within the learning-to-rank (L2R) framework. Besides adapting popular L2R algorithms to MLIR, a joint ranking model is created to exploit the correlations among documents and induce the joint relevance probability for all the documents. In this way, the relevant documents of one language can be leveraged to improve the relevance estimation for documents of other languages. A probabilistic graphical model is trained for the joint relevance estimation. In particular, a hidden layer of nodes is introduced to represent the salient topics among the retrieved documents, and the ranks of the relevant documents and topics are determined collaboratively as the model approaches its thermal equilibrium. Furthermore, the model parameters are trained under two settings: (1) optimizing the accuracy of identifying relevant documents; (2) directly optimizing information retrieval evaluation measures, such as mean average precision. Benchmarks show that our model significantly outperforms existing approaches for MLIR tasks.

Wei Gao, Cheng Niu, Ming Zhou, Kam-Fai Wong

Multimedia - Metadata

Diversity, Assortment, Dissimilarity, Variety: A Study of Diversity Measures Using Low Level Features for Video Retrieval

In this paper we present a number of methods for re-ranking video search results in order to introduce diversity into the set of search results. The usefulness of these approaches is evaluated in comparison with similarity based measures, for the TRECVID 2007 collection and tasks [11]. For the MAP of the search results we find that some of our approaches perform as well as similarity based methods. We also find that some of these results can improve the P@N values for some of the lower N values. The most successful of these approaches was then implemented in an interactive search system for the TRECVID 2008 interactive search tasks. The responses from the users indicate that they find the more diverse search results extremely useful.

Martin Halvey, P. Punitha, David Hannah, Robert Villa, Frank Hopfgartner, Anuj Goyal, Joemon M. Jose
Bayesian Mixture Hierarchies for Automatic Image Annotation

Previous research on automatic image annotation has shown that accurate estimates of the class conditional densities in generative models have a positive effect in annotation performance. We focus on the problem of density estimation in the context of automatic image annotation and propose a novel Bayesian hierarchical method for estimating mixture models of Gaussian components. The proposed methodology is examined in a well-known benchmark image collection and the results demonstrate its competitiveness with the state of the art.

Vassilios Stathopoulos, Joemon M. Jose
XML Multimedia Retrieval: From Relevant Textual Information to Relevant Multimedia Fragments

In this paper, we are interested in XML multimedia document retrieval, whose aim is to find relevant multimedia components (i.e. XML fragments containing a medium other than text) that focus on the user needs. The work described here is carried out with images, but can be extended to any other medium. We propose an XML multimedia fragment retrieval approach based on two steps. In the first step, we search for relevant images; we then retrieve the best multimedia fragments corresponding to these images. Image retrieval is done using textual and structural information from ascendant, sibling and direct descendant nodes in the XML tree, while multimedia fragment retrieval is done by evaluating the score of ancestors of the images retrieved in the first step. Experiments were done on the INEX 2006 and 2007 Multimedia Fragments tasks and show the effectiveness of our method.

Mouna Torjmen, Karen Pinel-Sauvagnat, Mohand Boughanem
Effectively Searching Maps in Web Documents

Maps are an important source of information in archaeology and other sciences. Users want to search for historical maps to determine the recorded history of the political geography of regions at different eras, to find out where exactly archaeological artifacts were discovered, etc. Currently, they have to use a generic search engine and add the term map along with other keywords to search for maps. This crude method generates a significant number of false positives that the user needs to cull through to get the desired results. To reduce this manual effort, we propose an automatic map identification, indexing, and retrieval system that enables users to search and retrieve maps appearing in a large corpus of digital documents using simple keyword queries. We identify features that can help distinguish maps from other figures in digital documents and show how a Support-Vector-Machine-based classifier can be used to identify maps. We propose map-level metadata (e.g., captions, references to the maps in text) and document-level metadata (e.g., title, abstract, citations, how recent the publication is) and show how they can be automatically extracted and indexed. Our novel ranking algorithm weights different metadata fields differently and also uses the document-level metadata to help rank retrieved maps. Empirical evaluations show which features should be selected and which metadata fields should be weighted more. We also demonstrate improved retrieval results in comparison to adaptations of existing methods for map retrieval. Our map search engine has been deployed in an online map-search system that is part of the Blind-Review digital library system.

Qingzhao Tan, Prasenjit Mitra, C. Lee Giles

Expert Search - Advertising

Enhancing Expert Finding Using Organizational Hierarchies

The task in expert finding is to identify members of an organization with relevant expertise on a given topic. In existing expert finding systems, profiles are constructed from sources such as email or documents, and used as the basis for expert identification. In this paper, we leverage the organizational hierarchy (depicting relationships between managers, subordinates, and peers) to find members for whom we have little or no information. We propose an algorithm to improve expert finding performance by considering not only the expertise of the member, but also the expertise of his or her neighbors. We show that providing this additional information to an expert finding system improves its retrieval performance.

Maryam Karimzadehgan, Ryen W. White, Matthew Richardson
A Vector Space Model for Ranking Entities and Its Application to Expert Search

Entity Ranking has recently become an important search task in Information Retrieval. The goal is not to find documents matching query terms but, instead, to find entities. In this paper we propose a formal model for searching entities as well as a complete Entity Ranking system, providing examples of its application to the enterprise context. We experimentally evaluate our system on the Expert Search task in order to show how it can be adapted to different scenarios. The results show that, by combining simple IR techniques, we improve P@10 by 53% over our baseline.

Gianluca Demartini, Julien Gaugaz, Wolfgang Nejdl
Sentiment-Oriented Contextual Advertising

Web advertising (online advertising) is a form of promotion that uses the World Wide Web for the express purpose of delivering marketing messages to attract customers. This paper addresses the mechanism of content-oriented advertising (contextual advertising), which refers to the assignment of relevant ads within the content of a generic web page, e.g. blogs. As blogs become a platform for expressing personal opinion, they naturally contain various kinds of expressions, including both facts and comments of both a positive and negative nature. In this paper, we propose the utilization of sentiment detection to improve Web-based contextual advertising. The proposed SOCA (Sentiment-Oriented Contextual Advertising) framework aims to combine contextual advertising matching with sentiment analysis to select ads that are related to the positive (and neutral) aspects of a blog and rank them according to their relevance. We experimentally validate our approach using a set of data that includes both real ads and actual blog pages. The results clearly indicate that our proposed method can effectively identify those ads that are positively correlated with the given blog pages.

Teng-Kai Fan, Chia-Hui Chang
Lexical Graphs for Improved Contextual Ad Recommendation

Contextual advertising is a form of online advertising presenting consistent revenue growth since its inception. In this work, we study the problem of recommending a small set of ads to a user based solely on the currently viewed web page, often referred to as content-targeted advertising. Matching ads with web pages is a challenging task for traditional information retrieval systems due to the brevity and sparsity of advertising text, which leads to the widely recognized vocabulary impedance problem. To this end, we propose the use of lexical graphs created from web corpora as a means of computing improved content similarity metrics between ads and web pages. The results of our experimental study provide evidence of significant improvement in the perceived relevance of the recommended ads.

Symeon Papadopoulos, Fotis Menemenis, Yiannis Kompatsiaris, Ben Bratu

Retrieval Model II

A Probabilistic Retrieval Model for Semistructured Data

Retrieving semistructured (XML) data typically requires either a structured query such as XPath, or a keyword query that does not take structure into account. In this paper, we infer structural information automatically from keyword queries and incorporate this into a retrieval model. More specifically, we propose the concept of a mapping probability, which maps each query word into a related field (or XML element). This mapping probability is used as a weight to combine the language models estimated from each field. Experiments on two test collections show that our retrieval model based on mapping probabilities outperforms baseline techniques significantly.

Jinyoung Kim, Xiaobing Xue, W. Bruce Croft
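The following is a rough sketch, not the paper's exact estimation procedure, of how a mapping probability P(field | term) can weight a mixture of per-field language models; the mapping probabilities, Dirichlet smoothing, and the toy movie record are assumptions made for illustration.

```python
import math

def score(query_terms, doc_fields, mapping_prob, mu=100.0, collection_prob=None):
    """Illustrative scoring: each query term's probability is a mixture of
    per-field language models, weighted by the term-to-field mapping
    probability P(field | term).  Dirichlet smoothing is assumed."""
    collection_prob = collection_prob or {}
    log_score = 0.0
    for t in query_terms:
        p_t = 0.0
        for field, text in doc_fields.items():
            tokens = text.lower().split()
            tf = tokens.count(t)
            p_bg = collection_prob.get(t, 1e-6)             # background model (assumed)
            p_field = (tf + mu * p_bg) / (len(tokens) + mu)  # smoothed field LM
            p_t += mapping_prob.get((t, field), 0.0) * p_field
        log_score += math.log(max(p_t, 1e-12))
    return log_score

# Hypothetical semistructured record and term-to-field mapping probabilities.
doc = {"title": "Star Wars", "cast": "Harrison Ford Mark Hamill", "year": "1977"}
mapping = {("ford", "cast"): 0.8, ("ford", "title"): 0.2,
           ("wars", "title"): 0.9, ("wars", "cast"): 0.1}
print(score(["ford", "wars"], doc, mapping))
```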
Model Fusion in Conceptual Language Modeling

We study in this paper the combination of different concept detection methods for conceptual indexing. Conceptual indexing shows effective results when large knowledge bases are available. But concept detection is not always accurate, and errors limit the benefit of using concepts. One solution to this problem is to combine different concept detection methods. In this paper, we investigate several ways to combine concept detection methods, both on queries and documents, within the framework of the language modeling approach to IR. Our experiments show that our model fusion improves the standard language model by up to 17% in mean average precision.

Loic Maisonnasse, Eric Gaussier, Jean-Pierre Chevallet
Graded-Inclusion-Based Information Retrieval Systems

This paper investigates the use of fuzzy logic mechanisms coming from the database community, namely graded inclusions, to model the information retrieval process. In this framework, documents and queries are represented by fuzzy sets, which are paired with operations like fuzzy implications and T-norms. Through different experiments, it is shown that only some among the wide range of fuzzy operations are relevant for information retrieval. When appropriate settings are chosen, it is possible to mimic classical systems, thus yielding results rivaling those of state-of-the-art systems. These positive results validate the proposed approach, while negative ones give some insights on the properties needed by such a model. Moreover, this paper shows the added-value of this graded inclusion-based model, which gives new and theoretically grounded ways for a user to easily weight his query terms, to include negative information in his queries, or to expand them with related terms.

Patrick Bosc, Vincent Claveau, Olivier Pivert, Laurent Ughetto
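As a hedged illustration of graded inclusion (the operator choices below are assumptions, not necessarily those the paper recommends), the degree to which a fuzzy query set is included in a fuzzy document set can be computed by applying a fuzzy implication term by term and aggregating with a T-norm:

```python
def lukasiewicz_implication(a, b):
    """Łukasiewicz implication: one of several fuzzy implications one could plug in."""
    return min(1.0, 1.0 - a + b)

def graded_inclusion(query, document, implication=lukasiewicz_implication):
    """Degree to which the fuzzy query set is included in the fuzzy document set.
    Both are dicts term -> membership degree in [0, 1]; the minimum T-norm
    aggregates the per-term implication degrees (an assumption of this sketch)."""
    degrees = [implication(mu_q, document.get(term, 0.0))
               for term, mu_q in query.items()]
    return min(degrees) if degrees else 1.0

query = {"fuzzy": 1.0, "retrieval": 0.6}                   # user-weighted query terms
document = {"fuzzy": 0.8, "retrieval": 0.9, "logic": 0.5}  # e.g. normalized tf weights
print(graded_inclusion(query, document))                   # 0.8
```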
Multidimensional Relevance: A New Aggregation Criterion

In this paper, a new model for aggregating multiple criteria evaluations for relevance assessment is proposed. An information retrieval context is considered, where relevance is modelled as a multidimensional property of documents. In the paper, the proposed aggregation operator is applied to define a model for personalized Information Retrieval (IR), in which four criteria are considered in order to assess document relevance: aboutness, coverage, appropriateness and reliability.

The originality of this approach lies in the aggregation of the considered criteria in a prioritized way, by considering the existence of a prioritization relationship over the criteria. Such a prioritization is modeled by making the weights associated with a criterion dependent upon the satisfaction of the higher-priority criteria. This way, it is possible to take into account the fact that the weight of a less important criterion should be proportional to the satisfaction degree of the more important criterion.

In the paper, some preliminary experimental results are also reported.

Célia da Costa Pereira, Mauro Dragoni, Gabriella Pasi
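A minimal sketch of prioritized aggregation, assuming the common "prioritized scoring" form in which each criterion's weight is the product of the satisfaction degrees of all higher-priority criteria; the paper's exact operator and the example numbers may differ.

```python
def prioritized_aggregation(satisfactions):
    """Illustrative prioritized aggregation: criteria are given in decreasing
    priority order (e.g. aboutness, coverage, appropriateness, reliability).
    Each criterion's weight is the product of the satisfaction degrees of all
    more important criteria, so a poorly satisfied high-priority criterion
    damps the influence of everything below it."""
    score, weight = 0.0, 1.0
    for s in satisfactions:
        score += weight * s
        weight *= s
    return score

# Document A satisfies the top-priority criterion well, document B does not.
doc_a = [0.9, 0.7, 0.8, 0.6]
doc_b = [0.3, 0.9, 0.9, 0.9]
print(prioritized_aggregation(doc_a), prioritized_aggregation(doc_b))  # A outranks B
```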

Evaluation

Using Multiple Query Aspects to Build Test Collections without Human Relevance Judgments

Collecting relevance judgments (qrels) is an especially challenging part of building an information retrieval test collection. This paper presents a novel method for creating test collections by offering a substitute for relevance judgments. Our method is based on an old idea in IR: a single information need can be represented by many query articulations. We call different articulations of a particular need query aspects. By combining the top k documents retrieved by a single system for multiple query aspects, we build judgment-free qrels whose rank ordering of IR systems correlates highly with rankings based on human relevance judgments.

Miles Efron
If I Had a Million Queries

As document collections grow larger, the information needs and relevance judgments in a test collection must be well-chosen within a limited budget to give the most reliable and robust evaluation results. In this work we analyze a sample of queries categorized by length and corpus-appropriateness to determine the right proportion needed to distinguish between systems. We also analyze the appropriate division of labor between developing topics and making relevance judgments, and show that only a small, biased sample of queries with sparse judgments is needed to produce the same results as a much larger sample of queries.

Ben Carterette, Virgil Pavlu, Evangelos Kanoulas, Javed A. Aslam, James Allan
The Combination and Evaluation of Query Performance Prediction Methods

In this paper, we examine a number of newly applied methods for combining pre-retrieval query performance predictors in order to obtain a better prediction of the query’s performance. However, in order to adequately and appropriately compare such techniques, we critically examine the current evaluation methodology and show how using linear correlation coefficients (i) does not provide an intuitive measure indicative of a method’s quality, (ii) can provide a misleading indication of performance, and (iii) overstates the performance of combined methods. To address this, we extend the current evaluation methodology to include cross validation, report a more intuitive and descriptive statistic, and apply statistical testing to determine significant differences. During the course of a comprehensive empirical study over several TREC collections, we evaluate nineteen pre-retrieval predictors and three combination methods.

Claudia Hauff, Leif Azzopardi, Djoerd Hiemstra

Opinion Detection

Investigating Learning Approaches for Blog Post Opinion Retrieval

Blog post opinion retrieval is the problem of identifying posts which express an opinion about a particular topic. Usually the problem is solved using a three-step process in which relevant posts are first retrieved, then opinion scores are generated for each document, and finally the opinion and relevance scores are combined to produce a single ranking. In this paper, we study the effectiveness of classification and rank learning techniques for solving the blog post opinion retrieval problem. We have chosen not to rely on external lexicons of opinionated terms, but investigate to what extent the list of opinionated terms can be mined from the same corpus of relevance/opinion assessments that is used to train the retrieval system. We compare popular feature selection methods such as the weighted log likelihood ratio and mutual information for use both in selecting terms for training an opinionated document classifier and as term weights for generating simpler (non-learning-based) aggregate opinion scores for documents. We thereby analyze what performance gains result from learning in the opinion detection phase. Furthermore, we compare different learning-based and non-learning-based methods for combining relevance and opinion information in order to generate a ranked list of opinionated posts, thereby investigating the effect of learning on the ranking phase.

Shima Gerani, Mark J. Carman, Fabio Crestani
Integrating Proximity to Subjective Sentences for Blog Opinion Retrieval

Opinion finding is a challenging retrieval task, where it has been shown that it is especially difficult to improve over a strongly performing topic-relevance baseline. In this paper, we propose a novel approach for opinion finding, which takes into account the proximity of query terms to subjective sentences in a document. We adapt two state-of-the-art opinion detection techniques to identify subjective sentences from the retrieved documents. Our first technique uses the OpinionFinder toolkit to classify the subjectiveness of sentences in a document. Our second technique uses an automatically generated dictionary of subjective terms derived from the document collection itself to identify the most subjective sentences in a document. We extend the Divergence From Randomness (DFR) proximity model to integrate the proximity of query terms to the subjective sentences identified by either of the proposed techniques. We evaluate these techniques on five different strong baselines across two different query datasets from the TREC Blog track. We show that we can significantly improve over the baselines and that, in several settings, our proposed techniques can at least match the top performing systems at the TREC Blog track.

Rodrygo L. T. Santos, Ben He, Craig Macdonald, Iadh Ounis
Adapting Naive Bayes to Domain Adaptation for Sentiment Analysis

In the sentiment analysis community, supervised learning techniques have been shown to perform very well. When transferred to another domain, however, a supervised sentiment classifier often performs extremely poorly. This is the so-called domain-transfer problem. In this work, we attempt to attack this problem by making maximum use of both the old-domain data and the unlabeled new-domain data. To leverage knowledge from the old-domain data, we propose an effective measure, Frequently Co-occurring Entropy (FCE), to pick out generalizable features that occur frequently in both domains and have similar occurring probability. To gain knowledge from the new-domain data, we propose Adapted Naïve Bayes (ANB), a weighted transfer version of the Naive Bayes classifier. The experimental results indicate that the proposed approach can improve the performance of the base classifier dramatically, and even provide much better performance than the transfer-learning baseline, i.e. the Naive Bayes Transfer Classifier (NBTC).

Songbo Tan, Xueqi Cheng, Yuefen Wang, Hongbo Xu

Web IR

PathRank: Web Page Retrieval with Navigation Path

This paper describes a path-based method to use the multi-step navigation information discovered from website structures for web page ranking. Use of hyperlinks to enhance page ranking has been widely studied. The underlying assumption is that hyperlinks convey recommendations. Although this technique has been used successfully in global web search, it produces poor results for website search, because the majority of the hyperlinks in local websites are used to organize information and convey no recommendations. This paper defines the Hierarchical Navigation Path (HNP) as a new resource to exploit these hyperlinks for improved web search. HNP is composed of multi-step hyperlinks in visitors’ website navigation. It provides indications of the content of the destination page. The HierPathExt algorithm is given to extract HNPs in local websites. Then, the PathRank algorithm is created to use HNPs for web page retrieval. The experiments show that our approach results in significant improvements over existing solutions.

Jianqiang Li, Yu Zhao
Query Expansion Using External Evidence

Automatic query expansion may be used in document retrieval to improve search effectiveness. Traditional query expansion methods are based on the document collection itself. For example, pseudo-relevance feedback (PRF) assumes that the top retrieved documents are relevant, and uses the terms extracted from those documents for query expansion. However, there are other sources of evidence that can be used for expansion, some of which may give better search results with greater efficiency at query time. In this paper, we use external evidence, especially hints obtained from external web search engines, to expand the original query. We explore six different methods using search engine query logs, snippets and search result documents. We conduct extensive experiments, with state-of-the-art PRF baselines and careful parameter tuning, on three TREC collections: AP, WT10g and GOV2. Log-based methods do not show consistent significant gains, despite being very efficient at query time. Snippet-based expansion, using the summaries provided by an external search engine, provides significant effectiveness gains with good efficiency at query time.

Zhijun Yin, Milad Shokouhi, Nick Craswell
Selective Application of Query-Independent Features in Web Information Retrieval

The application of query-independent features, such as PageRank, can boost the retrieval effectiveness of a Web Information Retrieval (IR) system. In some previous works, a query-independent feature is uniformly applied to all queries. Other works predict the most useful feature based on the query type. However, the accuracy of the current query type prediction methods is not high. In this paper, we investigate a novel approach that applies the most appropriate query-independent feature on a per-query basis, and does not require the knowledge of the query type. The approach is based on an estimate of the divergence between the retrieved document scores’ distributions prior to, and after the integration of a query-independent feature. We evaluate our approach on the TREC .GOV Web test collection and the mixed topic sets from TREC 2003 & 2004 Web search tasks. Our experimental results demonstrate that the selective application of a query-independent feature on a per-query basis is very effective and robust. In particular, it outperforms a query type prediction-based method, even when this method is simulated with a 100% query type prediction accuracy.

Jie Peng, Iadh Ounis
Measuring the Search Effectiveness of a Breadth-First Crawl

Previous scalability experiments found that early precision improves as collection size increases. However, that was under the assumption that a collection’s documents are all sampled with uniform probability from the same population. We contrast this to a large breadth-first web crawl, an important scenario in real-world Web search, where the early documents have quite different characteristics from the later documents. Having observed that NDCG@100 (measured over a set of reference queries) begins to plateau in the initial stages of the crawl, we investigate a number of possible reasons for this behaviour. These include the web-pages themselves, the metric used to measure retrieval effectiveness as well as the set of relevance judgements used.

Dennis Fetterly, Nick Craswell, Vishwa Vinay

Representation

Using Contextual Information to Improve Search in Email Archives

In this paper we address the task of finding topically relevant email messages in public discussion lists. We make two important observations. First, email messages are not isolated, but are part of a larger online environment. This context, existing on different levels, can be incorporated into the retrieval model. We explore the use of thread, mailing list, and community content levels, by expanding our original query with terms from these sources. We find that query models based on contextual information improve retrieval effectiveness. Second, email is a relatively informal genre, and therefore offers scope for incorporating techniques previously shown useful in searching user-generated content. Indeed, our experiments show that using query-independent features (email length, thread size, and text quality), implemented as priors, results in further improvements.

Wouter Weerkamp, Krisztian Balog, Maarten de Rijke
Part of Speech Based Term Weighting for Information Retrieval

Automatic language processing tools typically assign to terms so-called ‘weights’ corresponding to the contribution of terms to information content. Traditionally, term weights are computed from lexical statistics, e.g., term frequencies. We propose a new type of term weight that is computed from part of speech (POS) n-gram statistics. The proposed POS-based term weight represents how informative a term is in general, based on the ‘POS contexts’ in which it generally occurs in language. We suggest five different computations of POS-based term weights by extending existing statistical approximations of term information measures. We apply these POS-based term weights to information retrieval, by integrating them into the model that matches documents to queries. Experiments with two TREC collections and 300 queries, using TF-IDF and BM25 as baselines, show that integrating our POS-based term weights into retrieval always leads to gains (up to +33.7% over the baseline). Additional experiments with a different retrieval model as baseline (a Language Model with Dirichlet priors smoothing) and our best performing POS-based term weight show consistent retrieval gains across the whole smoothing range of the baseline.

Christina Lioma, Roi Blanco
Word Particles Applied to Information Retrieval

Document retrieval systems conventionally use words as the basic unit of representation, a natural choice since words are primary carriers of semantic information. In this paper we propose the use of a different, phonetically defined unit of representation that we call “particles”. Particles are phonetic sequences that do not possess meaning. Both documents and queries are converted from their standard word-based form into sequences of particles. Indexing and retrieval are performed with particles. Experiments show that this scheme is capable of achieving retrieval performance comparable to that of words when the text in the documents and queries is clean, and can result in significantly improved retrieval when it is noisy.

Evandro B. Gouvêa, Bhiksha Raj
“They Are Out There, If You Know Where to Look”: Mining Transliterations of OOV Query Terms for Cross-Language Information Retrieval

It is well known that the use of a good Machine Transliteration system improves the retrieval performance of Cross-Language Information Retrieval (CLIR) systems when the query and document languages have different orthography and phonetic alphabets. However, the effectiveness of a Machine Transliteration system in CLIR is limited by its ability to produce relevant transliterations, i.e. those transliterations which are actually present in the relevant documents. In this work, we propose a new approach to the problem of finding transliterations for out-of-vocabulary query terms. Instead of “generating” the transliterations using a Machine Transliteration system, we “mine” them, using a transliteration similarity model, from the top CLIR results for the query. We treat the query and each of the top results as “comparable” documents and search for transliterations in these comparable document pairs. We demonstrate the effectiveness of our approach using queries in two languages from two different linguistic families to retrieve English documents from two standard CLEF collections. We also compare our results with those of a state-of-the-art Machine Transliteration system.

Raghavendra Udupa, Saravanan K, Anton Bakalov, Abhijit Bhole

Clustering / Categorization

E-Mail Classification for Phishing Defense

We discuss a classification-based approach for filtering phishing messages in an e-mail stream. Upon arrival, various features of every e-mail are extracted. This forms the basis of a classification process which detects potentially harmful phishing messages. We introduce various new features for identifying phishing messages and rank established as well as newly introduced features according to their significance for this classification problem. Moreover, in contrast to classical binary classification approaches (spam vs. not spam), a more refined ternary classification approach for filtering e-mail data is investigated which automatically distinguishes three message types: ham (solicited e-mail), spam, and phishing.

Experiments with representative data sets illustrate that our approach yields better classification results than existing phishing detection methods. Moreover, the direct ternary classification proposed is compared to a sequence of two binary classification processes. Direct one-step ternary classification is not only more efficient, but is also shown to achieve better accuracy than repeated binary classification.

Wilfried N. Gansterer, David Pölz
Multi-facet Rating of Product Reviews

Online product reviews are becoming increasingly available, and are being used more and more frequently by consumers in order to choose among competing products. Tools that rank competing products in terms of the satisfaction of consumers that have purchased the product before are thus also becoming popular. We tackle the problem of rating (i.e., attributing a numerical score of satisfaction to) consumer reviews based on their textual content. We here focus on multi-facet review rating, i.e., on the case in which the review of a product (e.g., a hotel) must be rated several times, according to several aspects of the product (for a hotel: cleanliness, centrality of location, etc.). We explore several aspects of the problem, with special emphasis on how to generate vectorial representations of the text by means of POS tagging, sentiment analysis, and feature selection for ordinal regression learning. We present the results of experiments conducted on a dataset of more than 15,000 reviews that we have crawled from a popular hotel review site.

Stefano Baccianella, Andrea Esuli, Fabrizio Sebastiani
Exploiting Surface Features for the Prediction of Podcast Preference

Podcasts display an unevenness characteristic of domains dominated by user generated content, resulting in potentially radical variation of the user preference they enjoy. We report on work that uses easily extractable surface features of podcasts in order to achieve solid performance on two podcast preference prediction tasks: classification of preferred vs. non-preferred podcasts and ranking podcasts by level of preference. We identify features with good discriminative potential by carrying out manual data analysis, resulting in a refinement of the indicators of an existing podcast preference framework. Our preference prediction is useful for topic-independent ranking of podcasts, and can be used to support download suggestion or collection browsing.

Manos Tsagkias, Martha Larson, Maarten de Rijke

Distributed IR

A Topic-Based Measure of Resource Description Quality for Distributed Information Retrieval

The aim of query-based sampling is to obtain a sufficient, representative sample of an underlying (text) collection. Current measures for assessing sample quality are too coarse-grained to be informative. This paper outlines a measure of finer granularity based on probabilistic topic models of text. The assumption we make is that a representative sample should capture the broad themes of the underlying text collection. If these themes are not captured, then resource selection will be affected in terms of performance, coverage and reliability. For example, resource selection algorithms that require extrapolation from a small sample of indexed documents to determine which collections are most likely to hold relevant documents may be affected by samples which do not reflect the topical density of a collection. To address this issue we propose to measure the relative entropy between topics obtained in a sample with respect to the complete collection. Topics are both modelled from the collection and inferred in the sample using latent Dirichlet allocation. The paper outlines an analysis and evaluation of this methodology across a number of collections and sampling algorithms.

Mark Baillie, Mark J. Carman, Fabio Crestani
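The proposed quality measure boils down to a relative entropy between the sample's and the collection's topic distributions. The sketch below assumes LDA inference has already produced topic proportions and only shows the KL computation; the numbers and the epsilon smoothing are invented for illustration.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    """Relative entropy D(p || q) between two topic distributions,
    with a small epsilon to avoid division by zero (an assumption)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Topic proportions of the full collection vs. two query-based samples
# (numbers invented; in practice they come from LDA inference).
collection  = [0.40, 0.30, 0.20, 0.10]
good_sample = [0.38, 0.32, 0.19, 0.11]   # captures the collection's themes
bad_sample  = [0.05, 0.05, 0.10, 0.80]   # misses the dominant topics
print(kl_divergence(collection, good_sample))  # small divergence
print(kl_divergence(collection, bad_sample))   # large divergence
```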
Simple Adaptations of Data Fusion Algorithms for Source Selection

Source selection deals with the problem of selecting the most appropriate information sources from a set of, usually non-intersecting, available document collections. On the other hand, data fusion techniques (also known as metasearch techniques) deal with the problem of aggregating the results from multiple, usually completely or partly intersecting, document sources in order to provide wider coverage and a more effective retrieval result. In this paper we study some simple adaptations of traditional data fusion algorithms for the task of source selection in uncooperative distributed information retrieval environments. The experiments demonstrate that the performance of data fusion techniques at source selection tasks is comparable with that of state-of-the-art source selection algorithms, and that they are often able to surpass them.

Georgios Paltoglou, Michail Salampasis, Maria Satratzemi
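A hedged sketch of one such adaptation, assuming a CombSUM-style rule: documents sampled from each collection are retrieved through a centralized index, and a collection's selection score is the sum of its sampled documents' normalized scores. The fusion rule, the normalization, and the data are assumptions of the example.

```python
from collections import defaultdict

def select_sources(ranked_sample, top_n=3):
    """Illustrative CombSUM-style source selection: `ranked_sample` is a list
    of (collection_id, normalized_score) pairs from retrieval over the sampled
    documents of all collections; each collection's selection score is the sum
    of its documents' scores."""
    scores = defaultdict(float)
    for collection_id, score in ranked_sample:
        scores[collection_id] += score
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]

# Hypothetical centralized-sample results for one query.
sample_results = [("news", 0.9), ("news", 0.7), ("web", 0.8),
                  ("blogs", 0.4), ("web", 0.3), ("news", 0.2)]
print(select_sources(sample_results, top_n=2))  # news (1.8...) then web (1.1...)
```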
Document Compaction for Efficient Query Biased Snippet Generation

Current web search engines return query-biased snippets for each document they list in a result set. For efficiency, search engines operating on large collections need to cache snippets for common queries, and to cache documents to allow fast generation of snippets for uncached queries. To improve the hit rate on a document cache during snippet generation, we propose and evaluate several schemes for reducing document size, hence increasing the number of documents in the cache. In particular, we argue against further improvements to document compression, and argue for schemes that prune documents based on the a priori likelihood that a sentence will be used as part of a snippet for a given document. Our experiments show that if documents are reduced to less than half their original size, 80% of snippets generated are identical to those generated from the original documents. Moreover, as the pruned, compressed surrogates are smaller, 3-4 times as many documents can be cached.

Yohannes Tsegay, Simon J. Puglisi, Andrew Turpin, Justin Zobel

Short Papers

Organizing Suggestions in Autocompletion Interfaces

We describe two user studies that investigate organization strategies of autocompletion in a known-item search task: searching for terms taken from a thesaurus. In Study 1, we explored ways of grouping term suggestions from two different thesauri (TGN and WordNet) and found that different thesauri may require different organization strategies. Users found Group organization more appropriate for location names from TGN, while Alphabetical works better for object names from WordNet. In Study 2, we compared three different organization strategies (Alphabetical, Group and Composite) for location name search tasks. The results indicate that for TGN, autocompletion interfaces help improve the quality of keywords, and that Group and Composite organization help users search faster and are perceived as easier to understand and to use than Alphabetical.

Alia Amin, Michiel Hildebrand, Jacco van Ossenbruggen, Vanessa Evers, Lynda Hardman
Building a Graph of Names and Contextual Patterns for Named Entity Classification

An algorithm is presented that bootstraps the acquisition of large dictionaries of entity types (names) and pattern types from a few seeds and a large unannotated corpus. The algorithm iteratively builds a bigraph of entities and collocated patterns by querying the text. Several classes simultaneously compete to label the entity types. Different experiments have been carried out to acquire resources from a 1GB corpus of Spanish news. The usefulness of the acquired list of entity types for the task of Name Classification has also been evaluated, with good results for a weakly supervised method.

César de Pablo-Sánchez, Paloma Martínez
Combination of Documents Features Based on Simulated Click-through Data

Many different ranking algorithms based on content and context have been used in web search engines to find pages based on a user query. Furthermore, to achieve better performance some new solutions combine different algorithms. In this paper we use simulated click-through data to learn how to combine many content and context features of web pages. This method is simple and practical to use with actual click-through data in a live search engine. The proposed approach is evaluated using the LETOR benchmark, and we found it is competitive with Ranking SVM based on user judgments.

Ali Mohammad Zareh Bidoki, James A. Thom
Discovering Association Rules on Experiences from Large-Scale Blog Entries

This paper proposes a method for discovering association rules on people’s experiences extracted from a large-scale set of blog entries. In our definition, a person’s experience can be expressed by five attributes: time, location, activity, opinion and emotion. The system implementing our proposed method generates and ranks association rules between attributes by applying several interestingness measures proposed in the area of data mining to the experiences extracted from 48 million blog entries. An experiment shows that the system successfully mines people’s activities and emotions which are specific to location and time period.

Takeshi Kurashima, Ko Fujimura, Hidenori Okuda
Extracting Geographic Context from the Web: GeoReferencing in MyMoSe

Many Web pages are clearly related to specific locations. Identifying this geographic focus is the cornerstone of the next generation of geographic context-aware search services. This paper presents a multistage method for assigning a geographic focus to Web pages (GeoReferencing), using several heuristics for toponym disambiguation and a scoring function for focus determination. We provide an experimental methodology for evaluating the accuracy of the system with Web pages in English and Spanish. Finally, we have obtained promising results, reaching an accuracy of over 70% at town-level resolution.

Álvaro Zubizarreta, Pablo de la Fuente, José M. Cantera, Mario Arias, Jorge Cabrero, Guido García, César Llamas, Jesús Vegas
What Else Is There? Search Diversity Examined

This paper describes a study on diversity in image search results. One of the first test collections explicitly built to study diversity – the ImageCLEFPhoto 2008 collection – was used in an evaluation exercise in the summer of 2008. Analyzing 200 of the runs submitted by 24 research groups enabled the relationship between precision and result diversity to be studied. In addition, the level of diversity present in search results produced by retrieval systems built without explicit support for diversity was computed. The remaining potential to improve on diversity was calculated and finally, a significant preference by users for diverse search results was shown.

Mark Sanderson, Jiayu Tang, Thomas Arni, Paul Clough
Using Second Order Statistics to Enhance Automated Image Annotation

We examine whether a traditional automated annotation system can be improved by using background knowledge. By "traditional" we mean any machine learning approach combined with image analysis techniques. We use as the baseline for our experiments the work done by Yavlinsky et al. [1], who deployed non-parametric density estimation. We observe that probabilistic image analysis by itself is not enough to describe the rich semantics of an image. Our hypothesis is that more accurate annotations can be produced by introducing additional knowledge in the form of statistical co-occurrence of terms. This is provided by the context of images that otherwise independent keyword generation would miss. We test our algorithm with two different datasets: Corel 5k and ImageCLEF 2008. For the Corel 5k dataset, we obtain significantly better results, while our algorithm appears in the top quartile of all methods submitted to ImageCLEF 2008.

Ainhoa Llorente, Stefan Rüger
Classifying and Characterizing Query Intent

Understanding the intent underlying users’ queries may help personalize search results and improve user satisfaction. In this paper, we develop a methodology for using ad clickthrough logs, query specific information, and the content of search engine result pages to study characteristics of query intents, specially commercial intent. The findings of our study suggest that ad clickthrough features, query features, and the content of search engine result pages are together effective in detecting query intent. We also study the effect of query type and the number of displayed ads on the average clickthrough rate. As a practical application of our work, we show that modeling query intent can improve the accuracy of predicting ad clickthrough for previously unseen queries.

Azin Ashkan, Charles L. A. Clarke, Eugene Agichtein, Qi Guo
Design and Evaluation of a University-Wide Expert Search Engine

We present an account of designing and evaluating a university-wide expert search engine. We performed system-based evaluation to determine the optimal retrieval settings and an extensive user-based evaluation with three different user groups: scientific researchers, students, and outside visitors of the website looking for experts. Our search engine significantly outperformed the old search system in terms of effectiveness, efficiency, and user satisfaction.

Ruud Liebregts, Toine Bogers
A Study of the Impact of Index Updates on Distributed Query Processing for Web Search

Query processing in Web search engines today is mainly performed within a single site or data center, which must scale as the Web grows and users expect fast answers to their queries. Constraints on the size and cost of data centers, however, may limit the scalability of search engines. Multi-site search engines that perform distributed query processing represent one way to overcome such constraints. Each site processes locally as many queries as possible, keeping latency low by not contacting remote sites; whether a query is forwarded to remote sites depends on their document collections. Multi-site search engines pose several new challenges. When a site updates its index, it has to inform the other sites, but the updates are not instantaneous due to the volume of data exchanged or possible network failures. During the period in which index inconsistencies exist across sites, queries may not be forwarded optimally. In this work, we investigate the impact of index inconsistencies on a distributed query processing algorithm in the presence of index updates, and we observe that delayed propagation of index information reduces the effectiveness of query processing, because queries are less likely to be routed optimally.

Charalampos Sarigiannis, Vassilis Plachouras, Ricardo Baeza-Yates
Generic and Spatial Approaches to Image Search Results Diversification

We propose one generic diversity algorithm and two novel spatial diversity algorithms for (image) search result diversification. The outputs of the algorithms are compared with the standard search results (which contain no diversity component) and found to be promising. In particular, the geometric mean spatial diversity algorithm manages to achieve good geographical diversity without significantly reducing precision. To the best of our knowledge, such a quantitative evaluation of spatial diversity algorithms for context-based image retrieval is new to the community.

Monica Lestari Paramita, Jiayu Tang, Mark Sanderson
Studying Query Expansion Effectiveness

Query expansion is an effective technique for improving retrieval performance in ad-hoc retrieval. However, query expansion can also fail, leading to a degradation of retrieval performance. In this paper, we aim to provide a better understanding of query expansion through an empirical study of which factors can affect query expansion and how they do so. We examine how the quality of the query, measured by the first-pass retrieval performance, is related to the effectiveness of query expansion. Our experimental results show only a moderate relation between them, indicating that the first-pass retrieval has only a moderate impact on the effectiveness of query expansion. Our results also show that the feedback documents should not only be relevant, but should also have a dedicated interest in the topic.

Ben He, Iadh Ounis
Correlation of Term Count and Document Frequency for Google N-Grams

For bounded datasets such as the TREC Web Track (WT10g) the computation of term frequency (TF) and inverse document frequency (IDF) is not difficult. However, when the corpus is the entire web, direct IDF calculation is impossible and values must instead be estimated. Most available datasets provide values for term count (TC), meaning the number of times a certain term occurs in the entire corpus. Intuitively this value is different from document frequency (DF), the number of documents (e.g., web pages) a certain term occurs in. We investigate the relationship between TC and DF values of terms occurring in the Web as Corpus (WaC), and also the similarity between TC values obtained from the WaC and the Google N-gram dataset. A strong correlation between the two would give us confidence in using the Google N-grams to estimate accurate IDF values, which, for example, is the foundation for generating well-performing lexical signatures based on the TF-IDF scheme. Our results show a very strong correlation between TC and DF within the WaC, with Spearman’s ρ ≥ 0.8 (p ≤ 2.2×10⁻¹⁶), and a high similarity between TC values from the WaC and the Google N-grams.
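A minimal Python sketch of the two quantities involved: the rank correlation between TC and DF (via SciPy's spearmanr) and an IDF estimate computed from term counts alone; the toy counts and the assumed collection size stand in for the WaC and Google N-gram statistics.

    # Sketch: measure the TC/DF association with Spearman's rho and, if the
    # correlation is strong, estimate IDF from term counts alone.
    from scipy.stats import spearmanr
    import math

    term_count = {"retrieval": 1200, "the": 980000, "toponym": 45, "index": 5300}
    doc_freq   = {"retrieval": 800,  "the": 250000, "toponym": 40, "index": 3100}

    terms = sorted(term_count)
    rho, p_value = spearmanr([term_count[t] for t in terms],
                             [doc_freq[t] for t in terms])
    print(f"Spearman rho={rho:.2f}, p={p_value:.3g}")

    # If TC and DF are strongly rank-correlated, TC can act as a DF surrogate:
    N = 1_000_000  # assumed collection size
    for t in terms:
        idf_from_tc = math.log(N / (1 + term_count[t]))
        print(t, round(idf_from_tc, 2))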

Martin Klein, Michael L. Nelson
A Cost-Aware Strategy for Query Result Caching in Web Search Engines

Search engines and large-scale IR systems need to cache query results for efficiency and scalability purposes. In this study, we propose to explicitly incorporate query costs into the static caching policy. To this end, a query’s cost is represented by its execution time, which involves the CPU time to decompress the postings and compute the query-document similarities to obtain the final top-N answers. Simulation results using a large Web crawl and a real query log reveal that the proposed strategy improves overall system performance in terms of total query execution time.
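A minimal Python sketch of a cost-aware static cache fill, ranking candidate queries by past frequency times execution cost instead of frequency alone; the query log, costs, and capacity are assumptions for illustration, not the paper's exact policy.

    # Sketch: fill a static result cache preferring queries whose
    # (frequency x execution cost) is largest, so expensive queries are kept.
    def select_static_cache(query_log, cost_ms, capacity):
        """query_log: dict query -> frequency; cost_ms: dict query -> execution time."""
        gain = {q: freq * cost_ms.get(q, 1.0) for q, freq in query_log.items()}
        ranked = sorted(gain, key=gain.get, reverse=True)
        return set(ranked[:capacity])

    log = {"ecir 2009": 120, "toulouse hotels": 90, "rare disease survey": 15}
    cost = {"ecir 2009": 12.0, "toulouse hotels": 8.0, "rare disease survey": 140.0}
    print(select_static_cache(log, cost, capacity=2))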

Ismail Sengor Altingovde, Rifat Ozcan, Özgür Ulusoy
Quality-Oriented Search for Depression Portals

The problem of low-quality information on the Web is nowhere more important than in the domain of health, where unsound information and misleading advice can have serious consequences. The quality of health web sites can be rated by subject experts against evidence-based guidelines. We previously developed an automated quality rating technique (AQA) for depression websites and showed that it correlated 0.85 with such expert ratings.

In this paper, we use AQA to filter or rerank Google results returned in response to queries relating to depression. We compare this to an unrestricted quality-oriented (AQA-based) focused crawl starting from an Open Directory category, and to a conventional crawl with a manually constructed seed list and inclusion rules. The results show that post-processed Google outperforms the other forms of search engine restricted to the domain of depressive illness on both relevance and quality.

Thanh Tang, David Hawking, Ramesh Sankaranarayana, Kathleen M. Griffiths, Nick Craswell
Evaluation of Text Clustering Algorithms with N-Gram-Based Document Fingerprints

This paper presents a new approach designed to reduce the computational load of existing clustering algorithms by trimming down document size using fingerprinting methods. A thorough evaluation was performed over three different collections, considering four different metrics. The presented approach to document clustering achieved good effectiveness with considerable savings in memory space and computation time.
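A minimal Python sketch of one way to realise such fingerprints: each document is reduced to the k smallest hashes of its word n-grams and documents are compared on fingerprint overlap; the hash function, n, and k are illustrative assumptions, not the paper's exact fingerprinting method.

    # Sketch: shrink each document to a small fingerprint (the k smallest hashes
    # of its word n-grams) and cluster on fingerprint overlap instead of full text.
    import hashlib

    def fingerprint(text, n=3, k=20):
        words = text.lower().split()
        grams = {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
        hashes = sorted(int(hashlib.md5(g.encode()).hexdigest(), 16) for g in grams)
        return set(hashes[:k])

    def fingerprint_similarity(fp_a, fp_b):
        return len(fp_a & fp_b) / len(fp_a | fp_b) if fp_a | fp_b else 0.0

    doc1 = "text clustering with n gram based document fingerprints saves memory"
    doc2 = "document fingerprints based on n gram hashing make text clustering cheaper"
    print(round(fingerprint_similarity(fingerprint(doc1), fingerprint(doc2)), 3))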

Javier Parapar, Álvaro Barreiro
Exploiting Flickr Tags and Groups for Finding Landmark Photos

Many people take pictures of different city landmarks and post them to photo-sharing systems like Flickr. They also add tags and place photos in Flickr groups created around particular themes. Using tags, other people can search for representative landmark images of places of interest. Searching for landmarks using tags, however, returns many non-landmark photos and provides a poor landmark summary for a city. In this paper we propose a new method to identify landmark photos using tags and social Flickr groups. In contrast to similar modern systems, our approach is also applicable when GPS coordinates for photos are not available. A user study shows that the proposed method outperforms state-of-the-art systems for landmark finding.

Rabeeh Abbasi, Sergey Chernov, Wolfgang Nejdl, Raluca Paiu, Steffen Staab
Refining Keyword Queries for XML Retrieval by Combining Content and Structure

The structural heterogeneity and complexity of XML repositories makes query formulation challenging for users who have little knowledge of XML. To assist its users, an XML retrieval system can have a keyword-based interface, relegating the task of combining textual and structural clues to the retrieval algorithm. In this work, we propose an automatic query refinement method to transform a keyword query into structured XML queries that capture the original information need and conform to the underlying XML data. We formulate query generation as a search problem, and show the effectiveness of the method in generating accurate content-and-structure queries.

Desislava Petkova, W. Bruce Croft, Yanlei Diao

Posters

Cover Coefficient-Based Multi-document Summarization

In this paper we present a generic, language-independent multi-document summarization system that forms extracts using the cover coefficient concept. The Cover Coefficient-based Summarizer (CCS) uses similarity between sentences to determine representative sentences. Experiments indicate that CCS is an efficient algorithm able to generate quality summaries online.

Gonenc Ercan, Fazli Can
A Practitioner’s Guide for Static Index Pruning

We compare the term- and document-centric static index pruning approaches as described in the literature and investigate their sensitivity to the scoring functions employed during the pruning and actual retrieval stages.

Ismail Sengor Altingovde, Rifat Ozcan, Özgür Ulusoy
Revisiting N-Gram Based Models for Retrieval in Degraded Large Collections

Traditional retrieval models based on term matching are not effective on collections of degraded documents (the output of OCR or ASR systems, for instance). This paper presents an n-gram-based distributed model for retrieval over large collections of degraded text. Evaluation was carried out with both the TREC Confusion Track and Legal Track collections, showing that the presented approach outperforms, in terms of effectiveness, the classical term-centred approach and most of the participating systems in the TREC Confusion Track.
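A minimal Python sketch of the underlying intuition: matching on character n-grams instead of whole terms keeps retrieval robust to OCR noise, since an error corrupts only a few grams; the value of n, the cosine scoring, and the simulated noise are assumptions, not the paper's model.

    # Sketch: compare query and documents over character n-gram profiles so that
    # an OCR error disturbs only a handful of grams rather than the whole word.
    from collections import Counter
    import math

    def char_ngrams(text, n=4):
        text = " ".join(text.lower().split())
        return Counter(text[i:i + n] for i in range(len(text) - n + 1))

    def cosine(a, b):
        dot = sum(a[g] * b[g] for g in a if g in b)
        norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    query = char_ngrams("confusion track")
    clean = char_ngrams("results on the trec confusion track collection")
    noisy = char_ngrams("results on the trec confusiom tracl< collection")  # simulated OCR noise
    print(round(cosine(query, clean), 3), round(cosine(query, noisy), 3))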

Javier Parapar, Ana Freire, Álvaro Barreiro
A Simple Linear Ranking Algorithm Using Query Dependent Intercept Variables

The LETOR website contains three information retrieval datasets used as a benchmark for testing machine learning ideas for ranking. Participating algorithms are measured using standard IR ranking measures (NDCG, precision, MAP). Like other participating algorithms, we train a linear classifier. In contrast to them, we define an additional free intercept variable for each query. This allows expressing the fact that results for different queries are incomparable for the purpose of determining relevance. The results are slightly better than those of the reported participating algorithms, while the method is significantly simpler.
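A minimal Python sketch of a linear ranker with one extra free variable per query, realised by appending a one-hot query indicator to each feature vector before a least-squares fit; the tiny feature matrix and relevance labels are assumptions for illustration, not LETOR data.

    # Sketch: a linear ranker with a per-query intercept, implemented by adding
    # one-hot query indicator columns to the shared feature matrix.
    import numpy as np

    features = np.array([[0.9, 0.2], [0.4, 0.1],    # two results for query 0
                         [0.8, 0.7], [0.3, 0.6]])   # two results for query 1
    query_id = np.array([0, 0, 1, 1])
    relevance = np.array([1.0, 0.0, 1.0, 0.0])

    n_queries = query_id.max() + 1
    intercepts = np.eye(n_queries)[query_id]           # one-hot per-query columns
    X = np.hstack([features, intercepts])              # shared weights + per-query offsets

    w, *_ = np.linalg.lstsq(X, relevance, rcond=None)  # least-squares fit of the linear model
    scores = X @ w
    print(np.round(scores, 2))                         # per-query ranking scores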

Nir Ailon
Measurement Techniques and Caching Effects

Overall query execution time consists of the time spent transferring data from disk to memory, and the time spent performing actual computation. In any measurement of overall time on a given hardware configuration, the two separate costs are aggregated. This makes it hard to reproduce results and to infer which of the two costs is actually affected by modifications proposed by researchers. In this paper we show that repeated submission of the same query provides a means to estimate the computational fraction of overall query execution time. The advantage of separate measurements is exemplified for a particular optimization that, as it turns out, reduces computational costs only. Finally, by exchanging repeated query terms with surrogates that have similar document frequency, we are able to measure the natural caching effects that arise as a consequence of term repetitions in query logs.

Stefan Pohl, Alistair Moffat
On Automatic Plagiarism Detection Based on n-Grams Comparison

When automatic plagiarism detection is carried out against a reference corpus, a suspicious text is compared to a set of original documents in order to relate the plagiarised text fragments to their potential sources. One of the biggest difficulties in this task is to locate plagiarised fragments that have been modified from the source text (by rewording, insertion or deletion, for example).

The definition of proper text chunks as comparison units for the suspicious and original texts is crucial for the success of this kind of application. Our experiments with the METER corpus show that the best results are obtained when comparing low-level word n-grams (n = {2, 3}).
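A minimal Python sketch of the comparison unit in question: word n-gram containment of a suspicious fragment in a candidate source for n = 2 and n = 3; the texts and the idea of thresholding the containment score are illustrative assumptions.

    # Sketch: flag a suspicious fragment when the share of its word n-grams that
    # also occur in a candidate source (containment) is high.
    def word_ngrams(text, n):
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def containment(suspicious, source, n):
        susp, src = word_ngrams(suspicious, n), word_ngrams(source, n)
        return len(susp & src) / len(susp) if susp else 0.0

    source = "the council approved the new budget after a long and heated debate"
    suspicious = "after a long heated debate the council approved the new budget plan"
    for n in (2, 3):
        print(n, round(containment(suspicious, source, n), 2))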

Alberto Barrón-Cedeño, Paolo Rosso
Exploiting Visual Concepts to Improve Text-Based Image Retrieval

In this paper, we study how to automatically exploit visual concepts in a text-based image retrieval task. First, we use Forests of Fuzzy Decision Trees (FFDTs) to automatically annotate images with visual concepts. Second, optionally using WordNet, we match visual concepts to the textual query. Finally, we filter the text-based image retrieval result list using the FFDTs. This study is performed in the context of two tasks of the CLEF 2008 international campaign: the Visual Concept Detection Task (VCDT) (17 visual concepts) and the photographic retrieval task (ImageCLEFphoto) (39 queries and 20k images). Our best VCDT run is the 4th best of the 53 submitted runs. The ImageCLEFphoto results show a clear improvement, in terms of precision at 20, when using the visual concepts explicitly appearing in the query.

Sabrina Tollari, Marcin Detyniecki, Christophe Marsala, Ali Fakeri-Tabrizi, Massih-Reza Amini, Patrick Gallinari
Choosing the Best MT Programs for CLIR Purposes – Can MT Metrics Be Helpful?

This paper describes the use of MT metrics for choosing the best candidate MT-based query translation resources. Our main metric is METEOR, but we also use NIST and BLEU. The language pair of our evaluation is English → German, because MT metrics still do not offer very many language pairs for comparison. We evaluated the translations of CLEF 2003 topics produced by four different MT programs with the MT metrics and compared the metric scores to the results of CLIR runs. Our results show that for long topics the correlations between the achieved MAPs and the MT metrics are high (0.85-0.94), and for short topics lower but still clear (0.63-0.72). Overall it seems that MT metrics can easily distinguish the worst MT programs from the best ones, but smaller differences are not so clearly shown. Some intrinsic properties of MT metrics are also not well suited to CLIR resource evaluation, because certain aspects of translation quality, especially word order, are not significant in CLIR.

Kimmo Kettunen
Entropy-Based Static Index Pruning

We propose a new entropy-based algorithm for static index pruning. The algorithm computes an importance score for each document in the collection based on the entropy of each term. A threshold is set according to the desired level of pruning, and all postings associated with documents that score below this threshold are removed from the index, i.e. documents are removed from the collection. We compare this entropy-based approach with previous work by Carmel et al. [1], for both the Financial Times (FT) and Los Angeles Times (LA) collections. Experimental results reveal that the entropy-based approach has superior performance on the FT collection, for both precision at 10 (P@10) and mean average precision (MAP). However, for the LA collection, Carmel’s method is generally superior in MAP. The variation in performance across collections suggests that a hybrid algorithm incorporating elements of both methods might have more stable performance across collections. A simple hybrid method is tested, in which a first 10% of pruning is performed using the entropy-based method and further pruning is performed by Carmel’s method. Experimental results show that the hybrid algorithm slightly improves on Carmel’s, but performs significantly worse than the entropy-based method on the FT collection.
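A minimal Python sketch of one plausible instantiation of an entropy-style document score for pruning, where p(d|t) is the share of term t's occurrences that fall in document d and a document's score is the mean of its terms' entropy contributions; the formula, the toy collection, and the threshold choice are assumptions, as the paper's exact scoring is not reproduced here.

    # Sketch (assumed formula): score(d) = mean over terms t in d of -p(d|t) log p(d|t),
    # then drop the lowest-scoring documents from the index.
    import math
    from collections import defaultdict

    docs = {"d1": "web search caching caching", "d2": "web retrieval evaluation",
            "d3": "evaluation of search caching"}

    tf = {d: defaultdict(int) for d in docs}
    for d, text in docs.items():
        for w in text.split():
            tf[d][w] += 1
    term_total = defaultdict(int)
    for d in tf:
        for w, c in tf[d].items():
            term_total[w] += c

    def doc_score(d):
        contribs = []
        for w, c in tf[d].items():
            p = c / term_total[w]                      # p(d|t)
            contribs.append(-p * math.log(p) if p < 1.0 else 0.0)
        return sum(contribs) / len(contribs)

    scores = {d: doc_score(d) for d in docs}
    threshold = sorted(scores.values())[0]             # e.g. prune the bottom document
    pruned = {d for d, s in scores.items() if s <= threshold}
    print(scores, pruned)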

Lei Zheng, Ingemar J. Cox
Representing User Navigation in XML Retrieval with Structural Summaries

This poster presents a novel way to represent user navigation in XML retrieval using collection statistics from XML summaries. Currently, developing user navigation models in XML retrieval is costly and the models are specific to collected user assessments. We address this problem by proposing summary navigation models which describe user navigation in terms of XML summaries. We develop our proposal using assessments collected in the interactive track at INEX 2006. Our preliminary results suggest that summary navigation models can represent user navigation in a way that is effective for evaluation and allows economic re-use of assessments for new tasks and collections.

Mir Sadek Ali, Mariano P. Consens, Birger Larsen
ESUM: An Efficient System for Query-Specific Multi-document Summarization

In this paper, we address the problem of generating a query-specific extractive summary in an efficient manner for a given set of documents. In many current solutions, the entire collection of documents is modeled as a single graph, which is used for summary generation. Unlike these approaches, we model each individual document as a graph and generate a query-specific summary for it. These individual summaries are then intelligently combined to produce the final summary. This approach greatly reduces the computational complexity.

C. Ravindranath Chowdary, P. Sreenivasa Kumar
Using WordNet’s Semantic Relations for Opinion Detection in Blogs

Opinion detection in blogs has always been a challenge for researchers. One of the challenges is to find documents that specifically contain opinions on the user’s information need. This requires text processing at the sentence level rather than at the document level. In this paper, we propose an opinion detection approach that addresses this problem by processing documents at the sentence level, using different WordNet semantic similarity relations between sentence words and a list of weighted query words expanded through the Wikipedia encyclopedia. According to initial results, our approach performs well, with a MAP of 0.28 and P@10 of 0.64, an improvement of 27% over the baseline results. TREC Blog 2006 data is used as the test collection.

Malik Muhammad Saad Missen, Mohand Boughanem
Improving Opinion Retrieval Based on Query-Specific Sentiment Lexicon

Lexicon-based approaches have been widely used for opinion retrieval due to their simplicity. However, no previous work has focused on the domain-dependency problem in opinion lexicon construction. This paper proposes simple feedback-style learning of a query-specific opinion lexicon using the set of top-retrieved documents in response to a query. The proposed learning starts from an initial domain-independent general lexicon and creates a query-specific lexicon by re-updating the opinion probabilities of the initial lexicon based on the top-retrieved documents. Experimental results on recent TREC test sets show that the query-specific lexicon provides a significant improvement over previous approaches, especially on BLOG-06 topics.
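A minimal Python sketch of feedback-style lexicon adaptation, interpolating each word's general-lexicon opinion probability with evidence from the top-retrieved documents; the interpolation rule, the weight, and the toy data are assumptions rather than the paper's exact update.

    # Sketch: re-estimate opinion probabilities of a general lexicon using the
    # top-retrieved (pseudo-relevant) documents for the query.
    from collections import Counter

    def query_specific_lexicon(general_lexicon, top_docs, alpha=0.5):
        counts = Counter(w for doc in top_docs for w in doc.lower().split())
        total = sum(counts.values())
        lexicon = {}
        for word, p_general in general_lexicon.items():
            p_feedback = counts[word] / total if total else 0.0
            lexicon[word] = (1 - alpha) * p_general + alpha * p_feedback
        return lexicon

    general = {"great": 0.9, "sharp": 0.3, "noisy": 0.7}
    top_docs = ["the lens is sharp but the sensor is noisy",
                "sharp images even at night noisy in video"]
    print(query_specific_lexicon(general, top_docs))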

Seung-Hoon Na, Yeha Lee, Sang-Hyob Nam, Jong-Hyeok Lee
Automatically Maintained Domain Knowledge: Initial Findings

This paper explores the use of implicit user feedback in adapting the underlying domain model of an intranet search system. The domain model, a Formal Concept Analysis (FCA) lattice, is used as an interactive interface to allow user exploration of the context of an intranet query. Implicit user feedback is harnessed here to surmount the difficulty of achieving optimum document descriptors, essential for a browsable lattice. We present the results of a first user study of query refinements proposed by our adapted lattice.

Deirdre Lungley, Udo Kruschwitz
A Framework of Evaluation for Question-Answering Systems

Evaluating a complex system is a complex task. Evaluation campaigns are organized each year to test different systems on global results, but they do not evaluate the relevance of the criteria used. Our purpose is to modify the intermediate results created by the components and insert the new results into the process, without modifying the components themselves. We describe our framework for glass-box evaluation.

Sarra El Ayari, Brigitte Grau
Combining Content and Context Similarities for Image Retrieval

CBIR has been a challenging problem and its performance relies on the underlying image similarity (distance) metric. Most existing metrics evaluate pairwise image similarity based only on image content, which is denoted as content similarity. In this study we propose a novel similarity metric that makes use of the image contexts in an image collection. The context of an image is built by constructing a vector in which each dimension represents the content similarity between the image and an image in the collection. The context similarity between two images is obtained by computing the similarity between the corresponding context vectors using vector similarity functions. The content similarity and the context similarity are then combined to evaluate the overall image similarity. Experimental results demonstrate that the use of the context similarity can significantly improve retrieval performance.
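A minimal Python sketch of the combination described above: context vectors are rows of a content-similarity matrix, context similarity is the cosine of two such rows, and the overall similarity is a weighted sum; the similarity matrix and the weight are illustrative assumptions.

    # Sketch: combine content similarity with a context similarity derived from
    # each image's vector of content similarities to the whole collection.
    import numpy as np

    # content_sim[i, j] = content similarity between images i and j (assumed given)
    content_sim = np.array([[1.0, 0.8, 0.1, 0.2],
                            [0.8, 1.0, 0.2, 0.1],
                            [0.1, 0.2, 1.0, 0.7],
                            [0.2, 0.1, 0.7, 1.0]])

    def context_similarity(i, j, sim):
        ci, cj = sim[i], sim[j]                    # context vectors = matrix rows
        return float(ci @ cj / (np.linalg.norm(ci) * np.linalg.norm(cj)))

    def combined_similarity(i, j, sim, weight=0.5):
        return weight * sim[i, j] + (1 - weight) * context_similarity(i, j, sim)

    print(round(combined_similarity(0, 1, content_sim), 3))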

Xiaojun Wan
Investigating the Global Semantic Impact of Speech Recognition Error on Spoken Content Collections

Errors in speech recognition transcripts have a negative impact on the effectiveness of content-based speech retrieval and present a particular challenge for collections containing conversational spoken content. We propose a Global Semantic Distortion (GSD) metric that measures the collection-wide impact of speech recognition error on spoken content retrieval in a query-independent manner. We deploy our metric to examine the effects of speech recognition substitution errors. First, we investigate frequent substitutions, cases in which the recognizer habitually mis-transcribes one word as another. Although habitual mistakes have a large global impact, the long tail of rare substitutions has a more damaging effect. Second, we investigate semantically similar substitutions, cases in which the word spoken and the word recognized do not diverge radically in meaning. Similar substitutions are shown to have slightly less global impact than semantically dissimilar substitutions.

Martha Larson, Manos Tsagkias, Jiyin He, Maarten de Rijke
Supervised Semantic Indexing

We present a class of models that are discriminatively trained to directly map from the word content in a query-document or document-document pair to a ranking score. Like Latent Semantic Indexing (LSI), our models take account of correlations between words (synonymy, polysemy). However, unlike LSI, our models are trained with a supervised signal directly on the task of interest, which we argue is the reason for our superior results. We provide an empirical study on Wikipedia documents, using the links to define document-document or query-document pairs, where we obtain state-of-the-art performance using our method.
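A minimal Python sketch of one common way to realise such a discriminatively trained scorer, a bilinear form score(q, d) = qᵀWd over bag-of-words vectors updated with a margin ranking step; the dimensions, the data, and the single SGD update are assumptions, not the authors' training procedure.

    # Sketch: a bilinear word-to-word scorer trained on (query, relevant,
    # non-relevant) triples with a hinge (margin ranking) loss.
    import numpy as np

    rng = np.random.default_rng(0)
    vocab = 6
    W = rng.normal(scale=0.01, size=(vocab, vocab))    # word-to-word weight matrix

    def score(q, d, W):
        return float(q @ W @ d)

    q   = np.array([1, 1, 0, 0, 0, 0], dtype=float)    # query bag of words
    d_p = np.array([1, 0, 1, 0, 0, 0], dtype=float)    # relevant document
    d_n = np.array([0, 0, 0, 1, 1, 0], dtype=float)    # non-relevant document

    margin = 1.0 - (score(q, d_p, W) - score(q, d_n, W))
    if margin > 0:                                      # hinge-loss SGD update
        lr = 0.1
        W += lr * (np.outer(q, d_p) - np.outer(q, d_n))

    print(score(q, d_p, W) > score(q, d_n, W))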

Bing Bai, Jason Weston, Ronan Collobert, David Grangier
Split and Merge Based Story Segmentation in News Videos

Segmenting videos into smaller, semantically related segments that ease access to the video data is a challenging open research problem. In this paper, we present a scheme for semantic story segmentation based on anchor person detection. The proposed model makes use of a split-and-merge mechanism to find story boundaries. The approach is based on visual features and text transcripts. The performance of the system was evaluated using TRECVid 2003 CNN and ABC videos. The results show that the system is on par with state-of-the-art classifier-based systems.

Anuj Goyal, P. Punitha, Frank Hopfgartner, Joemon M. Jose
Encoding Ordinal Features into Binary Features for Text Classification

We propose a method by means of which supervised learning algorithms that only accept binary input can be extended to use ordinal (i.e., integer-valued) input. This is much needed in text classification, since it thus becomes possible to endow these learning devices with term frequency information, rather than just information on the presence or absence of a term in a document. We test two different learners based on “boosting”, and show that the use of our method allows them to obtain effectiveness gains. We also show that one of these boosting methods, once endowed with the representations generated by our method, outperforms an SVM learner with tfidf-weighted input.
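A minimal Python sketch of one natural encoding consistent with this idea, a "thermometer" mapping where an integer term frequency tf becomes the binary features tf ≥ 1, ..., tf ≥ K; the cut-off K and the toy document are assumptions, and the paper's exact encoding may differ.

    # Sketch: encode ordinal term frequencies as cumulative binary indicators so
    # that a binary-input learner can still see frequency information.
    def thermometer_encode(tf_value, k_max=4):
        return [1 if tf_value >= k else 0 for k in range(1, k_max + 1)]

    def encode_document(term_freqs, k_max=4):
        features = {}
        for term, tf in term_freqs.items():
            for k, bit in enumerate(thermometer_encode(tf, k_max), start=1):
                features[f"{term}>={k}"] = bit
        return features

    print(encode_document({"retrieval": 3, "boosting": 1}))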

Andrea Esuli, Fabrizio Sebastiani
Topic and Trend Detection in Text Collections Using Latent Dirichlet Allocation

Algorithms that automatically mine distinct topics in document collections have become increasingly important due to their applications in many fields and the extensive growth of the number of documents in various domains. In this paper, we propose a generative model based on latent Dirichlet allocation that integrates the temporal ordering of the documents into the generative process in an iterative fashion. The document collection is divided into time segments, and the topics discovered in each segment are propagated to influence topic discovery in subsequent time segments. Our experimental results on a collection of academic papers from the CiteSeer repository show that the segmented topic model can effectively detect distinct topics and their evolution over time.

Levent Bolelli, Şeyda Ertekin, C. Lee Giles
Measuring Similarity of Geographic Regions for Geographic Information Retrieval

Representations of geographic regions play a decisive role in geographic information retrieval, where the query is specified by a conceptual part and a geographic part. One use of such representations is as a query footprint, which is then applied in the geographic ranking of documents. Users often specify textual descriptions of geographic regions that are not contained in the underlying gazetteer or geographic database. Approaches that automatically determine a geographic footprint for such locations have a strong need for measuring the quality of this footprint, both for evaluation and for automatic parameter learning. This quality is determined by the similarity between the footprint and a correct representation of the region.

In this paper we introduce three domain-specific points of view for measuring the similarity between representations of geographic regions for geographic information retrieval. For each point of view (strict similarity, visual similarity and similarity in ranking) we introduce a dedicated measure, two of which are novel measures proposed in this paper.
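A minimal Python sketch of one region-similarity measure in this spirit: intersection-over-union of two region footprints simplified to axis-aligned bounding boxes; the boxes and coordinates are assumptions, and the paper's three dedicated measures are not reproduced here.

    # Sketch: area-overlap similarity between an estimated footprint and a
    # reference region, both approximated by lon/lat bounding boxes.
    def box_area(box):
        (x1, y1, x2, y2) = box
        return max(0.0, x2 - x1) * max(0.0, y2 - y1)

    def overlap_similarity(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = box_area((ix1, iy1, ix2, iy2))
        union = box_area(a) + box_area(b) - inter
        return inter / union if union else 0.0

    estimated_footprint = (1.0, 43.4, 1.8, 43.8)   # assumed lon/lat box
    reference_region    = (1.2, 43.5, 1.7, 43.7)
    print(round(overlap_similarity(estimated_footprint, reference_region), 3))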

Andreas Henrich, Volker Lüdecke
Towards the Selection of Induced Syntactic Relations

We propose in this paper to use NLP approaches to validate induced syntactic relations. We focus on a Web validation system, a semantic vector-based approach, and finally a combined system. The semantic vector approach is a Roget-based approach that represents a syntactic relation as a vector. The Web validation technique uses a search engine to determine the relevance of a syntactic relation. We evaluate our approaches on a real-world data set. ROC curves are used to evaluate the results.

Nicolas Béchet, Mathieu Roche, Jacques Chauché
DiffPost: Filtering Non-relevant Content Based on Content Difference between Two Consecutive Blog Posts

One of the important issues in blog search engines is extracting the clean text from a blog post. In practice, this extraction process is confronted with many non-relevant contents in the original blog post, such as menus, banners and site descriptions, which make the ranking less effective. The problem is that these non-relevant contents are not encoded in a unified way but in many different ways across blog sites. A commercial blog search vendor would thus have to perform tuning work, such as crafting per-site rules for eliminating this non-relevant content, for every blog site; such tuning is a very inefficient process. Rather than this labor-intensive method, this paper first observes that much of this non-relevant content does not change between consecutive blog posts, and then proposes a simple and effective DiffPost algorithm to eliminate it based on the content difference between two consecutive blog posts on the same blog site. Experimental results on the TREC Blog track are remarkable, showing that the retrieval system using DiffPost achieves an important performance improvement of about 10% in MAP (Mean Average Precision) over the system without DiffPost.
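A minimal Python sketch of the DiffPost idea as described above: text blocks that also occur verbatim in the previous post of the same blog are treated as non-relevant and removed; the line-level block granularity and the toy posts are assumptions.

    # Sketch: strip blocks (here: lines) shared with the previous post of the
    # same blog, on the assumption that repeated blocks are template content.
    def diffpost(current_post, previous_post):
        previous_blocks = set(previous_post.splitlines())
        return "\n".join(line for line in current_post.splitlines()
                         if line not in previous_blocks)

    previous = "My Cooking Blog\nHome | About | Archive\nToday I baked bread."
    current  = "My Cooking Blog\nHome | About | Archive\nA new recipe for soup."
    print(diffpost(current, previous))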

Sang-Hyob Nam, Seung-Hoon Na, Yeha Lee, Jong-Hyeok Lee
An Unsupervised Approach to Product Attribute Extraction

Product attribute extraction is the task of automatically discovering attributes of products from text descriptions. In this paper, we propose a new approach, both unsupervised and domain independent, to extract these attributes. With our approach, we achieve 92% precision and 62% recall in our experiments. Our experiments with varying dataset sizes show the robustness of our algorithm. We also show that as few as 5 descriptions provide enough information to identify attributes.

Santosh Raju, Prasad Pingali, Vasudeva Varma

Workshops

Workshop on Contextual Information Access, Seeking and Retrieval Evaluation

The main purpose of this workshop is to bring together IR researchers working on or interested in the evaluation of approaches to contextual information access, seeking and retrieval, and to let them share their latest research results, express their opinions on related issues, and promote discussion on future directions of evaluation.

Bich-Liên Doan, Joemon M. Jose, Massimo Melucci, Lynda Tamine-Lechani
Workshop on Information Retrieval over Social Networks

Popular online communities and services such as Flickr, Youtube, Facebook or LinkedIn are spearheading an emerging type of information on the Web. This information is composed of classical textual and multimedia data, in concert with additional data (tags, annotations, comments, ratings). Perhaps most significantly, the information is overlaid on an explicit social network created by the participants of each of these communities. The result is a rich structure of interrelationships between content items, participants and services. Although the size of such networks requires the use of advanced Information Retrieval techniques, classical IR models are not tailored to this type of content as they do not (in general) take advantage of the particular structure and unique aspects of this socially-driven content. This workshop will report on the state of the art in this area and gather a relevant panel of researchers working in the field. The workshop will consist of research papers that address Information Retrieval over Social Networks, including:

Applications of Information Retrieval over Social Network

Adapted IR models for Social Networks

Mining Social Network data

Privacy issues in Social Network information retrieval

Trust and Reliability issues in Social Network information retrieval

Knowledge and Content Discovery in Social Networks

Information diffusion over Social Networks

Performance evaluation for the above (measures, test collections)

Stephane Marchand-Maillet, Arjen P. de Vries, Mor Naaman
Workshop on Geographic Information on the Internet (GIIW)

Finding geographically-based information constitutes a common use of Web search engines, for a variety of user needs. With the rapid growth of the volume of geographically-related information on the Web, efficient and adaptable ways of tagging, browsing and accessing relevant documents still need to be found. Structuring and mashing up geographic information from different Web data sources is an appealing alternative to long-term efforts at manually creating large-scale geographic resources such as the Alexandria Digital Library or Geonames, whose construction is costly and not necessarily adapted to specific applications.

Gregory Grefenstette, Pierre-Alain Moëllic, Adrian Popescu, Florence Sèdes

Tutorials

Current Developments in Information Retrieval Evaluation

In the last decade, many evaluation results have been created within evaluation initiatives like TREC, NTCIR and CLEF. The large amount of data available has led to substantial research on the validity of the evaluation procedure. An evaluation based on the Cranfield paradigm basically requires topics as descriptions of information needs, a document collection, systems to compare, human jurors to judge the documents retrieved by the systems against the information-need descriptions, and some metric to compare the systems. All of these elements have been the subject of scientific discussion. How many topics, systems, jurors and juror decisions are necessary to achieve valid results? How can validity be measured? Which metrics are the most reliable and which are appropriate from a user perspective? Examples from current CLEF experiments are used to illustrate some of these issues.

User-based evaluations confront test users with the results of search systems and let them solve information tasks given in the experiment. In such a test setting, the performance of a user can be measured by observing the number of relevant documents he or she finds. This measure can be compared to a gold standard of relevance for the search topic to see whether the perceived performance correlates with an objective notion of relevance defined by a juror. In addition, users can be asked about their satisfaction with the search system and its results. In recent years, there has been growing concern about how well the results of batch and user studies correlate. When systems improve in a batch comparison and bring more relevant documents into the result list, do users benefit from this improvement? Are users more satisfied with better result lists, and do better systems enable them to find more relevant documents? Some studies could not confirm this relation between system performance and user satisfaction.

Thomas Mandl
Information Extraction and Linking in a Retrieval Context

We witness growing interest in, and capabilities of, automatic content recognition (often referred to as information extraction) in various media sources, which identifies entities (e.g. persons, locations and products) and their semantic attributes (e.g., opinions expressed towards persons or products, relations between entities). These extraction techniques are most advanced for text sources, but they are also researched for other media, for instance for recognizing persons and objects in images or video. The extracted information enriches and adds semantic meaning to documents and queries (the latter, e.g., in a relevance feedback setting). In addition, content recognition techniques trigger automated linking of information across documents and even across media. This situation poses a number of opportunities and challenges for retrieval and ranking models. For instance, instead of returning full documents, information extraction provides the means to return very focused results in the form of entities such as persons and locations. Another challenge is to integrate content recognition and content retrieval as much as possible, for instance by using the probabilistic output from the information extraction tools in the retrieval phase. These approaches are important steps towards semantic search, i.e., retrieval approaches that truly use the semantics of the data.

Marie-Francine Moens, Djoerd Hiemstra
Mining Query Logs

Web Search Engines (WSEs) have stored information about users in their query logs since they started to operate. This information serves many purposes. The primary focus of this tutorial is to introduce the discipline of query log mining. We show its foundations by giving a unified view of the literature on query log analysis, and present in detail the basic algorithms and techniques that can be used to extract useful knowledge from this (potentially) infinite source of information. Finally, we discuss how the extracted knowledge can be exploited to improve different quality features of a WSE, mainly its effectiveness and efficiency.

Salvatore Orlando, Fabrizio Silvestri
Backmatter
Metadata
Title: Advances in Information Retrieval
Edited by: Mohand Boughanem, Catherine Berrut, Josiane Mothe, Chantal Soule-Dupuy
Copyright year: 2009
Publisher: Springer Berlin Heidelberg
Electronic ISBN: 978-3-642-00958-7
Print ISBN: 978-3-642-00957-0
DOI: https://doi.org/10.1007/978-3-642-00958-7
