
2012 | Book

Advances in Information Retrieval

34th European Conference on IR Research, ECIR 2012, Barcelona, Spain, April 1-5, 2012. Proceedings

Editors: Ricardo Baeza-Yates, Arjen P. de Vries, Hugo Zaragoza, B. Barla Cambazoglu, Vanessa Murdock, Ronny Lempel, Fabrizio Silvestri

Publisher: Springer Berlin Heidelberg

Book Series: Lecture Notes in Computer Science


About this book

This book constitutes the proceedings of the 34th European Conference on IR Research, ECIR 2012, held in Barcelona, Spain, in April 2012. The 37 full papers, 28 poster papers and 7 demonstrations presented in this volume were carefully reviewed and selected from 167 submissions. The contributions are organized in sections named: query representation; blogs and online-community search; semi-structured retrieval; evaluation; applications; retrieval models; image and video retrieval; text and content classification, categorisation, clustering; systems efficiency; industry track; and posters.

Table of Contents

Frontmatter

Query Representation

Explaining Query Modifications: An Alternative Interpretation of Term Addition and Removal

In the course of a search session, searchers often modify their queries several times. In most previous work analyzing search logs, the addition of terms to a query is identified with query specification and the removal of terms with query generalization. By analyzing the result sets that motivated searchers to make modifications, we show that this interpretation is not always correct. In fact, our experiments indicate that in the majority of cases the modifications have the opposite functions. Terms are often removed to get rid of irrelevant results matching only part of the query and thus to make the result set more specific. Similarly, terms are often added to retrieve more diverse results. We propose an alternative interpretation of term additions and removals and show that it explains the deviant modification behavior that was observed.

Vera Hollink, Jiyin He, Arjen de Vries
Modeling Transactional Queries via Templates

Search queries have been roughly classified into three categories – navigational, informational and transactional. The latter group includes queries that aim to perform some Web-mediated task, often by interacting with parameterized Web services. In order to assist users in completing tasks online, one of the first building blocks is identifying whether and which transactional use-case is associated with each query.

This paper describes a framework and an algorithm for automatically generating compact representations of queries associated with transactional use cases. We mine search click logs for queries that lead to clicks on pages associated with a use-case, generalize the set of mined queries into templates by replacing query terms with taxonomy categories, and eliminate redundancies. This approach allows associating the use-case with queries unseen in the log sample, while keeping a concise model. Our methodology allows a business owner to select an appropriate operating point that balances the tradeoff between precision and recall. We report the results of an offline evaluation of our framework on three transactional domains, and demonstrate the viability of the approach.

Edward Bortnikov, Pinar Donmez, Amit Kagian, Ronny Lempel
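As a concrete illustration of the template idea, consider the following minimal Python sketch: query terms found in a taxonomy are replaced by their categories, and duplicate templates are collapsed. The taxonomy, the queries, and the one-term-per-token lookup are invented for illustration and are not the authors' data or exact algorithm.

```python
# Minimal sketch of query-to-template generalization; taxonomy and
# queries are hypothetical, not the authors' data.
TAXONOMY = {
    "paris": "CITY", "london": "CITY",
    "cheap": "PRICE_MOD", "budget": "PRICE_MOD",
}

def to_template(query: str) -> tuple:
    """Replace each taxonomy term with its category, keep other terms."""
    return tuple(TAXONOMY.get(t, t) for t in query.lower().split())

mined = ["cheap flights paris", "budget flights london", "cheap flights london"]
templates = {to_template(q) for q in mined}   # set() removes redundant templates
print(templates)                              # {('PRICE_MOD', 'flights', 'CITY')}
```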
Exploring Query Patterns in Email Search

Despite email being the most popular communication medium currently in use, and despite evidence that people regularly re-use messages, very little is known about how people actually search within email clients. In this paper we present a detailed analysis of email search behaviour obtained from a study of 47 users. We uncover a number of behavioural patterns that contrast with those previously observed in web search. From our findings, we describe ways in which email search could be improved and conclude with a short discussion of possible future work.

Morgan Harvey, David Elsweiler
Interactive Search Support for Difficult Web Queries

Short and common web queries are aptly supported by state-of-the-art search engines but performance and user experience are degraded when web queries are longer and less common. Extending previous solutions that automatically shorten queries, we introduce searchAssist: a novel search interface that provides interactive support for difficult web queries. The query logs and questionnaires from a naturalistic study of 90 web users’ search behaviors show that the usage rate of searchAssist for difficult queries was almost 40%. The results also highlight the importance of term dropping for long queries, and the improvements obtained in topical relevance when our searchers used searchAssist.

Abdigani Diriye, Giridhar Kumaran, Jeff Huang

Blog and Online-Community Search

Predicting the Future Impact of News Events

The amount of news content on the Web is increasing, enabling users to access news articles from a variety of sources: newswires, news agencies, blogs, and various other places, even within Web search engine result pages. Nevertheless, it remains a challenge for current search engines to decide which news events are worth showing to the user (either for a newsworthy query or in a news portal). In this paper we define the task of predicting the future impact of news events. Being able to predict event impact will, for example, enable a newspaper to decide whether or not to follow a specific event, or a news search engine to decide which stories to display. We define a flexible framework that, given some definition of impact, can predict its future development at the beginning of the event. We evaluate several possible definitions of event impact and experimentally identify the best features for each of them.

Julien Gaugaz, Patrick Siehndel, Gianluca Demartini, Tereza Iofciu, Mihai Georgescu, Nicola Henze
Detection of News Feeds Items Appropriate for Children

Identifying child-appropriate web content is an important yet difficult classification task. This novel task is characterised by attempting to determine age/child appropriateness (which is not necessarily topic-based), despite the presence of unbalanced class sizes and the lack of quality training data with human judgements of appropriateness. Classification of feeds, a subset of web content, presents further challenges due to their temporal nature and short document format. In this paper, we discuss these challenges and present baseline results for this task through an empirical study that classifies incoming news stories as appropriate (or not) for children. We show that while the naïve Bayes approach produces a higher AUC, it is vulnerable to the imbalanced-data problem, and that a support vector machine provides a more robust overall solution. Our research shows that classifying children's content is a non-trivial task with greater complexities than standard text-based classification. While the F-score values are consistent with other research examining age-appropriate text classification, we introduce a new problem with a new dataset.

Tamara Polajnar, Richard Glassey, Leif Azzopardi
Comparing Tweets and Tags for URLs

The free-form tags available from social bookmarking sites such as Delicious have been shown to be useful for a number of purposes and could serve as a cheap source of metadata about URLs on the web. Unfortunately, recent years have seen a reduction in the popularity of such sites, while at the same time microblogging sites such as Twitter have exploded in popularity. On these sites users submit short messages (or “tweets”) about what they are currently reading, thinking and doing, and often post URLs.

In this work we look into the similarity between top tags drawn from Delicious and high-frequency terms from tweets to ascertain whether Twitter data could serve as a useful replacement for Delicious. We investigate how these terms compare with web page content, whether or not top Twitter terms converge, and whether the terms are mostly descriptive (and therefore useful) or mostly express sentiment or emotion. We find that, provided a large number of tweets referring to a chosen URL are available, the top terms drawn from these tweets are similar to Delicious tags and could therefore be used for similar purposes.

Morgan Harvey, Mark Carman, David Elsweiler
Geo-Location Estimation of Flickr Images: Social Web Based Enrichment

Estimating the geographic location of images is a task which has received a lot of attention in recent years. Large numbers of items uploaded to Flickr do not contain GPS-based latitude/longitude coordinates, although such geographic information would be beneficial for a wide variety of potential applications such as travelogues and visual place descriptions. While most works in this area consider an image’s textual meta-data to estimate its geo-location, we consider an additional textual dimension: the image owner’s traces on the social Web, in particular on the micro-blogging platform Twitter. We investigate the following question: does enriching an image’s available textual meta-data with a user’s tweets improve the accuracy of the geographic location estimation process? The results show that this is indeed the case: in an oracle setting, the median error in kilometres decreases by 87%; with the best automatic approach, the median error decreases by 56%.

Claudia Hauff, Geert-Jan Houben

Semi-structured Retrieval

A Field Relevance Model for Structured Document Retrieval

Many search applications involve documents with structure or fields. Since query terms are often related to specific structural components, mapping queries to fields and assigning weights to those fields is critical for retrieval effectiveness. Although several field-based retrieval models have been developed, there has not been a formal justification of field weighting.

In this work, we aim to improve field weighting for structured document retrieval. We first introduce the notion of field relevance as a generalization of field weights, and discuss how it can be estimated using relevant documents, which effectively implements relevance feedback for field weighting. We then propose a framework for estimating field relevance based on the combination of several sources. Evaluation on several structured document collections shows that field weighting based on the suggested framework improves retrieval effectiveness significantly.

Jin Young Kim, W. Bruce Croft
Relation Based Term Weighting Regularization

Traditional retrieval models compute term weights based only on information related to individual terms, such as TF and IDF. However, query terms are related. Intuitively, these relations could provide useful information about the importance of a term in the context of other query terms. For example, the query “perl tutorial” specifies that a user is looking for information relevant to both perl and tutorial. Thus, a document containing both terms should have a higher relevance score than one containing only one of them. However, if the IDF value of “tutorial” is much smaller than that of “perl”, existing retrieval models may assign the document a lower score than documents containing multiple occurrences of “perl”. Clearly, the importance of a term should depend not only on collection statistics but also on its relations with other query terms. In this work, we study how to utilize semantic relations among query terms to regularize term weighting. Experimental results over TREC collections show that the proposed strategy is effective in improving retrieval performance.

Hao Wu, Hui Fang
A New Approach to Answerer Recommendation in Community Question Answering Services

Community Question Answering (CQA) services, which enable users to ask and answer questions, have become popular on the Web. However, many questions are never resolved by appropriate answerers. To address this problem, we present a novel approach to recommend the users who are most likely to be able to answer a new question. Unlike previous methods, this approach simultaneously utilizes the inherent semantic relations among asker, question, and answerer, and performs the answerer recommendation task based on tensor factorization. Experimental results on two real-world CQA datasets show that the proposed method is able to recommend appropriate answerers for new questions and outperforms other state-of-the-art approaches.

Zhenlei Yan, Jie Zhou
On the Modeling of Entities for Ad-Hoc Entity Search in the Web of Data

The Web of Data describes objects, entities, or “things” in terms of their attributes and their relationships, using RDF statements. There is a need to make this wealth of knowledge easily accessible by means of keyword search. Despite recent research efforts in this direction, there is a lack of understanding of how structured semantic data is best represented for text-based entity retrieval. The task we are addressing in this paper is ad-hoc entity search: the retrieval of RDF resources that represent an entity described in the keyword query. We build upon and formalise existing entity modeling approaches within a generative language modeling framework, and compare them experimentally using a standard test collection, provided by the Semantic Search Challenge evaluation series. We show that these models outperform the current state-of-the-art in terms of retrieval effectiveness, however, this is done at the cost of abandoning a large part of the semantics behind the data. We propose a novel entity model capable of preserving the semantics associated with entities, without sacrificing retrieval effectiveness.

Robert Neumayer, Krisztian Balog, Kjetil Nørvåg
Result Disambiguation in Web People Search

We study the problem of disambiguating the results of a web people search engine: given a query consisting of a person name plus the result pages for this query, find correct referents for all mentions by clustering the pages according to the different people sharing the name. While the problem has been studied extensively, we discover that the increasing availability of results retrieved from social media platforms causes state-of-the-art methods to break down. We analyze the problem and propose a dual strategy where we distinguish between results obtained from social media platforms and those obtained from other sources. In our dual strategy, the two types of documents are disambiguated separately, using different strategies, and their results are then merged. We study several instantiations for the different stages in our proposed strategy and manage to achieve state-of-the-art performance.

Richard Berendsen, Bogomil Kovachev, Evangelia-Paraskevi Nastou, Maarten de Rijke, Wouter Weerkamp

Evaluation

On Smoothing Average Precision

On the basis of a theoretical analysis of issues around populations and sampling, for both topics and documents, and parameters with which we hope to characterise the effectiveness of different systems, we propose a modification to the traditional average precision metric. This modification involves both transformation and (in the estimation of the parameter) smoothing. The modified version is shown to have certain distributional advantages, on a substantial dataset. In particular, the distribution of values of the modified metric, over topics for a given system/run, is approximately normal.

Stephen Robertson
New Metrics for Meaningful Evaluation of Informally Structured Speech Retrieval

Search effectiveness for tasks where the retrieval units are clearly defined documents is generally evaluated using standard measures such as mean average precision (MAP). However, many practical speech search tasks focus on content within large spoken files lacking defined structure. These data must be segmented into smaller units for search, which may only partially overlap with relevant material. We introduce two new metrics for the evaluation of search effectiveness for informally structured speech data: mean average segment precision (MASP), which measures retrieval performance in terms of both content segmentation and ranking with respect to relevance; and mean average segment distance-weighted precision (MASDWP), which takes into account the distance between the start of the relevant segment and the retrieved segment. We demonstrate the effectiveness of these new metrics on a retrieval test collection based on the AMI meeting corpus.

Maria Eskevich, Walid Magdy, Gareth J. F. Jones
On Aggregating Labels from Multiple Crowd Workers to Infer Relevance of Documents

We consider the problem of acquiring relevance judgements for information retrieval (IR) test collections through crowdsourcing when no true relevance labels are available. We collect multiple, possibly noisy relevance labels per document from workers of unknown labelling accuracy. We use these labels to infer the document relevance based on two methods. The first method is the commonly used majority voting (MV) which determines the document relevance based on the label that received the most votes, treating all the workers equally. The second is a probabilistic model that concurrently estimates the document relevance and the workers accuracy using expectation maximization (EM). We run simulations and conduct experiments with crowdsourced relevance labels from the INEX 2010 Book Search track to investigate the accuracy and robustness of the relevance assessments to the noisy labels. We observe the effect of the derived relevance judgments on the ranking of the search systems. Our experimental results show that the EM method outperforms the MV method in the accuracy of relevance assessments and IR systems ranking. The performance improvements are especially noticeable when the number of labels per document is small and the labels are of varied quality.

Mehdi Hosseini, Ingemar J. Cox, Nataša Milić-Frayling, Gabriella Kazai, Vishwa Vinay
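To make the contrast between the two aggregation methods concrete, the following Python sketch implements majority voting and a simple one-coin EM that jointly estimates document relevance and worker accuracy. It is an illustrative baseline of the general technique; the paper's probabilistic model may differ in its details.

```python
from collections import defaultdict

def majority_vote(labels):
    """labels: dict mapping (doc, worker) -> 0/1 relevance vote."""
    votes = defaultdict(list)
    for (doc, _), y in labels.items():
        votes[doc].append(y)
    return {d: int(sum(v) * 2 >= len(v)) for d, v in votes.items()}

def em_relevance(labels, iters=20):
    """One-coin EM: alternate between per-document relevance posteriors
    (E-step) and per-worker accuracy estimates (M-step)."""
    docs = {d for d, _ in labels}
    workers = {w for _, w in labels}
    post = {d: 0.5 for d in docs}
    acc = {w: 0.8 for w in workers}              # initial accuracy guess
    for _ in range(iters):
        for d in docs:                           # E-step
            pr = pn = 1.0
            for (dd, w), y in labels.items():
                if dd == d:
                    pr *= acc[w] if y == 1 else 1 - acc[w]
                    pn *= acc[w] if y == 0 else 1 - acc[w]
            post[d] = pr / (pr + pn)
        hits = defaultdict(float)                # M-step
        total = defaultdict(int)
        for (d, w), y in labels.items():
            hits[w] += post[d] if y == 1 else 1 - post[d]
            total[w] += 1
        acc = {w: hits[w] / total[w] for w in workers}
    return post, acc

labels = {("d1", "w1"): 1, ("d1", "w2"): 1, ("d1", "w3"): 0,
          ("d2", "w1"): 0, ("d2", "w2"): 0, ("d2", "w3"): 1}
print(majority_vote(labels))
print(em_relevance(labels)[0])   # EM downweights the consistently dissenting w3
```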

Applications

How Random Walks Can Help Tourism

On-line photo sharing services allow users to share their touristic experiences. Tourists can publish photos of interesting locations or monuments visited, and they can also share comments, annotations, and even the GPS traces of their visits. By analyzing such data, it is possible to turn colorful photos into metadata-rich trajectories through the points of interest present in a city.

In this paper we propose a novel algorithm for the interactive generation of personalized recommendations of touristic places of interest, based on knowledge mined from photo albums and Wikipedia. Our approach has several distinguishing features. First, the underlying recommendation model is built fully automatically in an unsupervised way and can easily be extended with heterogeneous sources of information. Moreover, recommendations are personalized according to the places previously visited by the user. Finally, such personalized recommendations can be generated very efficiently, even on-line from a mobile device.

Claudio Lucchese, Raffaele Perego, Fabrizio Silvestri, Hossein Vahabi, Rossano Venturini
Retrieving Candidate Plagiarised Documents Using Query Expansion

External plagiarism detection systems compare suspicious texts against a reference collection to identify the original one(s). The suspicious text may not contain a verbatim copy of the reference collection since plagiarists often try to disguise their behaviour by altering the text. For large reference collections, such as those accessible via the internet, it is not practical to compare the suspicious text with every document in the reference collection. Consequently many approaches to plagiarism detection begin by identifying a set of candidate documents from the reference collection. We report an IR-based approach to the candidate document selection problem that uses query expansion to identify candidates which have been altered. The reported system outperforms a previously reported approach and is also robust to changes in the reference collection text.

Rao Muhammad Adeel Nawab, Mark Stevenson, Paul Clough
Reliability Prediction of Webpages in the Medical Domain

In this paper, we study how to automatically predict reliability of web pages in the medical domain. Assessing reliability of online medical information is especially critical as it may potentially influence vulnerable patients seeking help online. Unfortunately, there are no automated systems currently available that can classify a medical webpage as being reliable, while manual assessment cannot scale up to process the large number of medical pages on the Web. We propose a supervised learning approach to automatically predict reliability of medical webpages. We developed a gold standard dataset using the standard reliability criteria defined by the Health on Net Foundation and systematically experimented with different link and content based feature sets. Our experiments show promising results with prediction accuracies of over 80%. We also show that our proposed prediction method is useful in applications such as reliability-based re-ranking and automatic website accreditation.

Parikshit Sondhi, V. G. Vinod Vydiswaran, ChengXiang Zhai
Automatic Foldering of Email Messages: A Combination Approach

Automatic organization of email messages into folders is both an open problem and a challenge for machine learning techniques. Besides the effect of email overload, which affects many email users worldwide, there are increasing difficulties caused by the semantics applied by each user. The varying number of folders and their meaning are personal and in many cases pose difficulties for learning methods. This paper addresses automatic organization of email messages into folders, based on supervised learning algorithms. The textual fields of the email message (subject and body) are considered for learning, with different representations, feature selection methods, and classifiers. The participant fields are embedded into a vector-space model representation. The classification decisions from the different email fields are combined by majority voting. Experiments on a subset of the Enron Corpus and on a private email dataset show significant improvement over both single classifiers on these fields as well as over previous works.

Tony Tam, Artur Ferreira, André Lourenço

Retrieval Models

A Log-Logistic Model-Based Interpretation of TF Normalization of BM25

The effectiveness of the BM25 retrieval function is mainly due to its sub-linear term frequency (TF) normalization component, which is controlled by a parameter k1. Although BM25 was derived from the classic probabilistic retrieval model, it has so far been unclear how to interpret its parameter k1 probabilistically, making it hard to optimize the setting of this parameter. In this paper, we provide a novel probabilistic interpretation of the BM25 TF normalization and its parameter k1, based on a log-logistic model for the probability of seeing a document in the collection with a given level of TF. The proposed interpretation allows us to derive different approaches to the estimation of k1 based solely on the current collection, without requiring any training data, thus effectively eliminating one free parameter from BM25. Our experimental results show that the proposed approaches can accurately predict the optimal k1 without training data and achieve retrieval performance better than or comparable to a well-tuned BM25 in which k1 is optimized on training data.

Yuanhua Lv, ChengXiang Zhai
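For reference, the component being interpreted is BM25's sub-linear TF normalization; the sketch below shows how k1 caps the contribution of repeated term occurrences. The paper's log-logistic estimation of k1 is not reproduced here.

```python
def bm25_tf(tf, k1=1.2, b=0.75, dl=100.0, avgdl=100.0):
    """BM25's sub-linear TF component: the contribution saturates toward
    k1 + 1 as raw term frequency grows; b controls length normalization."""
    norm_tf = tf / (1 - b + b * dl / avgdl)
    return norm_tf * (k1 + 1) / (norm_tf + k1)

for tf in (1, 2, 5, 20):
    print(tf, round(bm25_tf(tf), 3))   # 1.0, 1.375, 1.774, 2.075 -> cap 2.2
```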
Score Transformation in Linear Combination for Multi-criteria Relevance Ranking

In many Information Retrieval (IR) tasks, documents should be ranked based on a combination of multiple criteria. We therefore need to score a document on each criterion of relevance and then combine the criterion scores into a final score for each document. Linear combination of these aspect scores has so far been the dominant approach due to its simplicity and effectiveness. However, such a combination strategy requires that the scores to be combined are “comparable” to each other, an assumption that generally does not hold due to the different ways of scoring each criterion. Thus it is necessary to transform the raw scores for the different criteria appropriately to make them more comparable before combination. In this paper we propose a new principled approach to score transformation in linear combination, in which we learn a separate non-linear transformation function for each relevance criterion based on the Alternating Conditional Expectation (ACE) algorithm and the Box-Cox transformation. Experimental results show that the proposed method is effective and robust against non-linear perturbations of the original scores.

Shima Gerani, ChengXiang Zhai, Fabio Crestani
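The Box-Cox transform mentioned in the abstract is a standard power transform; a minimal sketch of using it to bring two criterion scores onto more comparable scales follows. The scores, weights, and lambda values are hypothetical, and the paper learns the per-criterion transformations with ACE rather than fixing them by hand.

```python
import math

def box_cox(x, lam):
    """Box-Cox power transform; reduces to log(x) as lambda -> 0."""
    return math.log(x) if abs(lam) < 1e-9 else (x ** lam - 1.0) / lam

# Hypothetical raw scores from two criteria on very different scales:
bm25_scores = [12.3, 8.1, 4.2]
prior_scores = [0.9, 0.5, 0.1]
lam_bm25, lam_prior = 0.2, 1.0          # per-criterion lambdas (illustrative)
combined = [0.5 * box_cox(s1, lam_bm25) + 0.5 * box_cox(s2, lam_prior)
            for s1, s2 in zip(bm25_scores, prior_scores)]
print(combined)
```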
Axiomatic Analysis of Translation Language Model for Information Retrieval

Statistical translation models have been shown to outperform simple document language models which rely on exact matching of words in the query and documents. A main challenge in applying translation models to ad hoc information retrieval is to estimate a translation model without training data. In this paper, we perform axiomatic analysis of translation language model for retrieval in order to gain insights about how to optimize the estimation of translation probabilities. We propose a set of constraints that a reasonable translation language model should satisfy. We check these constraints on the state-of-the-art translation estimation method based on Mutual Information and find that it does not satisfy most of the constraints. We then propose a new estimation method that better satisfies the defined constraints. Experimental results on representative TREC data sets show that the proposed new estimation method outperforms the existing Mutual Information-based estimation, suggesting that the proposed constraints are indeed helpful for designing better estimation methods for translation language model.

Maryam Karimzadehgan, ChengXiang Zhai
An Information-Based Cross-Language Information Retrieval Model

We present in this paper well-founded cross-language extensions of the recently introduced models in the information-based family for information retrieval, namely the LL (log-logistic) and SPL (smoothed power law) models of [4]. These extensions are based on (a) a generalization of the notion of information used in the information-based family, (b) a generalization of the random variables also used in this family, and (c) the direct expansion of query terms with their translations. We then review these extensions from a theoretical point-of-view, prior to assessing them experimentally. The results of the experimental comparisons between these extensions and existing CLIR systems, on three collections and three language pairs, reveal that the cross-language extension of the LL model provides a state-of-the-art CLIR system, yielding the best performance overall.

Bo Li, Eric Gaussier
Extended Expectation Maximization for Inferring Score Distributions

Inferring the distributions of relevant and nonrelevant documents over a ranked list of scored documents returned by a retrieval system has a broad range of applications including information filtering, recall-oriented retrieval, metasearch, and distributed IR. Typically, the distribution of documents over scores is modeled by a mixture of two distributions, one for the relevant and one for the nonrelevant documents, and expectation maximization (EM) is run to estimate the mixture parameters. A large volume of work has focused on selecting the appropriate form of the two distributions in the mixture. In this work we consider the form of the distributions as a given and we focus on the inference algorithm. We extend the EM algorithm (a) by simultaneously considering the ranked lists of documents returned by multiple retrieval systems, and (b) by encoding in the algorithm the constraint that the same document retrieved by multiple systems should have the same, global, probability of relevance. We test the new inference algorithm using TREC data and we demonstrate that it outperforms the regular EM algorithm. It is better calibrated in inferring the probability of document’s relevance, and it is more effective when applied on the task of metasearch.

Keshi Dai, Virgil Pavlu, Evangelos Kanoulas, Javed A. Aslam
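As background, the following sketch shows the regular (non-extended) EM the paper builds on, fitted to a single system's scores with a two-normal mixture. The paper's contribution, running EM jointly over multiple systems' ranked lists with a shared per-document relevance probability, is not reproduced here.

```python
import math

def pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_two_normal(scores, iters=50):
    """Fit w*N(mu1, s1) + (1-w)*N(mu2, s2) to one system's document scores."""
    w, mu1, mu2 = 0.5, max(scores), min(scores)
    s1 = s2 = (max(scores) - min(scores)) / 4 or 1.0
    for _ in range(iters):
        # E-step: responsibility of the "relevant" component for each score
        r = [w * pdf(x, mu1, s1) / (w * pdf(x, mu1, s1) + (1 - w) * pdf(x, mu2, s2))
             for x in scores]
        # M-step: re-estimate mixture weight, means, standard deviations
        n1 = sum(r)
        n2 = len(scores) - n1
        w = n1 / len(scores)
        mu1 = sum(ri * x for ri, x in zip(r, scores)) / n1
        mu2 = sum((1 - ri) * x for ri, x in zip(r, scores)) / n2
        s1 = max(math.sqrt(sum(ri * (x - mu1) ** 2 for ri, x in zip(r, scores)) / n1), 1e-6)
        s2 = max(math.sqrt(sum((1 - ri) * (x - mu2) ** 2 for ri, x in zip(r, scores)) / n2), 1e-6)
    return w, (mu1, s1), (mu2, s2)

scores = [9.1, 8.7, 8.2, 4.0, 3.6, 3.1, 2.8, 2.5]
print(em_two_normal(scores))
```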
Top-k Retrieval Using Facility Location Analysis

The top-k retrieval problem aims to find the optimal set of k documents from a number of relevant documents given the user’s query. The key issue is to balance the relevance and diversity of the top-k search results. In this paper, we address this problem using Facility Location Analysis, taken from Operations Research, where the locations of facilities are optimally chosen according to some criteria. We show how this analysis technique is a generalization of state-of-the-art retrieval models for diversification (such as the Modern Portfolio Theory for Information Retrieval), which treat the top-k search results like “obnoxious facilities” that should be dispersed as far as possible from each other. However, Facility Location Analysis suggests that the top-k search results could instead be treated like “desirable facilities” to be placed as close as possible to their customers. This leads to a new top-k retrieval model in which the best representatives of the relevant documents are selected. In a series of experiments conducted on two TREC diversity collections, we show that significant improvements can be made over the current state-of-the-art through this alternative treatment of the top-k retrieval problem.

Guido Zuccon, Leif Azzopardi, Dell Zhang, Jun Wang
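A minimal sketch of the "desirable facilities" view: greedily select k documents so that every candidate lies close to some chosen representative. Toy 2-D points stand in for document similarities, and the greedy objective is an illustrative approximation rather than the authors' exact model.

```python
def greedy_representatives(points, k):
    """Greedily pick k facilities minimizing total distance from each
    point to its nearest selected facility (a k-median-style heuristic)."""
    def dist(p, q):
        return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
    selected = []
    for _ in range(k):
        def cost_with(c):
            return sum(min(dist(p, s) for s in selected + [c]) for p in points)
        best = min((p for p in points if p not in selected), key=cost_with)
        selected.append(best)
    return selected

docs = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 8), (8, 9)]
print(greedy_representatives(docs, 2))   # one representative per cluster
```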

Image and Video Retrieval

An Interactive Paper and Digital Pen Interface for Query-by-Sketch Image Retrieval

A major challenge when dealing with large collections of digital images is to find relevant objects, especially when no metadata on the objects is available. Content-based image retrieval (CBIR) addresses this problem but usually lacks query images that are good enough to express the user’s information need. Therefore, in Query-by-Sketch, CBIR has been considered with user-provided sketches as query objects, but so far this has suffered from the limitations of existing user interfaces. In this paper, we present a novel user interface for query by sketch that exploits emergent interactive paper and digital pen technology. Users can draw sketches on paper in a user-friendly way, and search can be started interactively from the paper front-end thanks to a streaming interface from the digital pen to the underlying CBIR system. We present the implementation of the interactive paper/digital pen interface on top of QbS, our system for CBIR using sketches, and we present in detail the evaluation of the system on the basis of the MIRFLICKR-25000 image collection.

Roman Kreuzer, Michael Springmann, Ihab Al Kabary, Heiko Schuldt
Image Abstraction in Crossmedia Retrieval for Text Illustration

Text illustration is a multimedia retrieval task that consists in finding suitable images to illustrate text fragments such as blog entries, news reports or children’s stories. In this paper we describe a crossmedia retrieval system which, given a textual input, selects a short list of candidate images from a large media collection. This approach makes use of a recently proposed method to map metadata and visual features into a common textual representation that can be handled by traditional information retrieval engines. Content-based analysis is enhanced by visual abstraction, namely the Anisotropic Kuwahara Filter, which impacts the feature information captured by the Joint Composite and Speeded Up Robust Features visual descriptors. For evaluation purposes, we used the well-established MIRFlickr photo collection, with 25,000 photos and user tags collected from Flickr as well as manual annotations provided as image retrieval ground truth. Results show that image abstraction can improve visual retrieval as well as significantly reduce processing and storage requirements, even more so when paired with Google’s WebP image format. We conclude that applying a visual rerank after an initial text retrieval step improves the quality of results, and that the adopted text mapping method for visual descriptors provides an effective crossmedia approach for text illustration.

Filipe Coelho, Cristina Ribeiro
A Latent Variable Ranking Model for Content-Based Retrieval

Since their introduction, ranking SVM models [11] have become a powerful tool for training content-based retrieval systems. All we need for training a model are retrieval examples in the form of triplet constraints, i.e. examples specifying that, relative to some query, a database item a should be ranked higher than database item b. These types of constraints can be obtained from feedback of users of the retrieval system. Most previous ranking models learn either a global combination of elementary similarity functions or a combination defined with respect to a single database item. Instead, we propose a “coarse to fine” ranking model where, given a query, we first compute a distribution over “coarse” classes and then use the linear combination that has been optimized for queries of that class. These coarse classes are hidden and need to be induced by the training algorithm. We propose a latent variable ranking model that induces both the latent classes and the weights of the linear combination for each class from ranking triplets. Our experiments over two large image datasets and a text retrieval dataset show the advantages of our model over learning a global combination as well as a combination for each test point (i.e. the transductive setting). Furthermore, compared to the transductive approach, our model has a clear computational advantage since it does not need to be retrained for each test query.

Ariadna Quattoni, Xavier Carreras, Antonio Torralba

Text and Content Classification, Categorisation, Clustering

Language Modelling of Constraints for Text Clustering

Constrained clustering is a recently presented family of semi-supervised learning algorithms. These methods use domain information to impose constraints on the clustering output. The way in which those constraints (typically pair-wise constraints between documents) are introduced is by designing new clustering algorithms that enforce compliance with the constraints. In this paper we present an alternative approach to constrained clustering where, instead of defining new algorithms or objective functions, the constraints are introduced by modifying the document representation by means of language modelling. More precisely, the constraints are modelled using the well-known Relevance Models successfully used in other retrieval tasks such as pseudo-relevance feedback. To the best of our knowledge this is the first attempt at such an approach. The results show that the presented approach is an effective method for constrained clustering, even improving on the results of existing constrained clustering algorithms.

Javier Parapar, Álvaro Barreiro
A Framework for Unsupervised Spam Detection in Social Networking Sites

Social networking sites offer users the option to submit user spam reports for a given message, indicating this message is inappropriate. In this paper we present a framework that uses these user spam reports for spam detection. The framework is based on the HITS web link analysis framework and is instantiated in three models. The models subsequently introduce propagation between messages reported by the same user, messages authored by the same user, and messages with similar content. Each of the models can also be converted to a simple semi-supervised scheme. We test our models on data from a popular social network and compare the models to two baselines, based on message content and raw report counts. We find that our models outperform both baselines and that each of the additions (reporters, authors, and similar messages) further improves the performance of the framework.

Maarten Bosma, Edgar Meij, Wouter Weerkamp
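A plausible core of such a framework is plain HITS on the bipartite reporter-message graph, as sketched below: a message's spam score (authority) grows with reports from reliable reporters (hubs), and vice versa. The paper's three models add propagation via authors and content similarity, which this sketch omits.

```python
import math

def hits_spam(reports, iters=30):
    """reports: set of (reporter, message) pairs. Authority ~ spam likelihood
    of a message; hub ~ reliability of a reporter."""
    reporters = {r for r, _ in reports}
    messages = {m for _, m in reports}
    hub = dict.fromkeys(reporters, 1.0)
    auth = dict.fromkeys(messages, 1.0)
    for _ in range(iters):
        auth = {m: sum(hub[r] for r, mm in reports if mm == m) for m in messages}
        z = math.sqrt(sum(a * a for a in auth.values())) or 1.0
        auth = {m: a / z for m, a in auth.items()}
        hub = {r: sum(auth[m] for rr, m in reports if rr == r) for r in reporters}
        z = math.sqrt(sum(h * h for h in hub.values())) or 1.0
        hub = {r: h / z for r, h in hub.items()}
    return auth, hub

reports = {("u1", "m1"), ("u2", "m1"), ("u3", "m1"), ("u3", "m2")}
auth, hub = hits_spam(reports)
print(auth)   # m1, reported by more users, gets the higher spam score
```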
Classification of Short Texts by Deploying Topical Annotations

We propose a novel approach to the classification of short texts based on two factors: the use of Wikipedia-based annotators that have been recently introduced to detect the main topics present in an input text, represented via Wikipedia pages, and the design of a novel classification algorithm that measures the similarity between the input text and each output category by deploying only their annotated topics and the Wikipedia link-structure. Our approach waives the common practice of expanding the feature-space with new dimensions derived either from explicit or from latent semantic analysis. As a consequence it is simple and maintains a compact, intelligible representation of the output categories. Our experiments show that it is efficient in construction and query time, as accurate as state-of-the-art classifiers (see e.g. Phan et al., WWW ’08), and robust with respect to concept drifts and input sources.

Daniele Vitale, Paolo Ferragina, Ugo Scaiella
Cluster Labeling for Multilingual Scatter/Gather Using Comparable Corpora

Scatter/Gather systems are increasingly becoming useful in browsing document corpora. Usability of present-day systems is restricted to monolingual corpora, and their methods for clustering and labeling do not easily extend to the multilingual setting, especially in the absence of dictionaries/machine translation. In this paper, we study the cluster labeling problem for multilingual corpora in the absence of machine translation, but using comparable corpora. Using a variational approach, we show that multilingual topic models can effectively handle the cluster labeling problem, which in turn allows us to design a novel Scatter/Gather system, ShoBha. Experimental results on three datasets, namely the Canadian Hansards corpus, the entire overlapping Wikipedia of English, Hindi and Bengali articles, and a trilingual news corpus containing 41,000 articles, confirm the utility of the proposed system.

Goutham Tholpadi, Mrinal Kanti Das, Chiranjib Bhattacharyya, Shirish Shevade

Systems Efficiency

Adaptive Time-to-Live Strategies for Query Result Caching in Web Search Engines

An important research problem that has recently started to receive attention is the freshness issue in search engine result caches. In the current techniques in literature, the cached search result pages are associated with a fixed time-to-live (TTL) value in order to bound the staleness of search results presented to the users, potentially as part of a more complex cache refresh or invalidation mechanism. In this paper, we propose techniques where the TTL values are set in an adaptive manner, on a per-query basis. Our results show that the proposed techniques reduce the fraction of stale results served by the cache and also decrease the fraction of redundant query evaluations on the search engine backend compared to a strategy using a fixed TTL value for all queries.

Sadiye Alici, Ismail Sengor Altingovde, Rifat Ozcan, B. Barla Cambazoglu, Özgür Ulusoy
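A minimal sketch of a per-query adaptive TTL cache follows. The concrete policy here, doubling the TTL when a refresh returns unchanged results and halving it otherwise, is a hypothetical stand-in for the adaptive strategies proposed in the paper.

```python
import time

class AdaptiveTTLCache:
    """Result cache whose TTL is set per query rather than globally."""
    def __init__(self, base_ttl=3600, min_ttl=60, max_ttl=86400):
        self.base_ttl, self.min_ttl, self.max_ttl = base_ttl, min_ttl, max_ttl
        self.store = {}                    # query -> (results, expiry, ttl)

    def get(self, query):
        entry = self.store.get(query)
        if entry and time.time() < entry[1]:
            return entry[0]                # fresh hit
        return None                        # miss/stale: re-evaluate on backend

    def put(self, query, results):
        old = self.store.get(query)
        ttl = self.base_ttl
        if old is not None:
            # Stable results since the last fill -> grow TTL; changed -> shrink.
            ttl = min(old[2] * 2, self.max_ttl) if old[0] == results \
                  else max(old[2] // 2, self.min_ttl)
        self.store[query] = (results, time.time() + ttl, ttl)
```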
Intra-query Concurrent Pipelined Processing for Distributed Full-Text Retrieval

Pipelined query processing over a term-wise distributed inverted index has superior throughput at high query multiprogramming levels. However, due to long query latencies, this approach is inefficient at lower levels. In this paper we explore two types of intra-query parallelism within the pipelined approach: parallel execution of a query on different nodes and concurrent execution on the same node. According to the experimental results, our approach reaches the throughput of the state-of-the-art method at about half the latency. In the single-query case, the observed latency improvement is up to 2.6 times.

Simon Jonassen, Svein Erik Bratsberg

Industry Track

Usefulness of Sentiment Analysis

What can text sentiment analysis technology be used for, and does a more usage-informed view on sentiment analysis pose new requirements on technology development?

Jussi Karlgren, Magnus Sahlgren, Fredrik Olsson, Fredrik Espinoza, Ola Hamfors
Modeling Static Caching in Web Search Engines

In this paper we model a two-level cache of a Web search engine such that, given memory resources, we find the optimal split fraction to allocate to each cache, results and index. The final result is very simple and requires computing just five parameters that depend on the input data and the performance of the search engine. The model is validated through extensive experimental results and is motivated by capacity planning and the overall optimization of the search architecture.

Ricardo Baeza-Yates, Simon Jonassen

Posters

Integrating Interactive Visualizations in the Search Process of Digital Libraries and IR Systems

Interactive visualizations for exploration and retrieval have not yet become an integral part of digital libraries and information retrieval systems. We have integrated a set of interactive graphics into a real-world social science digital library. These visualizations support the exploration of search queries, results and authors, can filter search results, show trends in the database, and can support the creation of new search queries. The use of weighted brushing supports the identification of related metadata for search facets. In a user study we verify that users can intuitively gain insights from statistical graphics and can adopt the interaction techniques.

Daniel Hienert, Frank Sawitzki, Philipp Schaer, Philipp Mayr
On Theoretically Valid Score Distributions in Information Retrieval

In this paper, we aim to investigate the practical usefulness of the Recall-Fallout Convexity Hypothesis (RFCH) for a number of document score distribution (SD) models. We compare SD models that do not automatically adhere to the RFCH to modified versions of the same SD models that do adhere to the RFCH. We compare these models using the inference of average precision as a measure of utility. For the three models studied in this paper, we conclude that adhering to the RFCH is practically useful for the two-normal model, makes no difference for the two-gamma model, and degrades the performance of the two-lognormal model.

Ronan Cummins, Colm O’Riordan
Adaptive Temporal Query Modeling

We present an approach to query modeling that uses the temporal distribution of documents in an initially retrieved set of documents. Such distributions tend to exhibit bursts, especially in news-related document collections. We hypothesize that documents in those bursts are more likely to be relevant and update the query model with the most distinguishing terms in high-quality documents sampled from bursts. We evaluate the effectiveness of our models on a test collection of blog posts.

Maria-Hendrike Peetz, Edgar Meij, Maarten de Rijke, Wouter Weerkamp
The Design of a Visual History Tool to Help Users Refind Information within a Website

On the WWW users frequently revisit information they have previously seen, but “keeping found things found” is difficult when the information has not been visited frequently or recently, even if a user knows which website contained the information. This paper describes the design of a tool to help users refind information within a given website. The tool encodes data about a user’s interest in webpages (measured by dwell time), the frequency and recency of visits, and navigational associations between pages, and presents navigation histories in list- and graph-based forms.

Trien V. Do, Roy A. Ruddle
Analyzing the Polarity of Opinionated Queries

In this paper, we present an in-depth analysis of Web search queries for controversial topics, focusing on query sentiment. To this end, we conduct extensive user assessments as well as an automatic sentiment analysis using the SentiWordNet thesaurus.

Sergiu Chelaru, Ismail Sengor Altingovde, Stefan Siersdorfer
Semi-automatic Document Classification: Exploiting Document Difficulty

There are circumstances where classification is required only if a certain condition, such as a specific level of quality, is met. This paper investigates a semi-automatic solution where only the predictions for the documents that are most likely to be correctly classified are considered. This method provides high-quality automatic classification for large subsets of the collection and employs human expertise for the “most complicated” decisions. This research presents different approaches to measuring document difficulty and discusses the benefits of applying it to semi-automatic classification. In addition, experiments are carried out to show the results achieved for different subsets of the collection. Experiments prove that it is possible to improve quality significantly with large subsets (i.e. a 13% micro-F1 increase with 70% of documents) of two different collections. Furthermore, it shows how this provides a flexible mechanism to apply automatic classification to specific subsets while specific constraints are met.

Miguel Martinez-Alvarez, Sirvan Yahyaei, Thomas Roelleke
Investigating Summarization Techniques for Geo-Tagged Image Indexing

Images with geo-tagging information are increasingly available on the Web. However, such images need to be annotated with additional textual information if they are to be retrievable, since users do not search by geo-coordinates. We propose to automatically generate such textual information by (1) generating toponyms from the geo-tagging information, (2) retrieving Web documents using the toponyms as queries, and (3) summarizing the retrieved documents. The summaries are then used to index the images. In this paper we investigate how various summarization techniques affect image retrieval performance and show that significant improvements can be obtained when using the summaries for indexing.

Ahmet Aker, Xin Fan, Mark Sanderson, Robert Gaizauskas
Handling OOV Words in Indian-language – English CLIR

Because of the lack of resources, cross-lingual information retrieval is a difficult task for many Indian languages. Google Translate provides an easy way of translating from Indian languages to English, but due to lexicon limitations most out-of-vocabulary words get transliterated letter by letter along with their suffix, resulting in an unusually long string. The resulting string often does not match its intended translation, which hurts retrieval. We propose an approach to extract the correct word from such strings using word segmentation along with approximate string matching based on the Soundex algorithm and Levenshtein distance. We evaluate our approach across three Indian languages and find an average improvement of 5.8% MAP on the FIRE-2010 dataset.

Parin Chheda, Manaal Faruqui, Pabitra Mitra
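Both building blocks named in the abstract are standard and easy to sketch: Levenshtein edit distance and the Soundex phonetic code, combined here into a naive lexicon-matching rule. The lexicon, the OOV string, and the distance threshold are invented for illustration; the paper additionally applies word segmentation, which is omitted.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def soundex(word):
    """Standard four-character Soundex code (initial letter + three digits)."""
    codes = {}
    for group, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in group:
            codes[ch] = digit
    word = word.lower()
    out, last = word[0].upper(), codes.get(word[0], "")
    for ch in word[1:]:
        digit = codes.get(ch, "")
        if digit and digit != last:
            out += digit
        if ch not in "hw":            # h and w do not reset the previous code
            last = digit
    return (out + "000")[:4]

lexicon = ["gandhi", "ghandi", "kolkata", "calcutta"]
oov = "gandhee"                        # hypothetical transliterated OOV string
matches = [w for w in lexicon
           if soundex(w) == soundex(oov) and levenshtein(w, oov) <= 2]
print(matches)                         # ['gandhi']
```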
Using a Medical Thesaurus to Predict Query Difficulty

Estimating query performance is the task of predicting the quality of results returned by a search engine in response to a query. In this paper, we focus on pre-retrieval prediction methods for the medical domain. We propose a novel predictor that exploits a thesaurus to ascertain how difficult queries are. In our experiments, we show that our predictor outperforms the state-of-the-art methods that do not use a thesaurus.

Florian Boudin, Jian-Yun Nie, Martin Dawes
Studying a Personality Coreference Network in a News Stories Photo Collection

We build and analyze a coreference network based on entities from photo descriptions, where nodes represent personalities and edges connect people mentioned in the same photo description. We identify and characterize the communities in this network and propose taking advantage of the context provided by community detection methodologies to improve text illustration and general search.

José Devezas, Filipe Coelho, Sérgio Nunes, Cristina Ribeiro
Phrase Pair Classification for Identifying Subtopics

Automatic identification of subtopics for a given topic is desirable because it eliminates the need for manual construction of domain-specific topic hierarchies. In this paper, we design features based on corpus statistics to build a classifier for identifying the (subtopic, topic) links between phrase pairs. We combine these features with commonly-used syntactic patterns to classify phrase pairs from datasets in Computer Science and WordNet. In addition, we show a novel application of our is-a-subtopic-of classifier for query expansion in Expert Search and compare it with pseudo-relevance feedback.

Sujatha Das, Prasenjit Mitra, C. Lee Giles
Full and Mini-batch Clustering of News Articles with Star-EM

We present a new threshold-based clustering algorithm for news articles. The algorithm consists of two phases: in the first, a local optimum of a score function that captures the quality of a clustering is found with an Expectation-Maximization approach. In the second phase, the algorithm reduces the number of clusters and, in particular, is able to build non-spherical clusters. We also give a mini-batch version which allows efficient dynamic processing of data points as they arrive in groups. Our experiments on the TDT5 benchmark collection show the superiority of both versions of this algorithm compared to other state-of-the-art alternatives.

Matthias Gallé, Jean-Michel Renders
Assessing and Predicting Vertical Intent for Web Queries

Aggregating search results from a variety of heterogeneous sources, i.e. so-called verticals [1], such as news, image, video and blog, into a single interface has become a popular paradigm in web search. In this paper, we present the results of a user study that collected more than 1,500 assessments of vertical intent over 320 web topics. Firstly, we show that users prefer diverse vertical content for many queries and that the level of inter-assessor agreement for the task is fair [2]. Secondly, we propose a methodology to predict the vertical intent of a query using a search engine log by exploiting click-through data, and show that it outperforms traditional approaches.

Ke Zhou, Ronan Cummins, Martin Halvey, Mounia Lalmas, Joemon M. Jose
Predicting IMDB Movie Ratings Using Social Media

We predict IMDb movie ratings and consider two sets of features: surface and textual features. For the latter, we assume that no social media signal is isolated and use data from multiple channels that are linked to a particular movie, such as tweets from Twitter and comments from YouTube. We extract textual features from each channel to use in our prediction model, and we explore whether data from either of these channels can help extract a better set of textual features for prediction. Our best performing model is able to rate movies very close to the observed values.

Andrei Oghina, Mathias Breuss, Manos Tsagkias, Maarten de Rijke
Squeezing the Ensemble Pruning: Faster and More Accurate Categorization for News Portals

Recent studies show that ensemble pruning works as effectively as a traditional ensemble of classifiers (EoC). In this study, we analyze how ensemble pruning can improve text categorization efficiency in time-critical real-life applications such as news portals. The two most crucial phases of text categorization are training classifiers and assigning labels to new documents, and the latter is more important for the efficiency of such applications. We conduct experiments on ensemble pruning-based news article categorization to measure its accuracy and time cost. The results show that our heuristics reduce the time cost of the second phase, and that with appropriate pruning degrees we can trade off accuracy and time cost to improve both.

Cagri Toraman, Fazli Can
A General Framework for People Retrieval in Social Media with Multiple Roles

Internet users increasingly play multiple roles when connected on the Web, such as “posting”, “commenting”, “tagging” and “sharing” different kinds of information on various social media. Despite the research interest in the field of social networks, little has been done so far with respect to information access in multi-relational social networks where queries can be multifaceted (e.g. a mix of textual keywords and key persons in some social context). We propose a unified and efficient framework to address such complex queries on multi-modal “social” collections, working in three distinct phases: (I) aggregation of documents into modal profiles, (II) expansion of mono-modal subqueries into mono-modal and multi-modal subqueries, and (III) relevance score computation through late fusion of the different similarities deduced from the profiles and subqueries obtained during the first two phases. Experiments on the ENRON email collection for a recipient proposal task show that competitive results can be obtained using the proposed framework.

Amin Mantrach, Jean-Michel Renders
Analysis of Query Reformulations in a Search Engine of a Local Web Site

This study examines reformulations of queries submitted to the search engine of a university Web site, with a focus on (implicitly derived) user satisfaction and the performance of the underlying search engine. Using a search log of a university Web site, we examined all reformulations submitted over a 10-week period and studied the relation between the popularity of a reformulation and the performance of the search engine estimated using a number of clickthrough-based measures. Our findings are a step towards building better query recommendation systems and suggest a number of metrics for evaluating such systems.

M-Dyaa Albakour, Udo Kruschwitz, Nikolaos Nanas, Ibrahim Adeyanju, Dawei Song, Maria Fasli, Anne De Roeck
Temporal Pseudo-relevance Feedback in Microblog Retrieval

Twitter has become a major outlet for news, discussion and commentary of on-going events and trends. Effective searching of Twitter collections poses a number of issues for traditional document-based information retrieval (IR) approaches, such as limited document term statistics and spam. In this paper we propose a novel approach to pseudo-relevance feedback, based upon the temporal profiles of n-grams extracted from the top N relevance feedback tweets. A weighted graph is used to model temporal correlation between n-grams, with a PageRank variant employed to combine both pseudo-relevant document term distribution and temporal collection evidence. Preliminary experiments with the TREC Microblogging 2011 Twitter corpus indicate that through parameter optimisation, retrieval effectiveness can be improved.

Stewart Whiting, Iraklis A. Klampanos, Joemon M. Jose
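A minimal sketch of the graph machinery involved: weighted PageRank by power iteration over an n-gram graph. The nodes, the temporal-correlation weights, and the symmetric edge construction are hypothetical; the paper's PageRank variant and its temporal profiling differ in detail.

```python
def pagerank(edges, d=0.85, iters=50):
    """Power iteration over a weighted directed graph given as
    (src, dst, weight) triples."""
    nodes = {n for a, b, _ in edges for n in (a, b)}
    out_w = {n: sum(w for a, _, w in edges if a == n) for n in nodes}
    pr = dict.fromkeys(nodes, 1.0 / len(nodes))
    for _ in range(iters):
        nxt = dict.fromkeys(nodes, (1 - d) / len(nodes))
        for a, b, w in edges:
            nxt[b] += d * pr[a] * w / out_w[a]
        pr = nxt
    return pr

# Hypothetical temporal-correlation weights between n-grams around a query;
# edges are added in both directions so no node is dangling.
raw = [("earthquake", "tsunami", 0.9), ("earthquake", "magnitude", 0.6),
       ("tsunami", "warning", 0.8)]
edges = raw + [(b, a, w) for a, b, w in raw]
scores = pagerank(edges)
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```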
Learning Adaptive Domain Models from Click Data to Bootstrap Interactive Web Search

Today, searchers exploring the World Wide Web have come to expect enhanced search interfaces – query completion and related searches have become standard. Here we propose a Formal Concept Analysis lattice as an underlying domain model to provide a source of query refinements. The initial lattice is constructed using NLP. User clicks on documents, seen as implicit user feedback, are harnessed to adapt it. In this paper, we explore the viability of this adaptation process and the results we present demonstrate its promise and limitations for proposing initial effective refinements when searching the diverse WWW domain.

Deirdre Lungley, Udo Kruschwitz, Dawei Song
A Little Interaction Can Go a Long Way: Enriching the Query Formulation Process

This poster argues for the need for more dialogue and richer information and interaction between the user and the system during query formulation. We present two novel methods, query previews and categorised Interactive Query Expansions, that seek to do just this. Our method enriches a searcher’s query formulation by leveraging semantic information to help identify the topicality of a term and the outcomes of its selection. The initial findings are largely positive and suggest a user preference for these methods.

Abdigani Diriye, Anastasios Tombros, Ann Blandford
Learning to Rank from Relevance Feedback for e-Discovery

In recall-oriented search tasks retrieval systems are privy to a greater amount of user feedback. In this paper we present a novel method of combining relevance feedback with learning to rank. Our experiments use data from the 2010 TREC Legal track to demonstrate that learning to rank can tune relevance feedback to improve result rankings for specific queries, even with limited amounts of user feedback.

Peter Lubell-Doughtie, Katja Hofmann
When Simple is (more than) Good Enough: Effective Semantic Search with (almost) no Semantics

Using keyword queries to find entities has emerged as one of the major search types on the Web. In this paper, we study the task of ad-hoc entity retrieval: keyword search in a collection of structured data. We start with a baseline retrieval system that constructs pseudo documents from RDF triples and introduce three extensions: preprocessing of URIs, using two-fielded retrieval models, and boosting popular domains. Using the query sets of the 2010 and 2011 Semantic Search Challenge, we show that our straightforward approach outperforms all previously reported results, some generated by far more complex systems.
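A rough sketch of the kind of pipeline described, with all predicate choices and helper names being our assumptions rather than the authors' implementation: URIs are reduced to tokenised local names, and each entity's triples are folded into a two-fielded pseudo document.

```python
import re
from collections import defaultdict

NAME_PREDICATES = {"label", "name", "title"}      # illustrative choice

def uri_to_terms(uri):
    """Keep a URI's local name and split it into lower-cased words."""
    local = re.split(r"[/#]", uri.rstrip("/"))[-1]
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", local)   # split camelCase
    return [t.lower() for t in re.findall(r"[A-Za-z]+", spaced)]

def build_pseudo_docs(triples):
    """Fold (subject, predicate, object) triples into two-fielded documents."""
    docs = defaultdict(lambda: {"title": [], "content": []})
    for subj, pred, obj in triples:
        p_terms = uri_to_terms(pred)
        field = "title" if p_terms and p_terms[-1] in NAME_PREDICATES else "content"
        terms = uri_to_terms(obj) if obj.startswith("http") else obj.lower().split()
        docs[subj][field].extend(terms)
    return docs
```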

Robert Neumayer, Krisztian Balog, Kjetil Nørvåg
Evaluating Personal Information Retrieval

Evaluation of personal search over an individual's personal information space, on the desktop or elsewhere, is problematic for reasons relating both to the personal and private nature of the data and to the associated personal information needs of collection owners. Indeed, the challenges associated with evaluation in this space are recognised as one of the key factors hindering the development of research in personal information retrieval. We present the Personal Information Retrieval Evaluation (PIRE) tool, which addresses this evaluation problem using a 'living laboratory' approach. The tool allows retrieval techniques to be evaluated on real individuals' personal collections, queries and result sets in a cross-comparable, repeatable way, while maintaining each individual's informational privacy.

Liadh Kelly, Paul Bunbury, Gareth J. F. Jones
Applying Power Graph Analysis to Weighted Graphs

We extended Power Graph Analysis for use with weighted graphs and applied the technique to document categorisation, with promising results. With the additional weight information we were able to create more accurate representations of the underlying data while maintaining a high level of edge reduction and improving the visualisation of the graph.

Niels Bloom
An Investigation of Term Weighting Approaches for Microblog Retrieval

The use of effective term frequency weighting and document length normalisation strategies has been shown, over a number of decades, to have a significant positive effect on document retrieval. When dealing with much shorter documents, such as those obtained from microblogs, it would seem intuitive that these strategies would be less beneficial. In this paper we investigate their effect on microblog retrieval performance using the Tweets2011 collection from the TREC 2011 Microblog track.
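For reference, the class of weighting scheme under investigation can be illustrated with a standard BM25 scorer, in which k1 saturates term frequency and b controls document length normalisation; this is textbook BM25, not the paper's exact experimental setup.

```python
import math

def bm25_score(query_terms, doc_terms, df, n_docs, avg_len, k1=1.2, b=0.75):
    """df: document frequency per term; b=0 disables length normalisation,
    which is the kind of setting one might probe for very short documents."""
    score, dl = 0.0, len(doc_terms)
    for term in query_terms:
        tf = doc_terms.count(term)
        if tf == 0 or term not in df:
            continue
        idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avg_len))
    return score
```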

Paul Ferguson, Neil O’Hare, James Lanagan, Owen Phelan, Kevin McCarthy
On the Size of Full Element-Indexes for XML Keyword Search

We show that a full element-index can be as space-efficient as a direct index with Dewey IDs after compression using typical techniques.

Duygu Atilgan, Ismail Sengor Altingovde, Özgür Ulusoy
Combining Probabilistic Language Models for Aspect-Based Sentiment Retrieval

In this paper, we present a new method for retrieving relevant product aspects from a collection of customer reviews, together with the most salient sentiments expressed about them. Our proposal is both unsupervised and domain independent, and does not rely on NLP techniques such as parsing or dependency analysis. In our experiments the proposed method achieves good precision, and we show that it can properly retrieve relevant aspects and their sentiments even from individual reviews.
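As a loose illustration of comparing probabilistic language models for this task (the paper's actual model combination is more elaborate), one common unsupervised ingredient ranks candidate aspect terms by how strongly a review-collection language model diverges from a background model:

```python
import math
from collections import Counter

def aspect_scores(review_tokens, background_tokens, mu=2000):
    """Rank terms by pointwise KL of a smoothed review LM vs. a background LM."""
    rev, bg = Counter(review_tokens), Counter(background_tokens)
    n_rev, n_bg = sum(rev.values()), sum(bg.values())
    scores = {}
    for term, tf in rev.items():
        p_rev = (tf + mu * bg[term] / n_bg) / (n_rev + mu)  # Dirichlet smoothing
        p_bg = (bg[term] + 1) / (n_bg + len(bg))            # add-one smoothing
        scores[term] = p_rev * math.log(p_rev / p_bg)       # pointwise KL term
    return sorted(scores.items(), key=lambda kv: -kv[1])
```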

Lisette García-Moya, Henry Anaya-Sánchez, Rafael Berlanga-Llavori
In Praise of Laziness: A Lazy Strategy for Web Information Extraction

A large number of Web information extraction algorithms are based on machine learning techniques. For such algorithms, we propose a lazy learning strategy that builds a specialised model for each test instance, improving extraction accuracy and avoiding the disadvantages of constructing a single general model.
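A generic sketch of lazy learning in this spirit, not the paper's algorithm: rather than training one global classifier, a small local model is fitted on the training instances nearest each test instance. It assumes numpy arrays and scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def lazy_predict(X_train, y_train, x_test, k=50):
    """Fit a specialised model on the neighbourhood of one test instance."""
    k = min(k, len(X_train))
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    _, idx = nn.kneighbors(x_test.reshape(1, -1))
    X_loc, y_loc = X_train[idx[0]], y_train[idx[0]]
    if len(set(y_loc)) == 1:             # degenerate neighbourhood
        return y_loc[0]
    local = LogisticRegression(max_iter=1000).fit(X_loc, y_loc)
    return local.predict(x_test.reshape(1, -1))[0]
```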

Rifat Ozcan, Ismail Sengor Altingovde, Özgür Ulusoy

Demos

LiveTweet: Monitoring and Predicting Interesting Microblog Posts

This paper describes LiveTweet, an application for automatically analysing and predicting the interestingness of microblog posts. Working on a stream of recent microblog posts, the system tracks user interactions on Twitter that indicate interesting content. An incremental Naive Bayes model is trained to learn the characteristics of tweets that users consider interesting. Finally, the probability that a microblog post will be retweeted is used as a metric for its interestingness.
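A toy version of the incremental ingredient (class labels and helper names are illustrative, not LiveTweet's code): a multinomial Naive Bayes whose counts are updated per incoming tweet, with the retweet probability serving as the interestingness score.

```python
import math
from collections import Counter, defaultdict

class IncrementalNB:
    """Binary multinomial Naive Bayes trained one tweet at a time."""

    def __init__(self):
        self.class_counts = Counter()            # "retweeted" / "not_retweeted"
        self.term_counts = defaultdict(Counter)  # per-class term counts
        self.vocab = set()

    def update(self, tokens, label):
        self.class_counts[label] += 1
        self.term_counts[label].update(tokens)
        self.vocab.update(tokens)

    def _log_prob(self, tokens, label):
        n = sum(self.class_counts.values())
        lp = math.log((self.class_counts[label] + 1) / (n + 2))  # binary prior
        total = sum(self.term_counts[label].values())
        denom = total + len(self.vocab) + 1                      # safe add-one
        for t in tokens:
            lp += math.log((self.term_counts[label][t] + 1) / denom)
        return lp

    def p_retweet(self, tokens):
        """Retweet probability, used as the interestingness score."""
        la = self._log_prob(tokens, "retweeted")
        lb = self._log_prob(tokens, "not_retweeted")
        m = max(la, lb)                                          # log-sum-exp
        return math.exp(la - m) / (math.exp(la - m) + math.exp(lb - m))
```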

Arifah Che Alhadi, Thomas Gottron, Jérôme Kunegis, Nasir Naveed
A User Interface for Query-by-Sketch Based Image Retrieval with Color Sketches

This demo interactively shows a system, running on Tablet PCs or graphics tablets, that provides query-by-sketch image retrieval using color sketches through a novel user interface. The system uses Angular Radial Partitioning (ARP) for the edge information in the sketches and color moments in the CIELAB space, combined with a distance metric that is robust to the color deviations that must be expected in user-generated sketches.
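The color-moment half of such a descriptor can be sketched as follows (our illustration, assuming scikit-image for the RGB-to-CIELAB conversion; the demo's actual features and distance are more refined): mean, standard deviation and skewness per Lab channel, compared with a weighted L1 distance.

```python
import numpy as np
from skimage.color import rgb2lab   # assumes scikit-image is installed

def color_moments(rgb_image):
    """rgb_image: float array in [0, 1] of shape (H, W, 3); returns a
    9-dimensional descriptor: mean, std and skewness per CIELAB channel."""
    lab = rgb2lab(rgb_image).reshape(-1, 3)
    mean = lab.mean(axis=0)
    std = lab.std(axis=0)
    skew = np.cbrt(((lab - mean) ** 3).mean(axis=0))  # cbrt keeps the sign
    return np.concatenate([mean, std, skew])

def sketch_distance(f1, f2, weights=None):
    """Weighted L1 distance; down-weighting color moments tolerates the
    color deviations typical of hand-drawn sketches."""
    weights = np.ones_like(f1) if weights is None else weights
    return float(np.sum(weights * np.abs(f1 - f2)))
```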

Ivan Giangreco, Michael Springmann, Ihab Al Kabary, Heiko Schuldt
Crisees: Real-Time Monitoring of Social Media Streams to Support Crisis Management

The Crisees demonstrator is a service that aggregates and collects social media streams to support crisis management.

David Maxwell, Stefan Raue, Leif Azzopardi, Chris Johnson, Sarah Oates
A Mailbox Search Engine Using Query Multi-modal Expansion and Community-Based Smoothing

This demo introduces a new tool (or plug-in) for any email client that automatically decomposes a (personal or shared) mailbox into new virtual folders corresponding to topics and communities, in an unsupervised way, to lighten the end-user's load. The software implements a retrieval system in which the user can search for emails, but also for people, by submitting a double-faceted query of "key words" and "key persons". It is able to retrieve three kinds of documents that a purely match-based search system would not. First, using person profiles, the software ranks documents related to the key persons without requiring them to be participants (i.e. authors or recipients). Second, it retrieves documents that share the same topics as the key words but do not necessarily contain them. Third, it also retrieves other participants who are members of the communities associated with the key persons.

Amin Mantrach, Jean-Michel Renders
EmSe: Supporting Children’s Information Needs within a Hospital Environment

The Emma Search (EmSe) demonstrator developed for the Emma Children’s Hospital showcases the PuppyIR project and PuppyIR framework for building information services for children.

Leif Azzopardi, Doug Dowie, Sergio Duarte, Carsten Eickhoff, Richard Glassey, Karl Gyllstrom, Djoerd Hiemstra, Franciska de Jong, Frea Kruisinga, Kelly Marshall, Sien Moens, Tamara Polajnar, Frans van der Sluis, Arjen de Vries
Retro: Time-Based Exploration of Product Reviews

Most e-commerce websites organize and present product reviews around ratings, with hardly any features for viewing them in a time-oriented way. Often reviews can be sorted by time, but no further temporal analysis is possible. As a result, usually only a few reviews enter a user's review analysis process, and there is no way to analyze all reviews of a product collectively. In this paper we describe Retro, a search engine for exploring product reviews using temporal information.

Jannik Strötgen, Omar Alonso, Michael Gertz
Querium: A Session-Based Collaborative Search System

People's information seeking can span multiple sessions and can be collaborative in nature. Existing commercial offerings do not effectively support searchers in saving, sharing, revisiting or collaborating over their information. In this demo paper we present Querium, a novel session-based collaborative search system that lets users search, share, resume and collaborate with other users. Querium provides a number of novel search features in a collaborative setting, including relevance feedback, query fusion, faceted search, and search histories.

Abdigani Diriye, Gene Golovchinsky
Backmatter
Metadata
Title
Advances in Information Retrieval
Editors
Ricardo Baeza-Yates
Arjen P. de Vries
Hugo Zaragoza
B. Barla Cambazoglu
Vanessa Murdock
Ronny Lempel
Fabrizio Silvestri
Copyright Year
2012
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-28997-2
Print ISBN
978-3-642-28996-5
DOI
https://doi.org/10.1007/978-3-642-28997-2