2009 | Book

Information Retrieval Technology

5th Asia Information Retrieval Symposium, AIRS 2009, Sapporo, Japan, October 21-23, 2009. Proceedings

Editors: Gary Geunbae Lee, Dawei Song, Chin-Yew Lin, Akiko Aizawa, Kazuko Kuriyama, Masaharu Yoshioka, Tetsuya Sakai

Publisher: Springer Berlin Heidelberg

Book Series: Lecture Notes in Computer Science

About this book

This book constitutes the refereed proceedings of the 5th Asia Information Retrieval Symposium, AIRS 2009, held in Sapporo, Japan, in October 2009. The 18 revised full papers and 20 revised poster papers presented were carefully reviewed and selected from 82 submissions. All current aspects of information retrieval, in theory and practice, are addressed, covering work with text, audio, image, video and multimedia data.

Table of Contents

Frontmatter

Regular Papers

Fully Automatic Text Categorization by Exploiting WordNet

This paper proposes a Fully Automatic Categorization approach for Text (FACT) that exploits semantic features from WordNet and document clustering. In FACT, the training data is constructed automatically using knowledge of the category name. With the support of WordNet, FACT first uses the category name to generate a set of features for the corresponding category. Then, a set of documents is labeled according to these features. To reduce possible bias originating from the category name and the generated features, document clustering is used to refine the quality of the initial labeling. The training data are subsequently used to train the discriminative classifier. Empirical experiments show that, at its best, FACT achieves more than 90% of the F1 measure of baseline SVM classifiers, which demonstrates the effectiveness of the proposed approach.

Jianqiang Li, Yu Zhao, Bo Liu
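
As an illustrative sketch only (not the paper's exact method): FACT's first step, expanding a category name into seed features via WordNet, might look as follows using NLTK's WordNet interface. The synonym-plus-hyponym expansion rule is an assumption for illustration.

```python
# Hypothetical sketch of category-name expansion with WordNet via NLTK.
# The synonym+hyponym rule is an assumption, not FACT's exact method.
# Requires: nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def category_features(category_name):
    features = {category_name.lower()}
    for synset in wn.synsets(category_name):
        # synonyms from each sense of the category name
        features.update(l.name().replace('_', ' ') for l in synset.lemmas())
        # direct hyponyms (more specific terms) of each sense
        for hyp in synset.hyponyms():
            features.update(l.name().replace('_', ' ') for l in hyp.lemmas())
    return features

print(sorted(category_features('sport'))[:10])
```
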
A Latent Dirichlet Framework for Relevance Modeling

Relevance-based language models operate by estimating the probabilities of observing words in documents relevant (or pseudo-relevant) to a topic. However, these models assume that if a document is relevant to a topic, then all tokens in the document are relevant to that topic. This can limit model robustness and effectiveness. In this study, we propose a Latent Dirichlet relevance model, which relaxes this assumption. Our approach derives from current research on Latent Dirichlet Allocation (LDA) topic models. LDA has been extensively explored, especially for discovering a set of topics from a corpus. LDA itself, however, has a limitation that is also addressed in our work: topics generated by LDA from a corpus are synthetic, i.e., they do not necessarily correspond to topics identified by humans for the same corpus. In contrast, our model explicitly considers the relevance relationships between documents and given topics (queries). Thus, unlike standard LDA, our model is directly applicable to goals such as relevance feedback for query modification and text classification, where topics (classes and queries) are provided upfront. Although the focus of our paper is on improving relevance-based language models, our approach in effect bridges relevance-based language models and LDA, addressing limitations of both.

Viet Ha-Thuc, Padmini Srinivasan
Assigning Location Information to Display Individuals on a Map for Web People Search Results

Distinguishing people with identical names is becoming more and more important in Web search. This research aims to display person icons on a map to help users select person clusters that have been separated into different people from the results of person searches on the Web. We propose a method that assigns each person cluster a single piece of location information. Our method comprises two processes: (a) extracting location candidates from Web pages and (b) assigning location information using a local search engine. Our main idea is to exploit search engine rankings and character distance to select good location information from among the location candidates. Experimental results revealed the usefulness of our proposed method. We also show a developed prototype system.

Harumi Murakami, Yuya Takamori, Hiroshi Ueda, Shoji Tatsumi
Web Spam Identification with User Browsing Graph

Combating Web spam has become one of the top challenges for Web search engines. Most previous research on link-based Web spam identification focuses on exploiting hyperlink graphs and corresponding user-behavior models. However, the fact that hyperlinks can easily be added and removed by Web spammers makes the hyperlink graph unreliable. We construct a user browsing graph based on users' Web access logs and adopt link analysis algorithms on this graph to identify Web spam pages. The constructed graph is much smaller than the original Web graph, and link analysis algorithms can perform efficiently on it. Comparative experimental results also show that algorithms performed on the constructed graph outperform those on the original graph.

Huijia Yu, Yiqun Liu, Min Zhang, Liyun Ru, Shaoping Ma
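
A minimal sketch of the underlying idea, with toy data: build a directed graph from access-log transitions and run a standard link-analysis algorithm on it (here NetworkX's PageRank; the paper's exact algorithm and spam-scoring features are not reproduced).

```python
# Hedged sketch: PageRank over a user browsing graph built from log transitions.
# Edge data and the low-score-as-spam-signal reading are illustrative assumptions.
import networkx as nx

# Each record: a user followed a browsing transition from page u to page v.
transitions = [("a.com", "b.com"), ("b.com", "c.com"),
               ("a.com", "c.com"), ("c.com", "a.com")]

G = nx.DiGraph()
for u, v in transitions:
    if G.has_edge(u, v):
        G[u][v]["weight"] += 1          # repeated transitions strengthen the edge
    else:
        G.add_edge(u, v, weight=1)

scores = nx.pagerank(G, weight="weight")  # low scores can flag spam candidates
print(sorted(scores.items(), key=lambda kv: kv[1]))
```
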
Metric and Relevance Mismatch in Retrieval Evaluation

Recent investigations of search performance have shown that, even when presented with two systems that are superior and inferior based on a Cranfield-style batch experiment, real users may perform equally well with either system. In this paper, we explore how these evaluation paradigms may be reconciled. First, we investigate the DCG@1 and P@1 metrics, and their relationship with user performance on a common web search task. Our results show that batch experiment predictions based on P@1 or DCG@1 translate directly to user search effectiveness. However, marginally relevant documents are not strongly differentiable from non-relevant documents. Therefore, when folding multiple relevance levels into a binary scale, marginally relevant documents should be grouped with non-relevant documents, rather than with highly relevant documents, as is currently done in standard IR evaluations.

We then investigate relevance mismatch, classifying users based on relevance profiles, the likelihood with which they will judge documents of different relevance levels to be useful. When relevance profiles can be estimated well, this classification scheme can offer further insight into the transferability of batch results to real user search tasks.

Falk Scholer, Andrew Turpin
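
For reference, the two batch metrics compared in the paper, computed on a toy graded ranking. Folding marginal relevance (level 1) into the non-relevant class follows the paper's recommendation; the gain and discount functions are the common textbook choices rather than necessarily the authors'.

```python
# Toy illustration of P@1 and DCG@1 over graded relevance judgments
# (0 = non-relevant; levels below the threshold count as non-relevant).
import math

def precision_at_1(graded_relevance, relevance_threshold=2):
    return 1.0 if graded_relevance[0] >= relevance_threshold else 0.0

def dcg_at_k(graded_relevance, k=1):
    """Standard DCG with gain 2^rel - 1 and log2 rank discount."""
    return sum((2 ** rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(graded_relevance[:k]))

ranking = [1, 3, 0, 2]          # graded judgments of the top-4 results
print(precision_at_1(ranking))  # 0.0: a marginal doc at rank 1 does not count
print(dcg_at_k(ranking, k=1))   # 1.0
```
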
Test Collection-Based IR Evaluation Needs Extension toward Sessions – A Case of Extremely Short Queries

There is overwhelming evidence suggesting that real users of IR systems often prefer extremely short queries (one or two individual words) but try out several queries if needed. Such behavior is fundamentally different from the process modeled in traditional test collection-based IR evaluation, which is based on more verbose queries and only one query per topic. In the present paper, we propose an extension to test collection-based evaluation: we utilize sequences of short queries based on empirically grounded but idealized session strategies. We employ TREC data and have test persons suggest search words, while simulating sessions based on the idealized strategies for repeatability and control. The experimental results show that, surprisingly, web-like very short queries (including one-word query sequences) typically lead to good enough results even in a TREC type test collection. This finding motivates the observed real user behavior: as a few very simple attempts normally lead to good enough results, there is no need to invest more effort. We conclude by discussing the consequences of our finding for IR evaluation.

Heikki Keskustalo, Kalervo Järvelin, Ari Pirkola, Tarun Sharma, Marianne Lykke
Weighted Rank Correlation in Information Retrieval Evaluation

In Information Retrieval (IR), it is common practice to compare the rankings observed during an experiment; the statistical procedure for comparing rankings is called rank correlation. Rank correlation helps decide the success of new systems, models and techniques. The most widely used coefficient for measuring rank correlation is Kendall's τ. However, in IR, when computing correlations, the most relevant, useful or interesting items should often be considered more important than the least important items. Despite its simplicity and widespread use, Kendall's τ does little to discriminate items by importance. To overcome this drawback, this paper introduces a family τ* of rank correlation coefficients for IR that discriminates rank correlation according to the rank of the items. The basis is provided by the notion of gain previously utilized in retrieval effectiveness measurement. The probability distribution for τ* is also provided.

Massimo Melucci
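
For reference, plain Kendall's τ on toy rankings, plus a simple top-weighted variant in which each pair is weighted by a rank-based gain. The particular weighting shown is an illustrative assumption, not Melucci's τ* definition.

```python
# Kendall's tau and a hypothetical gain-weighted variant (toy sketch).
from itertools import combinations

def kendall_tau(r1, r2):
    """r1, r2: lists giving the rank of each item under two rankings."""
    concordant = discordant = 0
    for i, j in combinations(range(len(r1)), 2):
        s = (r1[i] - r1[j]) * (r2[i] - r2[j])
        if s > 0: concordant += 1
        elif s < 0: discordant += 1
    n = len(r1)
    return (concordant - discordant) / (n * (n - 1) / 2)

def weighted_tau(r1, r2, gain=lambda rank: 1.0 / (1 + rank)):
    """Pairs involving highly ranked items (under r1) count more."""
    num = den = 0.0
    for i, j in combinations(range(len(r1)), 2):
        w = gain(min(r1[i], r1[j]))        # weight by the better of the two ranks
        s = (r1[i] - r1[j]) * (r2[i] - r2[j])
        num += w if s > 0 else -w if s < 0 else 0.0
        den += w
    return num / den

print(kendall_tau([0, 1, 2, 3], [0, 2, 1, 3]))   # 0.667
print(weighted_tau([0, 1, 2, 3], [0, 2, 1, 3]))  # swap near the top costs more
```
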
Extractive Summarization Based on Event Term Temporal Relation Graph and Critical Chain

In this paper, we investigate whether temporal relations among event terms can help improve event-based extractive summarization and the text cohesion of machine-generated summaries. Using the verb semantic relation happens-before provided by VerbOcean, we construct an event term temporal relation graph for source documents. We assume that the maximal weakly connected component on this graph represents the main topic of the source documents. The event terms in the temporal critical chain identified from the maximal weakly connected component are then used to calculate the significance of the sentences in the source documents. The most significant sentences are included in the final summaries. Experiments conducted on the DUC 2001 corpus show that extractive summarization based on the event term temporal relation graph and critical chain is able to organize final summaries in a more coherent way and accordingly achieves encouraging improvement over the well-known tf*idf-based and PageRank-based approaches.

Maofu Liu, Wenjie Li, Huijun Hu
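
A toy sketch of the "main topic" step: build a directed graph from happens-before pairs (standing in for VerbOcean output, whose format is assumed here) and take the maximal weakly connected component.

```python
# Maximal weakly connected component over event-term relations (toy data).
import networkx as nx

happens_before = [("attack", "injure"), ("injure", "hospitalize"),
                  ("investigate", "arrest"), ("attack", "investigate"),
                  ("rain", "flood")]          # unrelated side topic

G = nx.DiGraph(happens_before)
main_topic_terms = max(nx.weakly_connected_components(G), key=len)
print(main_topic_terms)   # event terms assumed to represent the main topic
```
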
Using an Information Quality Framework to Evaluate the Quality of Product Reviews

The prevalence of Web 2.0 makes the Web an invaluable source of information. For instance, product reviews composed collaboratively by many independent Internet reviewers can help consumers make purchase decisions and enable manufacturers to improve their business strategies. As the number of reviews increases exponentially, opinion mining is needed to identify important reviews and opinions for users. Most opinion mining approaches try to extract sentimental or bipolar expressions from a large volume of reviews. However, the mining process often ignores the quality of each review and may retrieve useless or even noisy reviews. In this paper, we propose a method for evaluating the quality of information in product reviews. We treat review quality evaluation as a classification problem and employ an effective information quality framework to extract representative review features. Experiments based on an expert-composed data corpus demonstrate that the proposed method significantly outperforms state-of-the-art approaches.

You-De Tseng, Chien Chin Chen
Automatic Extraction for Product Feature Words from Comments on the Web

Before deciding to buy a product, many people consult others' opinions on it. The Web provides a perfect platform for finding out the advantages and disadvantages of a product of interest. How to automatically manage the numerous opinionated documents and give suggestions to potential customers has recently become a research hotspot. Constructing a sentiment resource is one of the vital elements of opinion finding and polarity analysis tasks. For a specific domain, the sentiment resource can be regarded as a dictionary, which contains a list of product feature words and several opinion words with sentiment polarity for each feature word. This paper proposes an automatic algorithm to extract feature words and opinion words for the sentiment resource. We mine the feature words and opinion words from comments on the Web with both NLP techniques and statistical methods. Left context entropy is proposed to extract unknown feature words; adjective rules and a background corpus are also taken into consideration in the algorithm. Experimental results show the effectiveness of the proposed automatic sentiment resource construction approach. The proposed method, which combines NLP and statistical techniques, is better than using the NLP-based technique alone. Although the experiments are built on mobile telephone comments in Chinese, the algorithm is domain independent.

Zhichao Li, Min Zhang, Shaoping Ma, Bo Zhou, Yu Sun
Image Sense Classification in Text-Based Image Retrieval

An image sense is a graphic representation of a concept denoted by a (set of) term(s). This paper proposes algorithms to find image senses for a concept, collect the sense descriptions, and employ them to disambiguate the image senses in text-based image retrieval. In experiments on 10 ambiguous terms, 97.12% of the image senses returned by a search engine are covered. The average precision of sample images is 68.26%. We propose four kinds of classifiers using text, image, URL, and expanded text features, respectively, and a merge strategy to combine the results of these classifiers. The merge classifier achieves 0.3974 in F-measure (β=0.5), which is much better than the baseline and reaches 51.61% of human performance.

Yih-Chen Chang, Hsin-Hsi Chen
A Subword Normalized Cut Approach to Automatic Story Segmentation of Chinese Broadcast News

This paper presents a subword normalized cut (N-cut) approach to automatic story segmentation of Chinese broadcast news (BN). We represent a speech recognition transcript using a weighted undirected graph, where the nodes correspond to sentences and the weights of edges describe inter-sentence similarities. Story segmentation is formalized as a graph-partitioning problem under the N-cut criterion, which simultaneously minimizes the similarity across different partitions and maximizes the similarity within each partition. We measure inter-sentence similarities and perform N-cut segmentation on the character/syllable (i.e. subword units) overlapping n-gram sequences. Our method works at the subword levels because subword matching is robust to speech recognition errors and out-of-vocabulary words. Experiments on the TDT2 Mandarin BN corpus show that syllable-bigram-based N-cut achieves the best F1-measure of 0.6911 with relative improvement of 11.52% over previous word-based N-cut that has an F1-measure of 0.6197. N-cut at the subword levels is more effective than the word level for story segmentation of noisy Chinese BN transcripts.

Jin Zhang, Lei Xie, Wei Feng, Yanning Zhang
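
To make the criterion concrete: for a 2-way split of the sentence set V into segments A and B, Ncut(A, B) = cut(A, B)/assoc(A, V) + cut(A, B)/assoc(B, V). The sketch below evaluates it over candidate story boundaries on a toy inter-sentence similarity matrix; the matrix values and linear boundary search are illustrative assumptions, not the paper's full graph-partitioning setup.

```python
# Toy evaluation of the normalized-cut criterion over candidate boundaries.
import numpy as np

def ncut(W, boundary):
    """W: symmetric similarity matrix over sentences; split at `boundary`."""
    A = np.arange(boundary)               # sentences before the boundary
    B = np.arange(boundary, W.shape[0])   # sentences after the boundary
    cut = W[np.ix_(A, B)].sum()
    assoc_A = W[A, :].sum()               # total connection of A to all nodes
    assoc_B = W[B, :].sum()
    return cut / assoc_A + cut / assoc_B

W = np.array([[1.0, 0.8, 0.1, 0.0],
              [0.8, 1.0, 0.2, 0.1],
              [0.1, 0.2, 1.0, 0.9],
              [0.0, 0.1, 0.9, 1.0]])
best = min(range(1, 4), key=lambda b: ncut(W, b))
print(best, ncut(W, best))   # boundary 2 gives the lowest N-cut
```
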
Japanese Spontaneous Spoken Document Retrieval Using NMF-Based Topic Models

In this paper, we propose a document topic model (DTM) based on the non-negative matrix factorization (NMF) approach to explore Japanese spontaneous spoken document retrieval. Each document is interpreted as a generative topic model belonging to many topics. The relevance of a document to a query is expressed by the probability of a query word being generated by the model. Unlike the conventional vector space model, where the matching between query and document is at the word level, the topic model completes its matching at the concept or semantic level, so the problem of term mismatch in information retrieval can be alleviated; that is, relevant documents can be retrieved even if the query words do not appear in them. The method also benefits the retrieval of spoken documents containing "term misrecognitions", which are peculiar to speech transcripts. Experiments with this approach are conducted on a test collection from the Corpus of Spontaneous Japanese (CSJ), where some of the evaluation queries and answer references are suited to retrieval at the semantic level. The retrieval performance improves as the number of topics increases, and when the topic number exceeds a threshold, the NMF's retrieval performance surpasses the tf-idf-based vector space model (VSM). Furthermore, compared to the VSM-based method, the NMF-based topic model also shows its strength in dealing with term mismatch and term misrecognition.

Xinhui Hu, Hideki Kashioka, Ryosuke Isotani, Satoshi Nakamura
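
A generic NMF sketch with Lee-Seung multiplicative updates, factorizing a term-document matrix V ≈ WH (W: term-topic, H: topic-document). The paper's DTM builds a probabilistic retrieval model on top of such a factorization, which is not reproduced here.

```python
# Minimal NMF via multiplicative updates (generic sketch, not the paper's DTM).
import numpy as np

def nmf(V, n_topics, n_iter=200, eps=1e-9):
    rng = np.random.default_rng(0)
    n_terms, n_docs = V.shape
    W = rng.random((n_terms, n_topics))
    H = rng.random((n_topics, n_docs))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update topic-document weights
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update term-topic weights
    return W, H

V = np.array([[3., 0., 1.], [2., 0., 0.], [0., 4., 1.], [0., 3., 2.]])
W, H = nmf(V, n_topics=2)
print(np.round(W @ H, 1))   # should approximate V
```
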
Finding ‘Lucy in Disguise’: The Misheard Lyric Matching Problem

We investigated methods for music information retrieval systems where the search term is a portion of a misheard lyric. Lyric data presents its own unique challenges, different from related problems such as name search. We compared three techniques, each configured for local rather than global matching: edit distance, Editex, and SAPS-L, a technique derived from Syllable Alignment Pattern Searching. Each technique was selected based on its effectiveness at approximate pattern matching in related fields. Local edit distance and Editex performed comparably as evaluated with mean average precision and mean reciprocal rank; SAPS-L's effectiveness varied between measures.

Nicholas Ring, Alexandra L. Uitdenbogerd
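
A sketch of what "local rather than global" matching means for edit-distance-style scoring: a Smith-Waterman-style score over the best-matching substring pair, so a short misheard fragment can match inside a long lyric line. The scoring parameters are illustrative assumptions.

```python
# Local alignment score (Smith-Waterman style) between a query and a lyric.
def local_alignment_score(query, lyric, match=1, mismatch=-1, gap=-1):
    m, n = len(query), len(lyric)
    prev = [0] * (n + 1)
    best = 0
    for i in range(1, m + 1):
        curr = [0] * (n + 1)
        for j in range(1, n + 1):
            s = match if query[i - 1] == lyric[j - 1] else mismatch
            # local alignment never drops below zero: restart the match instead
            curr[j] = max(0, prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

print(local_alignment_score("lucy in disguise", "lucy in the sky with diamonds"))
```
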
Selecting Effective Terms for Query Formulation

It is difficult for users to formulate appropriate queries for search. In this paper, we propose an approach to query term selection that measures the effectiveness of a query term in IR systems based on its linguistic and statistical properties in document collections. Two query formulation algorithms are presented for improving IR performance. Experiments on NTCIR-4 and NTCIR-5 ad-hoc IR tasks demonstrate that the algorithms can significantly improve retrieval performance, by 9.2% on average, compared to the performance of the original queries given in the benchmarks.

Chia-Jung Lee, Yi-Chun Lin, Ruey-Cheng Chen, Pu-Jen Cheng
Discovering Volatile Events in Your Neighborhood: Local-Area Topic Extraction from Blog Entries

This paper presents a method for the detection of occasional or volatile local events using topic extraction technologies. This is a new application of topic extraction technologies that has not been addressed in general location-based services. A two-level hierarchical clustering method was applied to topics and their transitions using time-series blog entries collected with search queries including place names. According to experiments using 764 events from 37 locations in Tokyo and its vicinity, our method achieved 77.0% event findability. It was found that the number of blog entries in urban areas was sufficient for the extraction of topics, and the proposed method could extract typical volatile events, such as performances of music groups, and places of interest, such as popular restaurants.

Masayuki Okamoto, Masaaki Kikuchi
A Unified Graph-Based Iterative Reinforcement Approach to Personalized Search

General information retrieval systems do not perform well in satisfying users' individual information needs. This paper proposes a novel graph-based approach built on three kinds of mutual reinforcement relationships: the RR-relationship (among search results), the RT-relationship (between search results and terms), and the TT-relationship (among terms). Moreover, implicit feedback information, such as query logs and immediately viewed documents, can be utilized by this graph-based model. Our approach produces better ranking results and a better query model mutually and iteratively. A greedy algorithm concerning the diversity of the search results is then employed to select the recommended results. Based on this approach, we develop an intelligent client-side Web search agent, GBAIR, and Web search experiments show that the new approach can improve search accuracy over another personalized Web search agent.

Yunping Huang, Le Sun, Zhe Wang
Exploiting Sentence-Level Features for Near-Duplicate Document Detection

Digital documents are easy to copy. How to effectively detect possible near-duplicate copies is critical in Web search. Conventional copy detection approaches such as document fingerprinting and bag-of-words similarity target different levels of granularity in document features, from word n-grams to whole documents. In this paper, we focus on the mutual-inclusive type of near-duplicates, where only partial overlap among documents makes them similar. We propose using a simple and compact sentence-level feature, the sequence of sentence lengths, for near-duplicate copy detection. Various configurations of sentence-level and word-level algorithms are evaluated. The experimental results show that sentence-level algorithms achieve higher efficiency with comparable precision and recall rates.

Jenq-Haur Wang, Hung-Chi Chang
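
A toy sketch of the sentence-level feature: each document becomes its sequence of sentence lengths, compared here by the longest common run of lengths. The naive sentence splitter and the matcher are simple stand-ins for the configurations evaluated in the paper.

```python
# Sentence-length-sequence fingerprints for near-duplicate detection (sketch).
import re

def length_sequence(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]   # length in words

def longest_common_run(a, b):
    """Length of the longest common contiguous subsequence."""
    best, prev = 0, [0] * (len(b) + 1)
    for x in a:
        curr = [0] * (len(b) + 1)
        for j, y in enumerate(b, 1):
            if x == y:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best

d1 = "Copying is easy. Detecting copies is hard. We focus on partial overlap."
d2 = "Copying is easy. Detecting copies is hard. This sentence is brand new right here."
print(length_sequence(d1), length_sequence(d2))   # [3, 4, 5] [3, 4, 7]
print(longest_common_run(length_sequence(d1), length_sequence(d2)))   # 2
```
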

Posters

Language Models of Collaborative Filtering

Collaborative filtering is a major technique for making personalized recommendations about information items (movies, books, web pages, etc.) to individual users. In the literature, a common research objective is to predict unknown ratings of items for a user, on the condition that the user has explicitly rated a certain number of items. Nevertheless, in many practical situations, we may only have implicit evidence of user preferences, such as the playback count of a music file or the visiting frequency of a web site. Most importantly, a more practical view of the recommendation task is to directly generate a top-N ranked list of items that the user is most likely to like.

In this paper, we take these two concerns into account. Item ranking in recommender systems is considered a task highly related to document ranking in text retrieval. Firstly, two practical item scoring functions are derived by adopting the generative language modelling approach of text retrieval. Secondly, to address the uncertainty associated with the score estimation, we introduce a risk-averse model that penalizes the less reliable scores. Our experiments on real data sets demonstrate significant performance gains.

Jun Wang
Efficient Text Classification Using Term Projection

In this paper, we propose an efficient text classification method using term projection. Firstly, we use a modified χ² statistic to project terms into predefined categories, which is more efficient than other clustering methods. Afterwards, we utilize the generated clusters as features to represent the documents. The classification is then performed in a rule-based manner or via SVM. Experimental results show that our modified χ² statistic feature selection method outperforms the traditional χ² statistic, especially at lower dimensionalities, and our method is also more efficient than Latent Semantic Analysis (LSA) on a homogeneous dataset. Meanwhile, we can reduce the feature dimensionality by three orders of magnitude to save training and testing cost while maintaining comparable accuracy. Moreover, on a heterogeneous dataset we gain an approximately 4.3% improvement over the traditional method with a small training set, which indicates that our method has better generalization capability.

Yabin Zheng, Zhiyuan Liu, Shaohua Teng, Maosong Sun
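
For reference, the classical χ² term-category statistic from a 2×2 contingency table (the paper's modified variant is not reproduced here):

```python
# Classical chi-square statistic for a term t and category c, where
#   A: docs in c containing t        B: docs not in c containing t
#   C: docs in c without t           D: docs not in c without t
def chi_square(A, B, C, D):
    N = A + B + C + D
    num = N * (A * D - C * B) ** 2
    den = (A + C) * (B + D) * (A + B) * (C + D)
    return num / den if den else 0.0

# Toy counts: the term appears mostly inside the category, so chi-square is high.
print(round(chi_square(A=40, B=5, C=10, D=45), 2))
```
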
IPHITS: An Incremental Latent Topic Model for Link Structure

The structure of linked documents is dynamic and keeps changing. Even though different methods have been proposed to exploit the link structure in identifying hubs and authorities in a set of linked documents, no existing approach can effectively deal with such changes. This paper explores changes in linked documents and proposes an incremental link probabilistic framework, which we call IPHITS. The model deals with online document streams in a faster, scalable way and uses a novel link updating technique that can cope with dynamic changes. Experimental results on two different sources of online information demonstrate the time-saving strength of our method. In addition, we analyze the stability of the rankings under small perturbations of the linkage patterns.

Huifang Ma, Weizhong Zhao, Zhixin Li, Zhongzhi Shi
Supervised Dual-PLSA for Personalized SMS Filtering

Because users rarely have the patience to provide enough labeled data, a personalized filter is expected to converge much faster. Topic model based dimension reduction can minimize the structural risk with limited training data. In this paper, we propose a novel supervised dual-PLSA which estimates topics from many kinds of observable data, i.e., labeled and unlabeled documents and supervised information about topics. A c-w PLSA model is first proposed, in which word and class are observable variables and topic is latent. Then, two generative models, c-w PLSA and typical PLSA, are combined to share observable variables in order to utilize other observed data. Furthermore, supervised information about topics is employed, yielding the supervised dual-PLSA. Experiments show that dual-PLSA converges very fast: within 100 gold-standard feedback messages, its cumulative error rate drops to 9%. Its total error rate is 6.94%, the lowest among all the filters.

Wei-ran Xu, Dong-xin Liu, Jun Guo, Yi-chao Cai, Ri-le Hu
Enabling Effective User Interactions in Content-Based Image Retrieval

This paper presents an interactive content-based image retrieval framework, uInteract, for delivering a novel four-factor user interaction model visually. The four-factor user interaction model is an interactive relevance feedback mechanism that we propose, aiming to improve the interaction between users and the CBIR system and, in turn, users' overall search experience. In this paper, we present how the framework is developed to deliver the four-factor user interaction model, and how the visual interface is designed to support user interaction activities. From our preliminary user evaluation of the ease of use and usefulness of the proposed framework, we have learnt what users like about the framework and the aspects we could improve in future studies. Whilst the framework is developed for our research purposes, we believe the functionalities could be adapted to any content-based image search framework.

Haiming Liu, Srđan Zagorac, Victoria Uren, Dawei Song, Stefan Rüger
Improving Text Rankers by Term Locality Contexts

When ranking texts retrieved for a query, the semantics of each term t in the texts is a fundamental basis. The semantics often depends on the locality context (neighboring terms) of t in the texts. In this paper, we present a technique, CTFA4TR, that improves text rankers by encoding term locality contexts into the assessment of the term frequency (TF) of each term in the texts. The results of the TF assessment may be directly used to improve various kinds of text rankers, without requiring any revisions to the algorithms and development processes of the rankers. Moreover, CTFA4TR is efficient enough to conduct the TF assessment online, and neither a training process nor training data is required. Empirical evaluation shows that CTFA4TR significantly improves various kinds of text rankers. The contributions are of practical significance: many text rankers have been developed, and if they consider TF in ranking, CTFA4TR may be used to enhance their performance without incurring any cost.

Rey-Long Liu, Zong-Xing Lin
Mutual Screening Graph Algorithm: A New Bootstrapping Algorithm for Lexical Acquisition

Bootstrapping is a weakly supervised algorithm that has been the focus of attention in many Information Extraction (IE) and Natural Language Processing (NLP) fields, especially in learning semantic lexicons. In this paper, we propose a new bootstrapping algorithm called the Mutual Screening Graph Algorithm (MSGA) to learn semantic lexicons. The approach uses only an unannotated corpus and a few seed words to learn new words for each semantic category. By changing the format of the extracted patterns and the method for scoring patterns and words, we improve on earlier bootstrapping algorithms. We also evaluate the semantic lexicons produced by MSGA against the previous bootstrapping algorithms Basilisk [1] and GMR (Graph Mutual Reinforcement based bootstrapping) [4]. Experiments have shown that MSGA can outperform those approaches.

Yuhan Zhang, Yanquan Zhou
Web Image Retrieval for Abstract Queries Using Text and Image Information

In this paper, we propose a method for image retrieval on the Web. In this task, we focus on abstract words that do not directly link to the images that we want. For example, a user might use the query "summer" to retrieve images of "fireworks" or "a white sand beach with the sea". In this case, retrieval systems need to infer direct words for the images from the user's abstract query. In our method, we first extract words related to the query from the Web. Second, we retrieve images from the Web using the extracted words. Then, a user selects relevant images from the retrieved images. Next, the system computes a similarity between the selected images and the other images and ranks the images on the basis of the similarity. We use the Earth Mover's Distance as the similarity. The experimental results show the effectiveness of our method, which uses text and image information for the image retrieval process.

Kazutaka Shimada, Suguru Ishikawa, Tsutomu Endo
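
A minimal illustration of the Earth Mover's Distance on 1-D histograms using SciPy (for normalized 1-D distributions, EMD coincides with the 1-D Wasserstein distance). Real image signatures are typically higher-dimensional, so this shows only the simplest instance of the idea; the histograms are toy assumptions.

```python
# 1-D EMD between toy color histograms via SciPy's Wasserstein distance.
from scipy.stats import wasserstein_distance

bins = [0, 1, 2, 3]                   # e.g., 4 coarse intensity bins
selected_img = [0.1, 0.4, 0.4, 0.1]   # histogram of a user-selected image
candidate    = [0.3, 0.3, 0.2, 0.2]   # histogram of a candidate image

print(wasserstein_distance(bins, bins, selected_img, candidate))
```
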
Question Answering Based on Answer Trustworthiness

Nowadays, we are faced with finding not only "relevant" answers but also "trustworthy" ones. This paper proposes a QA model based on answer trustworthiness. In contrast to past research, which focused on simple trust factors of a document, we identify three different answer trustworthiness factors: 1) incorporating document quality at the document layer; 2) representing the authority and reputation of answer sources at the answer source layer; and 3) verifying the answers by consulting various QA systems at the sub-QAs layer. In our experiments, the proposed method using all answer trustworthiness factors shows improvements of 237% (0.150 to 0.506 MRR) in answering effectiveness and 92% (28,993 to 2,293 min.) in indexing efficiency.

Hyo-Jung Oh, Chung-Hee Lee, Yeo-Chan Yoon, Myung-Gil Jang
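
For reference, the MRR metric reported above, on toy data: the mean over questions of the reciprocal rank of the first correct answer (0 when none is returned).

```python
# Mean reciprocal rank over a set of questions (toy illustration).
def mean_reciprocal_rank(first_correct_ranks):
    """Ranks are 1-based; None means no correct answer was retrieved."""
    return sum(1.0 / r for r in first_correct_ranks if r) / len(first_correct_ranks)

# First correct answer at ranks 1, 2, and never, over 3 questions.
print(round(mean_reciprocal_rank([1, 2, None]), 3))   # 0.5
```
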
Domain Specific Opinion Retrieval

Opinion retrieval is a novel information retrieval task that has attracted a great deal of attention with the rapid increase of online opinionated information. Most previous work adopts the classical two-stage framework, i.e., first retrieving topic-relevant documents and then re-ranking them according to opinion relevance. However, none has considered the problem of domain coherence between queries and topic-relevant documents. In this work, we propose to address this problem based on a similarity measure over the usage of opinion words (which users employ to express opinions). Our work is based on the observation that opinion words are domain dependent. We reformulate the problem as measuring the opinion similarity between the domain opinion models of queries and document opinion models. An opinion model is constructed to capture the distribution of opinion words. The basic idea is that if a document has high opinion similarity with a domain opinion model, it is not only opinionated but also in the same domain as the query (i.e., domain coherence). Experimental results show that our approach performs comparably with state-of-the-art work.

Guang Qiu, Feng Zhang, Jiajun Bu, Chun Chen
A Boosting Approach for Learning to Rank Using SVD with Partially Labeled Data

Learning to rank has become a hot issue in the information retrieval community. It combines relevance judgment information with approaches from both information retrieval and machine learning so as to learn a more accurate ranking function for retrieval. Most previous approaches rely only on the labeled relevance information provided, thus suffering from the limited size of available training data. In this paper, we use Singular Value Decomposition (SVD) on the unlabeled data set to extract new feature vectors, which are then embedded in a RankBoost learning framework. We experimentally compare the performance of our approach against that without the new features generated by SVD. The experimental results show that our approach can consistently improve retrieval performance across several LETOR data sets, indicating the effectiveness of the new SVD-generated features for learning a ranking function.

Yuan Lin, Hongfei Lin, Zhihao Yang, Sui Su
Opinion Target Network and Bootstrapping Method for Chinese Opinion Target Extraction

Opinion mining systems suffer a great loss when unknown opinion targets constantly appear in newly composed reviews. Previous opinion target extraction methods typically take human-compiled opinion targets as seeds and adopt syntactic/statistical patterns to extract new targets. Three problems are worth noting. First, the manually defined opinion targets are too coarse-grained to be good seeds. Second, a flat list of seeds cannot represent the relationships between them. Third, one cycle of opinion target extraction is barely able to give satisfactory performance. As a result, the coverage of existing methods is rather low. In this paper, the opinion target network (OTN) is proposed to organize atom opinion targets of components and attributes in a two-layer graph. Based on OTN, a bootstrapping method is designed for opinion target extraction via generalization and propagation over multiple cycles. Experiments on Chinese opinion target extraction show that the proposed method is effective.

Yunqing Xia, Boyi Hao, Kam-Fai Wong
Automatic Search Engine Performance Evaluation with the Wisdom of Crowds

Relevance evaluation is an important topic in Web search engine research. Traditional evaluation methods require a huge amount of human effort, which makes the evaluation process extremely time-consuming in practice. Based on analysis of large-scale user query logs and click-through data, we propose a performance evaluation method that fully automatically generates large-scale Web search topics and answer sets under the Cranfield framework. These query-to-answer pairs are directly utilized in relevance evaluation with several widely adopted precision/recall-related retrieval performance metrics. Besides single search engine log analysis, we propose user behavior models over multiple search engines' click-through logs to reduce potential bias among different search engines. Experimental results show that the evaluation results are similar to those gained by traditional human annotation, and that our method avoids the propensity and subjectivity of manual judgments by experts in traditional approaches.

Rongwei Cen, Yiqun Liu, Min Zhang, Liyun Ru, Shaoping Ma
A Clustering Framework Based on Adaptive Space Mapping and Rescaling

Traditional clustering algorithms often suffer from the model misfit problem when the distribution of the real data does not fit the model assumptions. To address this problem, we propose a novel clustering framework based on adaptive space mapping and rescaling, referred to as the M-R framework. The basic idea of our approach is to adjust the data representation to make the data distribution fit the model assumptions better. Specifically, documents are first mapped into a low-dimensional space with respect to the cluster centers so that the distribution statistics of each cluster can be analyzed on the corresponding dimension. With these statistics in hand, a rescaling operation is then applied to regularize the data distribution based on the model assumptions. These two steps are conducted iteratively along with the clustering algorithm to constantly improve clustering performance. In our work, we apply the M-R framework to the most widely used clustering algorithm, k-means, as an example. Experiments on well-known datasets show that our M-R framework obtains performance comparable with state-of-the-art methods.

Yiling Zeng, Hongbo Xu, Jiafeng Guo, Yu Wang, Shuo Bai
Research on Lesk-C-Based WSD and Its Application in English-Chinese Bi-directional CLIR

Cross-Language Information Retrieval (CLIR) combines traditional Information Retrieval and Machine Translation techniques. Many aspects of CLIR relate to the problem of polysemy, and these are good entry points for applying Word Sense Disambiguation (WSD). Therefore, this paper attempts to apply WSD in English-Chinese bi-directional CLIR. Query expansion and the proposed Lesk-C WSD strategy are explored. Although only limited improvement on WSD itself can be obtained, query expansion and disambiguation based on the related WSD strategies are beneficial to CLIR and can improve overall retrieval performance. In particular, by considering "coordinate terms", the Lesk-C algorithm shows better performance and broader applicability in CLIR.

Yuejie Zhang, Tao Zhang
Searching Polyphonic Indonesian Folksongs Based on N-gram Indexing Technique

The availability of an enormous amount of digital music presents the challenge of organizing and retrieving it effectively. We explore polyphonic Indonesian folksong retrieval based on pattern matching, such as n-gram matching, for searching the songs. We compare the pattern matching results to a regular text-based information retrieval system. The folksongs are either fully or partially indexed. The results of the experiments show that both the text-based IR system and the n-gram matching technique are effective in retrieving the polyphonic songs, regardless of the query length or the position from which the query fragment is taken. However, to achieve better performance, fully indexed songs are preferable to partially indexed songs.

Aurora Marsye, Mirna Adriani
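
A toy sketch of n-gram indexing for melodies: pitch sequences are converted to overlapping n-grams, which can then be indexed and matched like text terms. Representing notes by pitch intervals (differences), which makes matching transposition-invariant, is a common choice in melody retrieval rather than necessarily the paper's; the data is assumed.

```python
# Interval n-grams for melody matching (illustrative sketch).
def ngrams(seq, n=3):
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

def intervals(pitches):
    return [b - a for a, b in zip(pitches, pitches[1:])]

song  = [60, 62, 64, 65, 67, 65, 64]   # MIDI pitches of an indexed song
query = [64, 66, 68, 69]               # hummed fragment, transposed up

song_grams = set(ngrams(intervals(song)))
query_grams = set(ngrams(intervals(query)))
print(query_grams & song_grams)         # shared n-grams indicate a match
```
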
Study on the Click Context of Web Search Users for Reliability Analysis

User behavior analysis has been shown to be important for the optimization and evaluation of Web search, and has become one of the major areas in both information retrieval and knowledge management research. This paper studies the reliability of users' search behavior based on large-scale query and click-through logs collected from commercial search engines. The concept of reliability is defined in a probabilistic notion. The context of user click behavior on search results is analyzed in terms of relevance. Five features, namely query number, click entropy, first click ratio, last click ratio, and rank position, are proposed and studied to separate reliable user clicks from the others. Experimental results show that the proposed method evaluates the reliability of user behavior effectively. The AUC value of the ROC curve is 0.792, and the algorithm retains 92.8% of relevant clicks while filtering out 40% of low-quality clicks.

Rongwei Cen, Yiqun Liu, Min Zhang, Liyun Ru, Shaoping Ma
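
A minimal sketch of one of the five features, click entropy: the Shannon entropy of the click distribution over results for a query. Intuitively, clicks concentrated on few results suggest more consistent (and arguably more reliable) behavior than scattered clicks; the toy data is assumed.

```python
# Click entropy over the results clicked for a single query.
import math
from collections import Counter

def click_entropy(clicked_urls):
    counts = Counter(clicked_urls)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(click_entropy(["u1", "u1", "u1", "u2"]))   # low: focused clicks
print(click_entropy(["u1", "u2", "u3", "u4"]))   # high: scattered clicks
```
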
Utilizing Social Relationships for Blog Popularity Mining

Due to their ease of use, blogs have become a popular form of online media. Detecting the popularity of blogs in the massive blogosphere is a critical issue. General search engines that ignore the social interconnections between bloggers are less able to discriminate between blogs. This study extracts real-world blog data and analyzes the interconnections in these blog communities for blog popularity mining. The interconnections reveal the consciousness of bloggers and the popularity of blogs, which may reflect blog quality. In this paper, we propose a blog network model based on the interconnection structure between blogs, and a popularity ranking method, called BRank, on the constructed model. Several experiments are conducted to analyze the various explicit and implicit interconnection structures and to discover variances in the impact of interactions in different communities. Experiments on several real blog communities show that the proposed method can detect blogs with great popularity in the blogosphere.

Chih-Lu Lin, Hao-Lun Tang, Hung-Yu Kao
S-node: A Small-World Navigation System for Exploratory Search

In the retrieval of newspapers or weblogs, in which particular terms and expressions are used frequently, it is not easy for the user to come up with appropriate query terms. In this case, it is necessary to present typical feature terms or documents in the document set without depending on the user's input. In this paper, we propose 'S-node', a navigation system for documents. The system extracts two kinds of words, showing exhaustivity or specificity, from documents written in Japanese based on a repetition index, and constructs hyperlinks between documents, based on term co-occurrence, so that various documents can be reached in as few steps as possible. We describe the implementation of the system and the results of its evaluation.

Satoshi Shimada, Tomohiro Fukuhara, Tetsuji Satoh
Efficient Probabilistic Latent Semantic Analysis through Parallelization

Probabilistic latent semantic analysis (PLSA) is considered an effective technique for information retrieval, but has one notable drawback: its dramatic consumption of computing resources, in terms of both execution time and internal memory. This drawback limits practical application of the technique to document collections of modest size.

In this paper, we look into the practice of implementing PLSA with the aim of improving its efficiency without changing its output. Recently, Hong et al. [2008] have shown how the execution time of PLSA can be improved by employing OpenMP for shared-memory parallelization. We extend their work by also studying the effects of using it in combination with the Message Passing Interface (MPI) for distributed-memory parallelization. We show how a more careful implementation of PLSA reduces execution time and memory costs by applying our method to several text collections commonly used in the literature.

Raymond Wan, Vo Ngoc Anh, Hiroshi Mamitsuka
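
A compact, generic PLSA EM sketch in NumPy as a reference point for the parallelization discussion: the E-step and both M-step accumulations decompose over documents, which is what makes shared-memory (OpenMP) and distributed-memory (MPI) data partitioning natural. This is a textbook PLSA, not the authors' optimized implementation.

```python
# Generic PLSA via EM (multiplicative form), for illustration only.
import numpy as np

def plsa(V, n_topics, n_iter=100, eps=1e-12):
    """V: term-document count matrix. Returns P(w|z) and P(z|d)."""
    rng = np.random.default_rng(0)
    n_terms, n_docs = V.shape
    p_w_z = rng.random((n_terms, n_topics)); p_w_z /= p_w_z.sum(0)
    p_z_d = rng.random((n_topics, n_docs));  p_z_d /= p_z_d.sum(0)
    for _ in range(n_iter):
        # E-step folded into the M-step: counts divided by the mixture P(w|d)
        denom = p_w_z @ p_z_d + eps              # P(w|d), shape (terms, docs)
        R = V / denom
        p_w_z_new = p_w_z * (R @ p_z_d.T)        # accumulate over documents
        p_z_d_new = p_z_d * (p_w_z.T @ R)        # accumulate over terms
        p_w_z = p_w_z_new / p_w_z_new.sum(0)
        p_z_d = p_z_d_new / p_z_d_new.sum(0)
    return p_w_z, p_z_d

V = np.array([[4., 0., 1.], [3., 1., 0.], [0., 5., 2.], [0., 2., 3.]])
p_w_z, p_z_d = plsa(V, n_topics=2)
print(np.round(p_w_z, 2))   # term-topic distributions
```
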
Backmatter
Metadata
Title
Information Retrieval Technology
Editors
Gary Geunbae Lee
Dawei Song
Chin-Yew Lin
Akiko Aizawa
Kazuko Kuriyama
Masaharu Yoshioka
Tetsuya Sakai
Copyright Year
2009
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-04769-5
Print ISBN
978-3-642-04768-8
DOI
https://doi.org/10.1007/978-3-642-04769-5
