
2006 | Book

Information Retrieval Technology

Third Asia Information Retrieval Symposium, AIRS 2006, Singapore, October 16-18, 2006. Proceedings

Edited by: Hwee Tou Ng, Mun-Kew Leong, Min-Yen Kan, Donghong Ji

Publisher: Springer Berlin Heidelberg

Book Series: Lecture Notes in Computer Science


About this book

Asia Information Retrieval Symposium (AIRS) 2006 was the third AIRS conference in the series established in 2004. The first AIRS was held in Beijing, China, and the second AIRS was held in Cheju, Korea. The AIRS conference series traces its roots to the successful Information Retrieval with Asian Languages (IRAL) workshop series which started in 1996. The AIRS series aims to bring together international researchers and developers to exchange new ideas and the latest results in information retrieval. The scope of the conference encompassed the theory and practice of all aspects of information retrieval in text, audio, image, video, and multimedia data.

We are happy to report that AIRS 2006 received 148 submissions, the highest number since the conference series started in 2004. Submissions came from Asia and Australasia, Europe, and North America. We accepted 34 submissions as regular papers (23%) and 24 as poster papers (16%). We would like to thank all the authors who submitted papers to the conference, the seven area chairs, who worked tirelessly to recruit the program committee members and oversaw the review process, and the program committee members and their secondary reviewers who reviewed all the submissions.

Table of Contents

Frontmatter

Session 1A: Text Retrieval

Query Expansion with ConceptNet and WordNet: An Intrinsic Comparison

This paper compares the use of ConceptNet and WordNet in query expansion. Spreading activation selects candidate terms for query expansion from these two resources. Three measures (discrimination ability, concept diversity, and retrieval performance) are used for the comparison. The topics and document collections of the ad hoc track of TREC-6, TREC-7 and TREC-8 are adopted in the experiments. The results show that ConceptNet and WordNet are complementary: queries expanded with WordNet have higher discrimination ability, whereas queries expanded with ConceptNet have higher concept diversity. Queries expanded by selecting candidate terms from both ConceptNet and WordNet outperform both unexpanded queries and queries expanded with a single resource.

Ming-Hung Hsu, Ming-Feng Tsai, Hsin-Hsi Chen
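
To make the spreading-activation step above concrete, here is a minimal Python sketch. The toy concept graph, edge weights, decay factor, and pulse count are illustrative assumptions; the paper draws its edges from ConceptNet and WordNet.

    # Minimal spreading-activation sketch for selecting expansion terms.
    # The graph and all parameters are toy assumptions, not the paper's.
    def spread_activation(graph, seeds, decay=0.5, pulses=2):
        """graph: {node: {neighbor: weight}}; seeds: set of query terms."""
        activation = {t: 1.0 for t in seeds}
        frontier = dict(activation)
        for _ in range(pulses):
            nxt = {}
            for node, act in frontier.items():
                for nbr, w in graph.get(node, {}).items():
                    nxt[nbr] = nxt.get(nbr, 0.0) + act * w * decay
            for nbr, act in nxt.items():
                activation[nbr] = activation.get(nbr, 0.0) + act
            frontier = nxt
        # Non-seed nodes, ranked by accumulated activation, are candidates.
        return sorted((t for t in activation if t not in seeds),
                      key=activation.get, reverse=True)

    graph = {"car": {"vehicle": 0.8, "driver": 0.6},
             "vehicle": {"transport": 0.7},
             "driver": {"license": 0.5}}
    print(spread_activation(graph, {"car"}))
    # -> ['vehicle', 'driver', 'transport', 'license']
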
Document Similarity Search Based on Manifold-Ranking of TextTiles

Document similarity search aims to find documents similar to a query document in a text corpus and return a ranked list of similar documents. Most existing approaches compute similarity scores between the query and the documents using a retrieval function (e.g. Cosine) and then rank the documents by those scores. In this paper, we propose a novel retrieval approach based on manifold-ranking of TextTiles to re-rank the initially retrieved documents. The proposed approach makes full use of the intrinsic global manifold structure of the documents' TextTiles in the re-ranking process. Experimental results demonstrate that the proposed approach can significantly improve retrieval performance over different retrieval functions. The TextTile is validated to be a better unit than the whole document in the manifold-ranking process.

Xiaojun Wan, Jianwu Yang, Jianguo Xiao
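
A minimal numpy sketch of the manifold-ranking iteration (in the style of Zhou et al.) that such a re-ranking step builds on; the affinity matrix below is a toy stand-in for the TextTile similarities actually used in the paper.

    # Minimal numpy sketch of a manifold-ranking iteration (Zhou et al. style).
    # The affinity matrix is a toy stand-in for TextTile similarities.
    import numpy as np

    def manifold_rank(W, y, alpha=0.9, iters=100):
        """W: (n, n) symmetric affinity, zero diagonal; y: initial scores."""
        d = W.sum(axis=1)
        d[d == 0] = 1e-12                   # guard isolated nodes
        S = W / np.sqrt(np.outer(d, d))     # D^-1/2 W D^-1/2 normalization
        f = y.astype(float)
        for _ in range(iters):
            f = alpha * S @ f + (1 - alpha) * y   # spread scores on the manifold
        return f                            # higher f = closer to the query tiles

    W = np.array([[0, 1, 1, 0],
                  [1, 0, 1, 0],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)
    y = np.array([1.0, 0.0, 0.0, 0.0])      # tile 0 matches the query
    print(manifold_rank(W, y).round(3))
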
Adapting Document Ranking to Users’ Preferences Using Click-Through Data

This paper proposes a new approach to ranking the documents retrieved by a search engine using click-through data. The goal is to make the final ranked list of documents accurately represent users’ preferences reflected in the click-through data. Our approach combines the ranking result of a traditional IR algorithm (BM25) with that given by a machine learning algorithm (Naïve Bayes). The machine learning algorithm is trained on click-through data (queries and their associated documents), while the IR algorithm runs over the document collection. We consider several alternative strategies for combining the result of using click-through data and that of using document data. Experimental results confirm that any method of using click-through data greatly improves the preference ranking, over the method of using BM25 alone. We found that a linear combination of scores of Naïve Bayes and scores of BM25 performs the best for the task. At the same time, we found that the preference ranking methods can preserve relevance ranking, i.e., the preference ranking methods can perform as well as BM25 for relevance ranking.

Min Zhao, Hang Li, Adwait Ratnaparkhi, Hsiao-Wuen Hon, Jue Wang
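
The best-performing strategy reported above, a linear combination of BM25 and Naïve Bayes scores, can be sketched in a few lines of Python. The weight and scores here are illustrative assumptions; in practice the two score distributions would first be normalized to comparable ranges.

    # Minimal sketch of linearly combining BM25 and Naive Bayes scores.
    # Weights and scores are illustrative, not the authors' tuned values.
    def combine_scores(bm25, nb, weight=0.6):
        """bm25, nb: {doc_id: score}; returns doc_ids ranked by combined score."""
        docs = set(bm25) | set(nb)
        combined = {d: weight * bm25.get(d, 0.0) + (1 - weight) * nb.get(d, 0.0)
                    for d in docs}
        return sorted(combined, key=combined.get, reverse=True)

    bm25_scores = {"d1": 2.1, "d2": 1.4, "d3": 0.9}   # relevance ranking
    nb_scores = {"d2": 0.8, "d3": 0.7}                # preference model from clicks
    print(combine_scores(bm25_scores, nb_scores))     # -> ['d1', 'd2', 'd3']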

Session 1B: Search and Extraction

A PDD-Based Searching Approach for Expert Finding in Intranet Information Management

Expert finding is a frequently faced problem in Intranet information management, aiming to locate certain employees in large organizations. A Person Description Document (PDD)-based retrieval model is proposed in this paper for effective expert finding. First, features and context about an expert are extracted to form a profile, called the expert's PDD. A retrieval strategy based on the BM2500 algorithm and bi-gram weighting is then used to rank experts, each represented by their PDD. This model proves effective, and the method based on it achieved the best performance in the TREC 2005 expert finding task. Comparative studies with traditional non-PDD methods indicate that the proposed model improves system performance by over 45%.

Yupeng Fu, Rongjing Xiang, Min Zhang, Yiqun Liu, Shaoping Ma
A Supervised Learning Approach to Entity Search

In this paper we address the problem of entity search. Expert search and time search are used as examples. In entity search, given a query and an entity type, a search system returns a ranked list of entities of that type (e.g., person name, time expression) relevant to the query. Ranking is a key issue in entity search. In the literature, only expert search has been studied, and the use of co-occurrence was proposed. In general, many features may be useful for ranking in entity search. We propose using a linear model to combine different features and employing a supervised learning approach to train the model. Experimental results on several data sets indicate that our method significantly outperforms the baseline method based solely on co-occurrences.

Guoping Hu, Jingjing Liu, Hang Li, Yunbo Cao, Jian-Yun Nie, Jianfeng Gao
Hierarchical Learning Strategy in Relation Extraction Using Support Vector Machines

This paper proposes a novel hierarchical learning strategy to deal with the data sparseness problem in relation extraction by modeling the commonality among related classes. For each class in the hierarchy, whether manually predefined or automatically clustered, a discriminative function is determined in a top-down way. As an upper-level class normally has many more positive training examples than a lower-level class, its discriminative function can be determined more reliably and effectively, and can thus guide the learning of the discriminative function at the lower level, which might otherwise suffer from limited training data. In this paper, state-of-the-art Support Vector Machines are applied as the basic classifier learning approach within the hierarchical learning strategy. Evaluation on the ACE RDC 2003 and 2004 corpora shows that the hierarchical learning strategy substantially improves performance on least- and medium-frequent relations.

GuoDong Zhou, Min Zhang, Guohong Fu

Session 1C: Text Classification and Indexing

Learning to Separate Text Content and Style for Classification

Many text documents naturally have two kinds of labels. For example, we may label web pages from universities according to their categories, such as “student” or “faculty”, or according to the source universities, such as “Cornell” or “Texas”. We call one kind of labels the content and the other kind the style. Given a set of documents, each with both content and style labels, we seek to effectively learn to classify a set of documents in a new style, with no content labels, into its content classes. Assuming that every document is generated using words drawn from a mixture of two multinomial component models, one content model and one style model, we propose a method named Cartesian EM that constructs content models and style models through Expectation Maximization and performs classification of the unknown content classes transductively. Our experiments on real-world datasets show the proposed method to be effective for style-independent text content classification.

Dell Zhang, Wee Sun Lee
Using Relative Entropy for Authorship Attribution

Authorship attribution is the task of deciding who wrote a particular document. Several attribution approaches have been proposed in recent research, but none of these approaches is particularly satisfactory; some of them are ad hoc and most have defects in terms of scalability, effectiveness, and efficiency. In this paper, we propose a principled approach motivated by information theory to identify authors based on elements of writing style. We make use of the Kullback-Leibler divergence, a measure of how different two distributions are, and explore several different approaches to tokenizing documents to extract style markers. We use several data collections to examine the performance of our approach. We have found that our proposed approach is as effective as the best existing attribution methods for two-class attribution, and is superior for multi-class attribution. It has lower computational cost and is cheaper to train. Finally, our results suggest this approach is a promising alternative for other categorization problems.

Ying Zhao, Justin Zobel, Phil Vines
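
A minimal sketch of the KL-divergence attribution idea, assuming plain unigram token distributions with additive smoothing as the style markers; the paper explores several richer tokenizations.

    # Minimal sketch of KL-divergence authorship attribution with unigram
    # style markers and additive smoothing; toy data throughout.
    import math
    from collections import Counter

    def distribution(tokens, vocab, eps=1e-6):
        counts = Counter(tokens)
        total = len(tokens) + eps * len(vocab)         # additive smoothing
        return {w: (counts[w] + eps) / total for w in vocab}

    def kl(p, q):
        """Kullback-Leibler divergence D(p || q) over a shared vocabulary."""
        return sum(p[w] * math.log(p[w] / q[w]) for w in p)

    def attribute(doc_tokens, author_tokens):
        vocab = set(doc_tokens).union(*map(set, author_tokens.values()))
        p = distribution(doc_tokens, vocab)
        # Attribute to the author whose distribution is closest to the document.
        return min(author_tokens,
                   key=lambda a: kl(p, distribution(author_tokens[a], vocab)))

    authors = {"A": "the sea was calm the sea was dark".split(),
               "B": "stocks rose sharply as markets rallied".split()}
    print(attribute("the dark sea".split(), authors))  # -> 'A'
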
Efficient Query Evaluation Through Access-Reordering

Reorganising the index of a search engine based on access frequencies can significantly reduce query evaluation time while maintaining search effectiveness. In this paper we extend access-ordering and introduce a variant index organisation technique that we label access-reordering. We show that by access-reordering an inverted index, query evaluation time can be reduced by as much as 62% over the standard approach, while yielding highly similar effectiveness results to those obtained when using a conventional index.

Steven Garcia, Andrew Turpin
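
The core of access-reordering can be sketched as a renumbering of documents by past access frequency, so that posting lists keep frequently accessed documents at their front and evaluation can terminate early. The pruning rule itself is omitted; the index and frequencies below are toy data.

    # Minimal sketch of access-reordering: renumber documents so the most
    # frequently accessed ones get the smallest IDs and therefore sit at the
    # front of every posting list, enabling early termination of evaluation.
    def access_reorder(index, access_freq):
        """index: {term: [doc_id, ...]}; access_freq: {doc_id: access count}."""
        by_freq = sorted(access_freq, key=access_freq.get, reverse=True)
        remap = {old: new for new, old in enumerate(by_freq)}
        return {t: sorted(remap[d] for d in docs) for t, docs in index.items()}

    index = {"ir": [0, 1, 2, 3], "index": [1, 3]}
    freq = {0: 2, 1: 50, 2: 7, 3: 31}          # past query accesses per doc
    print(access_reorder(index, freq))         # {'ir': [0, 1, 2, 3], 'index': [0, 1]}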

Session 1D: Text Clustering

Natural Document Clustering by Clique Percolation in Random Graphs

Document clustering techniques mostly depend on models that impose explicit and/or implicit a priori assumptions about the number, size, and disjunction characteristics of clusters, and/or the probability distribution of the clustered data. As a result, the clustering results tend to be unnatural and stray more or less from the intrinsic grouping structure of the documents in a corpus. We propose a novel graph-theoretic technique called Clique Percolation Clustering (CPC). It models clustering as a process of enumerating adjacent maximal cliques in a random graph that unveils the inherent structure of the underlying data, relaxing the commonly imposed constraints in order to discover natural, overlapping clusters. Experiments show that CPC can outperform some typical algorithms on benchmark data sets, and shed light on natural document clustering.

Wei Gao, Kam-Fai Wong
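
A minimal sketch in the spirit of CPC, using standard k-clique percolation as implemented in networkx over a thresholded document-similarity graph; k, the threshold, and the similarities are assumptions, and the paper's adjacent-maximal-clique formulation differs in detail.

    # Minimal sketch in the spirit of CPC: standard k-clique percolation
    # (networkx) on a thresholded document-similarity graph.
    import networkx as nx
    from networkx.algorithms.community import k_clique_communities

    def cpc_clusters(similarities, threshold=0.3, k=3):
        """similarities: {(doc_a, doc_b): score}; returns overlapping clusters."""
        g = nx.Graph((a, b) for (a, b), s in similarities.items() if s >= threshold)
        # Communities are unions of k-cliques sharing k-1 nodes, so a document
        # may belong to several clusters (natural overlapping clustering).
        return [set(c) for c in k_clique_communities(g, k)]

    sims = {("d1", "d2"): 0.9, ("d1", "d3"): 0.8, ("d2", "d3"): 0.7,
            ("d3", "d4"): 0.6, ("d2", "d4"): 0.5, ("d4", "d5"): 0.1}
    print(cpc_clusters(sims))   # -> [{'d1', 'd2', 'd3', 'd4'}]
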
Text Clustering with Limited User Feedback Under Local Metric Learning

This paper investigates the idea of incorporating incremental user feedback and a small number of sample documents for some, not necessarily all, clusters into text clustering. For the modeling of each cluster, we make use of a local weight metric to reflect the importance of the features for that particular cluster. The local weight metric is learned using both the unlabeled data and the constraints generated automatically from user feedback and sample documents. The quality of the local metric is improved by incorporating more precise constraints; improving the quality of the local metric in turn enhances the clustering performance. We have conducted extensive experiments on real-world news documents. The results demonstrate that user feedback information coupled with local metric learning can dramatically improve the clustering performance.

Ruizhang Huang, Zhigang Zhang, Wai Lam
Toward Generic Title Generation for Clustered Documents

A cluster labeling algorithm for creating generic titles based on external resources such as WordNet is proposed. Our method first extracts category-specific terms as cluster descriptors. These descriptors are then mapped to generic terms based on a hypernym search algorithm. The proposed method has been evaluated on a patent document collection and a subset of the Reuters-21578 collection. Experimental results revealed that our method performs as anticipated. Real-case applications of these generic terms show promise in assisting humans in interpreting the clustered topics. Our method is general enough that it can easily be extended to use other hierarchical resources for adaptable label generation.

Yuen-Hsien Tseng, Chi-Jen Lin, Hsiu-Han Chen, Yu-I Lin

Session 1E: Information Retrieval Models

Word Sense Language Model for Information Retrieval

This paper proposes a word sense language model based method for information retrieval. This method, differing from most traditional ones, combines word senses defined in a thesaurus with a classic statistical model. The word sense language model regards the word sense as a form of linguistic knowledge, which is helpful in handling mismatches caused by synonyms and data sparseness due to limited data. Experimental results on the TREC-Mandarin corpus show that this method gains a 12.5% improvement in MAP over the traditional tf-idf retrieval method but a 5.82% decrease in MAP compared to a classic language model. Combining this method with the language model yields 8.92% and 7.93% increases over each of them respectively. We present analysis and discussion of these not-so-exciting results and conclude that higher performance of the word sense language model will depend on highly accurate word sense labeling. We believe that linguistic knowledge such as the word senses of a thesaurus will ultimately help IR improve in many ways.

Liqi Gao, Yu Zhang, Ting Liu, Guiping Liu
Statistical Behavior Analysis of Smoothing Methods for Language Models of Mandarin Data Sets

In this paper, we discuss the statistical behavior and entropies of three smoothing methods (two well-known ones and one that we propose) used with three language models on Mandarin data sets. Because of the problem of data sparseness, smoothing methods are employed to estimate the probability of each event (including all seen and unseen events) in a language model. A set of properties for analyzing the statistical behavior of the three smoothing methods is proposed; our proposed smoothing method complies with all the properties. We implement three language models on Mandarin data sets and then discuss the entropy. In general, the entropies of the proposed smoothing method for the three models are lower than those of the other two methods.

Ming-Shing Yu, Feng-Long Huang, Piyu Tsai
No Tag, a Little Nesting, and Great XML Keyword Search

Keyword search from Information Retrieval (IR) can be seen as one of the most convenient processing modes for ordinary users to obtain interesting information. As XML data becomes more and more widespread, the trend of adapting keyword search to XML data also becomes more and more active. In this paper, we first try a nesting mechanism for XML keyword search, which uses just a little nesting skill. This attempt has several benefits. First, it is convenient for ordinary users, because they need not know anything about the organization of the target XML data. Secondly, the nesting pattern can be easily transformed into structural hints, using the same mechanism as the XML data model. Finally, since there is no need for label information, we can retrieve XML fragments from different schemas. Besides, this paper also proposes a new similarity measuring method for retrieved XML fragments, which can come from different schemas. Its kernel is the KCAM (Keyword Common Ancestor Matrix) structure, which stores the level information of the SLCA (Smallest Lowest Common Ancestor) node between two keywords. By mapping XML fragments into KCAMs, structural similarity can be computed using matrix distance. The KCAM distance works well with the nesting keyword method.

Lingbo Kong, Shiwei Tang, Dongqing Yang, Tengjiao Wang, Jun Gao

Session 2A: Web Information Retrieval

Improving Re-ranking of Search Results Using Collaborative Filtering

Search engines today often return a large volume of results with possibly only a few relevant ones. The notion of relevance is subjective and depends on the user and the context of the search. Re-ranking these results to reflect the most relevant results to the user, using a user profile built from relevance feedback, has proved to provide good results. Our approach assumes implicit feedback gathered from search engine query logs and learns a user profile. The user profile typically runs into sparsity problems due to the sheer volume of the WWW; sparsity refers to the missing weights of certain words in the user profile. In this paper, we present an effective re-ranking strategy that compensates for the sparsity in a user's profile by applying collaborative filtering algorithms. Our evaluation results show an improvement in precision over approaches that use only a user's profile.

U Rohini, Vamshi Ambati
Learning to Integrate Web Catalogs with Conceptual Relationships in Hierarchical Thesaurus

Web catalog integration has been addressed as an important issue in current digital content management. Past studies have shown that exploiting a flattened structure with auxiliary information extracted from the source catalog can improve integration results. Although earlier studies have also shown that exploiting a hierarchical structure in classification may bring further advantages, its effectiveness has not been verified for catalog integration. In this paper, we propose an enhanced catalog integration (ECI) approach to extract the conceptual relationships from the hierarchical Web thesaurus and further improve the accuracy of Web catalog integration. We have conducted experiments on real-world catalog integration with both a flattened structure and a hierarchical structure in the destination catalog. The results show that our ECI scheme effectively boosts the integration accuracy of both the flattened scheme and the hierarchical scheme with the advanced Support Vector Machine (SVM) classifiers.

Jui-Chi Ho, Ing-Xiang Chen, Cheng-Zen Yang
Discovering Authoritative News Sources and Top News Stories

With the popularity of reading news online, the idea of assembling news articles from multiple news sources and digging out the most important stories has become very appealing. In this paper we present a novel algorithm to rank assembled news articles and news sources according to their importance and authority respectively. We employ the visual layout information of news homepages and exploit the mutual reinforcement relationship between news articles and news sources. Specifically, we propose to use a label propagation based semi-supervised learning algorithm to improve the structure of the relation graph between sources and news articles. The integration of the label propagation algorithm with the HITS-like mutual reinforcement algorithm produces a quite effective ranking algorithm. We implemented a system, TOPSTORY, which can automatically generate homepages for users to browse important news. The result of ranking a set of news articles collected from multiple sources over a period of half a month illustrates the effectiveness of our algorithm.

Yang Hu, Mingjing Li, Zhiwei Li, Wei-ying Ma

Session 2B: Cross-Language Information Retrieval

Chinese Question-Answering: Comparing Monolingual with English-Chinese Cross-Lingual Results

A minimal approach to Chinese factoid QA is described. It employs entity extraction software, template matching, and statistical candidate answer ranking via five evidence types, and does not use explicit word segmentation or Chinese syntactic analysis. This simple approach is more portable to other Asian languages, and may serve as a base on which more precise techniques can be used to improve results. Applied to the NTCIR-5 monolingual environment, it delivers moderate top-1 accuracy and MRR of .295 and .3381 (supported answers) and .41 and .4998 (including unsupported) respectively. When applied to English-Chinese cross-language QA with three different forms of English-Chinese question translation, it attains top-1 accuracy and MRR of .155 and .2094 (supported) and .215 and .2932 (unsupported), about 52% to 62% of monolingual effectiveness. CLQA improvements via successively different forms of question translation are also demonstrated.

Kui-Lam Kwok, Peter Deng
Translation of Unknown Terms Via Web Mining for Information Retrieval

Many English words appear in Asian language texts, especially in news reports and technical documents. Although a foreign term and its counterpart in English refer to the same concept, they are erroneously treated as independent index units in traditional monolingual IR. For CLIR, one of the major hindrances to achieving retrieval performance at the level of monolingual information retrieval is the translation of query terms that are not found in a bilingual dictionary. This paper describes the degree to which these problems arise in Korean Information Retrieval and suggests a novel approach to solve them. Experimental results on the NTCIR and KT-Set test collections show that the high translation precision of our approach greatly improves IR performance.

Qing Li, Sung Hyon Myaeng, Yun Jin, Bo-Yeong Kang
A Cross-Lingual Framework for Web News Taxonomy Integration

There are currently many news sites providing online news articles, and many Web news portals have arisen to provide clustered news categories for users to browse related news reports and understand news events in depth. However, to the best of our knowledge, most Web news portals only provide monolingual news clustering services. In this paper, we study the cross-lingual Web news taxonomy integration problem, in which news articles on the same news event reported in different languages are to be integrated into one category. Our study builds on cross-lingual classification research results and the cross-training concept to construct SVM-based classifiers for cross-lingual Web news taxonomy integration. We have conducted several experiments with news articles from Google News as the experimental data sets. The experimental results show that the proposed cross-training classifiers outperform traditional SVM classifiers across the board. We believe that the proposed framework can be applied to different bilingual environments.

Cheng-Zen Yang, Che-Min Chen, Ing-Xiang Chen

Session 2C: Question Answering and Summarization

Learning Question Focus and Semantically Related Features from Web Search Results for Chinese Question Classification

Recently, machine learning techniques like support vector machines have been employed for question classification. However, these techniques depend heavily on the availability of large amounts of training data, and may run into difficulties when facing various new questions from real users on the Web. To mitigate the problem of insufficient training data, in this paper we present a simple learning method that explores Web search results to collect more training data automatically using a few seed terms (question answers). In addition, we propose a novel semantically related feature model (SRFM), which takes advantage of question focuses and their semantically related features learned from the larger body of collected training data to support the determination of question type. Our experimental results show that the proposed learning method obtains better classification performance than the bigram language modeling (LM) approach for questions with untrained question focuses.

Shu-Jung Lin, Wen-Hsiang Lu
Improving the Robustness to Recognition Errors in Speech Input Question Answering

In our previous work, we developed a prototype of a speech-input help system for home appliances such as digital cameras and microwave ovens. Given a factoid question, the system performs textual question answering using the manuals as the knowledge source. Given a HOW question, on the other hand, it retrieves and plays a demonstration video. However, our first prototype suffered from speech recognition errors, especially when the Japanese interrogative phrases in factoid questions were misrecognized. We therefore propose a method for solving this problem, which complements a speech query transcript with an interrogative phrase selected from a pre-determined list. The selection process first narrows down candidate phrases based on co-occurrences within the manual text, and then computes the similarity between each candidate and the query transcript in terms of pronunciation. Our method improves the Mean Reciprocal Rank of the top three answers from 0.429 to 0.597 for factoid questions.

Hideki Tsutsui, Toshihiko Manabe, Mika Fukui, Tetsuya Sakai, Hiroko Fujii, Koji Urata
An Adjacency Model for Sentence Ordering in Multi-document Summarization

In this paper, we propose a new method named adjacency-based ordering to order sentences for summarization tasks. Given a group of sentences to be organized into a summary, the connectivity of each pair of sentences is learned from the source documents. Then a top-first strategy is applied to define the sentence ordering. This provides a way to order texts when no information other than the source documents is available. We compared this method with other existing sentence ordering methods. Experiments and evaluations were conducted on the DUC04 data collection. The results show that this method distinctly outperforms other existing sentence ordering methods. Its low input requirements also make it applicable to most summarization and text generation tasks.

Yu Nie, Donghong Ji, Lingpeng Yang

Session 2D: Natural Language Processing

Poor Man’s Stemming: Unsupervised Recognition of Same-Stem Words

We present a new fully unsupervised, human-intervention-free algorithm for stemming for an open class of languages. Since it does not rely on existing large data collections or linguistic resources other than raw text, it is especially attractive for low-density languages. The stemming problem is formulated as a decision whether two given words are variants of the same stem, requiring that, if so, there is a concatenative relation between the two. The underlying theory makes no assumptions on whether the language uses a lot of morphology or not, whether it is prefixing or suffixing, or whether affixes are long or short. It does, however, assume that (1) salient affixes have to be frequent, (2) words essentially are variable-length sequences of random characters, and (3) a heuristic on what constitutes a systematic affix alteration is valid. Tested on four typologically distant languages, the stemmer shows very promising results in an evaluation against a human-made gold standard.

Harald Hammarström
NAYOSE: A System for Reference Disambiguation of Proper Nouns Appearing on Web Pages

We are developing a reference disambiguation system called NAYOSE System. To cope with cases where the same person name or place name appears on two or more Web pages, we propose a system that classifies each page into a cluster corresponding to a single entity in the real world. For this purpose, we propose two new methods involving algorithms to classify these pages. In our evaluation, the combination of local text matching and named entity matching outperformed the previous baseline algorithm, a simple document classification method, by 0.22 in the overall F-measure.

Shingo Ono, Minoru Yoshida, Hiroshi Nakagawa
Efficient and Robust Phrase Chunking Using Support Vector Machines

Automatic text chunking is a task which aims to recognize phrase structures in natural language text. It is a key technology for knowledge-based systems, where phrase structures provide important syntactic information for knowledge representation. SVM-based phrase chunking systems have been shown to achieve high performance for text chunking, but their inefficiency limits actual use on large datasets, handling only a few thousand tokens per second. In this paper, we first show that conventional SVM learning achieves state-of-the-art performance (94.25) on the CoNLL-2000 shared task. However, off-the-shelf SVM classifiers are inefficient when the number of phrase types scales up. Therefore, we present two novel methods that make the system substantially faster in terms of training and testing while resulting in only a slight decrease in system performance. Experimental results show that our method achieves an F-rate of 94.09 while handling 13,000 tokens per second on the CoNLL-2000 chunking task.

Yu-Chieh Wu, Jie-Chi Yang, Yue-Shi Lee, Show-Jane Yen

Session 2E: Evaluation

Statistical and Comparative Evaluation of Various Indexing and Search Models

This paper first describes various strategies (character, bigram, automatic segmentation) used to index the Chinese (ZH), Japanese (JA) and Korean (KR) languages. Second, based on the NTCIR-5 test collections, it evaluates various retrieval models, ranging from classical vector-space models to more recent developments in probabilistic and language models. While no clear conclusion was reached for the Japanese language, the bigram-based indexing strategy seems to be the best choice for Korean, and the combined “unigram & bigram” indexing strategy is best for traditional Chinese. On the other hand, the Divergence from Randomness (DFR) probabilistic model usually results in the best mean average precision. Finally, upon an evaluation of four different statistical tests, we find that their conclusions correlate, even more so when comparing the non-parametric bootstrap with the t-test.

Samir Abdou, Jacques Savoy
Bootstrap-Based Comparisons of IR Metrics for Finding One Relevant Document

This paper compares the sensitivity of IR metrics designed for the task of finding one relevant document, using a method recently proposed at SIGIR 2006. The metrics are: P⁺-measure, P-measure, O-measure, Normalised Weighted Reciprocal Rank (NWRR), and Reciprocal Rank (RR). All of them except for RR can handle graded relevance. Unlike the ad hoc (but nevertheless useful) “swap” method proposed by Voorhees and Buckley, the new method derives the sensitivity and the performance difference required to guarantee a given significance level directly from Bootstrap Hypothesis Tests. We use four data sets from NTCIR to show that, according to this method, “P(⁺)-measure ≥ O-measure ≥ NWRR ≥ RR” generally holds, where “≥” means “is at least as sensitive as”. These results generalise and reinforce previously reported ones based on the swap method. Therefore, we recommend the use of P(⁺)-measure and O-measure for practical tasks such as known-item search where recall is either unimportant or immeasurable.

Tetsuya Sakai
Evaluating Topic Difficulties from the Viewpoint of Query Term Expansion

Query term expansion is an important technique for achieving higher retrieval performance. However, since many factors affect the quality of this technique, it is difficult to evaluate it in isolation. In this study, feature quantities that characterize the quality of the initial query are defined for evaluating topic difficulty from the viewpoint of query term expansion. I briefly review the results of the NTCIR-5 query term expansion subtask, which uses these quantities for evaluating the effectiveness of query term expansion techniques, and describe detailed results on the effect of query term expansion based on a topic-by-topic analysis.

Masaharu Yoshioka

Session 3A: Multimedia Information Retrieval

Incorporating Prior Knowledge into Multi-label Boosting for Cross-Modal Image Annotation and Retrieval

Automatic image annotation (AIA) has proved to be an effective and promising solution for automatically deducing high-level semantics from low-level visual features. In this paper, we formulate the task of image annotation as a multi-label, multi-class semantic image classification problem and propose a simple yet effective joint classification framework in which probabilistic multi-label boosting and contextual semantic constraints are integrated seamlessly. We conducted experiments on a medium-sized image collection of about 5000 images from Corel Stock Photo CDs. The experimental results demonstrate that the annotation performance of our proposed method is comparable to state-of-the-art approaches, showing the effectiveness and feasibility of the proposed unified framework.

Wei Li, Maosong Sun
A Venation-Based Leaf Image Classification Scheme

Most content-based image retrieval systems use image features such as textures, colors, and shapes. However, in the case of leaf images, it is not appropriate to rely on color or texture features only, because such features are similar in most leaves. In this paper, we propose a novel leaf image retrieval scheme which first analyzes leaf venation for leaf categorization and then extracts and uses shape features to find similar leaves within the categorized group in the database. The venation of a leaf corresponds to the blood vessels of organisms. Leaf venations are represented using points selected by a curvature scale space corner detection method on the venation image, and categorized by calculating the density of feature points using non-parametric density estimation. We show the scheme's effectiveness by performing several experiments on a prototype system.

Jin-Kyu Park, EenJun Hwang, Yunyoung Nam
Pic-A-Topic: Gathering Information Efficiently from Recorded TV Shows on Travel

We introduce a system called Pic-A-Topic, which analyses closed captions of Japanese TV shows on travel to perform topic segmentation and topic sentence selection. Our objective is to provide a table-of-contents interface that enables efficient viewing of desired topical segments within recorded TV shows to users of appliances such as hard disk recorders and digital TVs. According to our experiments using 14.5 hours of recorded travel TV shows, Pic-A-Topic’s F1-measure for the topic segmentation task is 82% of manual performance on average. Moreover, a preliminary user evaluation experiment suggests that this level of performance may be indistinguishable from manual performance.

Tetsuya Sakai, Tatsuya Uehara, Kazuo Sumita, Taishi Shimomori
A Music Retrieval System Based on Query-by-Singing for Karaoke Jukebox

This paper investigates the problem of retrieving Karaoke music by singing. The Karaoke music encompasses two audio channels in each track: one is a mix of vocal and background accompaniment, and the other is composed of accompaniment only. The accompaniments in the two channels often resemble each other, but are not identical. This characteristic is exploited to infer the vocal’s background music from the accompaniment-only channel, so that the main melody underlying the vocal signals can be extracted more effectively. To enable an efficient and accurate search for a large music database, we propose a phrase onset detection method based on Bayesian Information Criterion (BIC) for predicting the most likely beginning of a sung query, and adopt a multiple-level multiple-pass Dynamic Time Warping (DTW) for melody similarity comparison. The experiments conducted on a Karaoke database consisting of 1,071 popular songs show the promising results of query-by-singing retrieval for Karaoke music.

Hung-Ming Yu, Wei-Ho Tsai, Hsin-Min Wang
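
The melody-matching core can be illustrated with a single-pass dynamic time warping between a sung query's pitch contour and each candidate melody. The paper's multiple-level, multiple-pass DTW and BIC-based phrase onset detection are not reproduced; the pitch sequences below are toy semitone values.

    # Minimal single-pass DTW between a sung query's pitch contour and
    # candidate melodies; toy semitone values throughout.
    def dtw(query, melody):
        n, m = len(query), len(melody)
        INF = float("inf")
        d = [[INF] * (m + 1) for _ in range(n + 1)]
        d[0][0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = abs(query[i - 1] - melody[j - 1])
                d[i][j] = cost + min(d[i - 1][j],      # skip a query note
                                     d[i][j - 1],      # skip a melody note
                                     d[i - 1][j - 1])  # align the two notes
        return d[n][m]

    query = [60, 62, 64, 62]                           # sung pitch contour
    candidates = {"song_a": [60, 62, 64, 64, 62],
                  "song_b": [55, 57, 55, 52]}
    print(min(candidates, key=lambda s: dtw(query, candidates[s])))  # 'song_a'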

Special Session: Medical Image Retrieval

A Semantic Fusion Approach Between Medical Images and Reports Using UMLS

One of the main challenges in content-based image retrieval still remains to bridge the gap between low-level features and semantic information. In this paper, we present our first results concerning a medical image retrieval approach using semantic medical image and report indexing within a fusion framework, based on the Unified Medical Language System (UMLS) metathesaurus. We propose a structured learning framework based on Support Vector Machines to facilitate modular design and extract medical semantics from images. We developed two complementary visual indexing approaches within this framework: a global indexing to access image modality, and a local indexing to access semantic local features. Visual indexes and textual indexes – extracted from medical reports using the MetaMap software application – constitute the input of the late fusion module. A weighted vectorial norm fusion algorithm allows the retrieval system to increase its meaningfulness, efficiency and robustness. First results on the CLEF medical database are presented. The important perspectives of this approach in terms of semantic query expansion and data mining are discussed.

Daniel Racoceanu, Caroline Lacoste, Roxana Teodorescu, Nicolas Vuillemenot
Automated Object Extraction for Medical Image Retrieval Using the Insight Toolkit (ITK)

Visual information retrieval is an emerging domain in the medical field, as it has been in computer vision for more than ten years. It has the potential to help better manage the rising amount of visual medical data. One of the most frequent application fields for content-based medical image retrieval (CBIR) is diagnostic aid. By submitting an image showing a certain pathology to a CBIR system, the medical expert can easily find similar cases. A major problem is the background surrounding the object in many medical images. System parameters of the imaging modalities are stored around the images as text, along with the patient name or a logo of the institution. With such noisy input data, image retrieval often finds images where the object simply appears in the same area and is surrounded by similar structures. Whereas in specialised application domains segmentation can focus the search on a particular area, PACS-like (Picture Archiving and Communication System) databases containing a large variety of images need a more general approach. This article describes an algorithm to extract the important object of the image, reducing the amount of data to be analysed for CBIR and focusing analysis on the important object. Most current solutions index the entire image without distinguishing between object and background when using varied PACS-like databases or radiology teaching files. Our requirement is a fully automatic algorithm for object extraction. Medical images have the advantage of normally having one particular object more or less in the centre of the image. The database used for evaluating this task is taken from a radiology teaching file called casimage, and the retrieval component is an open source retrieval engine called medGIFT.

Henning Müller, Joris Heuberger, Adrien Depeursinge, Antoine Geissbühler
Stripe: Image Feature Based on a New Grid Method and Its Application in ImageCLEF

Many features have been developed for images, such as Blob, image patches, and Gabor filters, but generally their calculation cost is too high. When facing a large image database, their response speed can hardly satisfy users' demands in real time, especially for online users. We therefore developed a new image feature based on a new region division method for images, which we named 'stripe'. As shown by applications in ImageCLEF's medical subtasks, the stripe feature is much faster to compute than other features. Its influence on system performance is also interesting: a little higher than the best result in the ImageCLEF 2004 medical retrieval task (Mean Average Precision, MAP: 44.95% vs. 44.69%), which used Gabor filters, and much better than Blob and low-resolution map in the ImageCLEF 2006 medical annotation task (classification correctness rate: 75.5% vs. 58.5% and 75.1%).

Bo Qiu, Daniel Racoceanu, Chang Sheng Xu, Qi Tian

Poster Session

An Academic Information Retrieval System Based on Multiagent Framework

In real-life information searches, the set of information retrieved by a query influences the user's knowledge. Usually this influence inspires the user with new ideas and a new conception of the query. As a result, the search is iterated while the user's query continually shifts in part or in whole. This sort of search is called an "evolving search," and it plays an important role in academic information retrieval. To support the use of digital academic information, this paper proposes a novel system for academic information retrieval. In the proposed system, which is based on a multiagent framework, each piece of academic information is structured as an agent and provided with autonomy. Consequently, since searches are iterated by the academic information itself, part of an evolving search is entrusted to the system, and the user's effort in retrieving academic information can be reduced effectively.

Toru Abe, Yuu Chiba, Suoya, Baoning Li, Tetsuo Kinoshita
Comparing Web Logs: Sensitivity Analysis and Two Types of Cross-Analysis

Different Web log studies calculate the same metrics using different search engine logs, sampled during different observation periods and processed under different values of two controllable variables peculiar to Web log analysis: a client discriminator used to exclude clients that are agents, and a temporal cut-off used to segment logged client transactions into temporal sessions. How much do the results depend on these variables? We analyze the sensitivity of the results to these two controllable variables. The sensitivity analysis shows that metric values vary significantly with them; in particular, the metrics vary by up to 30-50% over commonly assigned values. The differences caused by controllable variables are thus of the same order of magnitude as the differences between the metrics reported in different studies, so direct comparison of reported results is an unreliable approach that can lead to artifactual conclusions. To overcome the method-dependency of directly comparing reported results, we introduce and use a cross-analysis technique for the direct comparison of logs. In addition, we propose an alternative, easily accessible comparison of reported metrics, which corrects the reported values according to the controllable variables used in the studies.

Nikolai Buzikashvili
Concept Propagation Based on Visual Similarity: Application to Medical Image Annotation

This paper presents an approach for propagating annotations to images which have no annotations. In some specific domains, the assumption can be made that visual similarity implies (partial) semantic similarity. For instance, in medical imaging, two images of the same anatomic part in a given modality have a very similar appearance. In the proposed approach, a conceptual indexing phase extracts concepts from texts; a visual similarity between images is computed and then combined with the conceptual text indexing. Annotation propagation driven by prior knowledge of the domain is finally performed. The domain knowledge used is a metathesaurus, for both indexing and annotation propagation. The proposed approach has been applied to the ImageCLEF medical image collection.

Jean-Pierre Chevallet, Nicolas Maillot, Joo-Hwee Lim
Query Structuring with Two-Stage Term Dependence in the Japanese Language

We investigate the effectiveness of query structuring in the Japanese language by composing or decomposing compound words and phrases. Our method is based on a theoretical framework using Markov random fields. Our two-stage term dependence model captures both the global dependencies between query components explicitly delimited by separators in a query, and the local dependencies between constituents within a compound word when the compound word appears in a query component. We show that our model works well, particularly when using query structuring with compound words, through experiments using a 100-gigabyte web document collection mostly written in Japanese.

Koji Eguchi, W. Bruce Croft
Automatic Expansion of Abbreviations in Chinese News Text

This paper presents an n-gram based approach to Chinese abbreviation expansion. In this study, we distinguish reduced abbreviations from non-reduced abbreviations that are created by elimination or generalization. For a reduced abbreviation, a mapping table is compiled to map each short-word in it to a set of long-words, and a bigram based Viterbi algorithm is thus applied to decode an appropriate combination of long-words as its full-form. For a non-reduced abbreviation, a dictionary of non-reduced abbreviation/full-form pairs is used to generate its expansion candidates, and a disambiguation technique is further employed to select a proper expansion based on bigram word segmentation. The evaluation on an abbreviation-expanded corpus built from the PKU corpus showed that the proposed system achieved a recall of 82.9% and a precision of 85.5% on average for different types of abbreviations in Chinese news text.

Guohong Fu, Kang-Kwong Luke, GuoDong Zhou, Ruifeng Xu
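
For reduced abbreviations, the bigram Viterbi decoding step can be sketched as follows; the mapping table and log-probabilities are toy assumptions rather than values learned from the PKU corpus.

    # Minimal sketch of the bigram Viterbi step for reduced abbreviations:
    # each short-word maps to candidate long-words and the decoder picks the
    # combination with the highest bigram log-probability. Toy data throughout.
    def expand(shorts, mapping, logp, unk=-10.0):
        """shorts: short-words; mapping: {short: [long, ...]}; logp: bigram log-probs."""
        # best[w] = (score of the best expansion ending in long-word w, that path)
        best = {w: (logp.get(("<s>", w), unk), [w]) for w in mapping[shorts[0]]}
        for s in shorts[1:]:
            new_best = {}
            for w in mapping[s]:
                prev, (score, path) = max(
                    best.items(),
                    key=lambda kv: kv[1][0] + logp.get((kv[0], w), unk))
                new_best[w] = (score + logp.get((prev, w), unk), path + [w])
            best = new_best
        return max(best.values())[1]

    mapping = {"北": ["北京", "北方"], "大": ["大学", "大型"]}
    logp = {("<s>", "北京"): -1.0, ("<s>", "北方"): -2.0,
            ("北京", "大学"): -0.5, ("北方", "大学"): -3.0}
    print("".join(expand(["北", "大"], mapping, logp)))  # -> 北京大学
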
A Novel Ant-Based Clustering Approach for Document Clustering

Recently, much research has proposed using nature-inspired algorithms to perform complex machine learning tasks. Ant Colony Optimization (ACO) is one such algorithm based on swarm intelligence, derived from a model inspired by the collective foraging behavior of ants. Taking advantage of ACO traits such as self-organization and robustness, this paper proposes a novel document clustering approach based on ACO. Unlike other ACO-based clustering approaches, which are based on the scenario of ants moving around a 2D grid and carrying or dropping objects to perform categorization, our proposed ant-based clustering approach does not rely on a 2D grid structure. In addition, it can generate an optimal number of clusters without incorporating any other algorithm such as K-means or AHC. Experimental results on subsets of the 20 Newsgroups data show that the ant-based clustering approach outperforms classical document clustering methods such as K-means and Agglomerative Hierarchical Clustering. It also achieves better results than the Artificial Immune Network algorithm when tested on the same datasets.

Yulan He, Siu Cheung Hui, Yongxiang Sim
Evaluating Scalability in Information Retrieval with Multigraded Relevance

From the user's point of view, in large environments it can be desirable to have Information Retrieval Systems (IRS) that retrieve documents according to their relevance levels. Relevance levels have been studied in some previous Information Retrieval (IR) works, while a few others have tackled the questions of IRS effectiveness and collection size. These latter works used standard IR measures on collections of increasing size to analyze the scalability of IRS effectiveness. In this work, we bring together these two IR issues (multigraded relevance and scalability) by designing new metrics for evaluating the ability of an IRS to rank documents according to their relevance levels as collection size increases.

Amélie Imafouo, Michel Beigbeder
Text Mining for Medical Documents Using a Hidden Markov Model

We propose a semantic tagger that provides high level concept information for phrases in clinical documents. It delineates such information from the statements written by doctors in patient records. The tagging, based on Hidden Markov Model (HMM), is performed on the documents that have been tagged with Unified Medical Language System (UMLS), Part-of-Speech (POS), and abbreviation tags. The result can be used to extract clinical knowledge that can support decision making or quality assurance of medical treatment.

Hyeju Jang, Sa Kwang Song, Sung Hyon Myaeng
Multi-document Summarization Based on Unsupervised Clustering

In this paper, we propose a method for multi-document summarization based on unsupervised clustering. First, the main topics are determined by an MDL-based clustering strategy capable of inferring the optimal number of clusters. Then, the problem of multi-document summarization is formalized on the clusters using an entropy-based objective function.

Paul Ji
A Content-Based 3D Graphic Information Retrieval System

This paper presents a 3D graphic information retrieval system which supports content-based retrieval of 3D graphics. On the web interface, the user can pose a visual query involving various 3D graphic features such as inclusion of a given object, an object's shape, descriptive information, and spatial relations. The data model underlying the retrieval system models 3D scenes using domain objects and their spatial relations. An XML-based data modeling language called 3DGML has been designed to support the data model. We discuss the data modeling technique and the retrieval system in detail.

Yonghwan Kim, Soochan Hwang
Query Expansion for Contextual Question Using Genetic Algorithms

We propose a query expansion method for Japanese using Genetic Algorithms (GA). Recently, question answering research has focused on contextual questions, so a question answering system has to resolve contextual problems by using both previous questions and previous answers. This problem is largely related to query expansion because of the need to find new keywords: in contextual processing, a query needs other suitable keywords from related resources. Although it is easy for a system to find related words, it is difficult to find a suitable combination of keywords. GA is well suited to combination problems such as the knapsack problem, so we apply GA to our contextual query expansion method. In the evaluation experiment, the MRR was 0.2531 on 360 contextual questions. We confirm that the MRR of our method is higher than that of the baseline, and we illustrate our method and the experiment.

Yasutomo Kimura, Kenji Araki
Fine-Grained Named Entity Recognition Using Conditional Random Fields for Question Answering

In many QA systems, fine-grained named entities are extracted by a coarse-grained named entity recognizer and a fine-grained named entity dictionary. In this paper, we describe fine-grained Named Entity Recognition using Conditional Random Fields (CRFs) for question answering. We used CRFs to detect the boundaries of named entities and Maximum Entropy (ME) to classify named entity classes. Using the proposed approach, we achieved an 83.2% precision, a 74.5% recall, and a 78.6% F1 for 147 fine-grained named entity types. Moreover, we reduced the training time to 27% of a baseline model's without loss of performance. In question answering, the QA system with passage retrieval and AIU achieved about a 26% improvement over QA with passage retrieval alone. The results demonstrate that our approach is effective for QA.

Changki Lee, Yi-Gyu Hwang, Hyo-Jung Oh, Soojong Lim, Jeong Heo, Chung-Hee Lee, Hyeon-Jin Kim, Ji-Hyun Wang, Myung-Gil Jang
A Hybrid Model for Sentence Ordering in Extractive Multi-document Summarization

Ordering information is a critical task for multi-document summarization because it heavily influences the coherence of the generated summary. In this paper, we propose a hybrid model for sentence ordering in extractive multi-document summarization that combines four relations between sentences. This model regards sentences as vertices and combined relations as edges of a directed graph, on which an approximately optimal ordering can be generated with PageRank analysis. Evaluation of our hybrid model shows a significant improvement in ordering over strategies that omit some of the relations, and the results also indicate that the hybrid model is robust for articles of different genres.

Dexi Liu, Zengchang Zhang, Yanxiang He, Donghong Ji
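
A minimal sketch of the graph step described above: sentences as vertices, a combined "a should precede b" strength as weighted directed edges, and PageRank scores driving the ordering. The paper's four relations and its decoding procedure are replaced here by toy weights and a simple sort; sentences that accumulate rank (pointed to by many predecessors) are placed later.

    # Minimal sketch: PageRank over a directed "should-precede" sentence graph.
    import networkx as nx

    def order_sentences(precede_weights):
        """precede_weights: {(a, b): w} meaning sentence a should precede b."""
        g = nx.DiGraph()
        g.add_weighted_edges_from((a, b, w) for (a, b), w in precede_weights.items())
        scores = nx.pagerank(g, weight="weight")
        # Sentences that many predecessors point to accumulate rank: place later.
        return sorted(scores, key=scores.get)

    weights = {("s1", "s2"): 0.9, ("s1", "s3"): 0.4,
               ("s2", "s3"): 0.8, ("s3", "s4"): 0.7}
    print(order_sentences(weights))   # -> ['s1', 's2', 's3', 's4']
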
Automatic Query Type Identification Based on Click Through Information

We report on a study undertaken to better identify users' goals behind web search queries by using click-through data. Based on user logs containing over 80 million queries and the corresponding click-through data, we found that query type identification benefits from click-through data analysis, while anchor text information may not be as useful because it is only accessible for a small portion (about 16%) of practical user queries. We also propose two novel features extracted from click-through data and a decision tree based classification algorithm for identifying user queries. Our experimental evaluation shows that this algorithm can correctly identify the goals of about 80% of web search queries.

Yiqun Liu, Min Zhang, Liyun Ru, Shaoping Ma
Japanese Question-Answering System for Contextual Questions Using Simple Connection Method, Decreased Adding with Multiple Answers, and Selection by Ratio

We participated in NTCIR QAC-1, QAC-2, and QAC-3, question-answering evaluation workshops held by the National Institute of Informatics of Japan, and studied question-answering systems for contextual questions. Contextual questions are defined as a series of questions with contexts. For example, the first question might be "What is the capital of Japan?" and the succeeding one related to it, such as "What was it called in ancient times?". Contextual question-answering can thus be considered interactive. This paper describes our system for contextual questions, which obtained the second best accuracy in QAC-1 and the best accuracy in both QAC-2 and QAC-3 for contextual question-answering; it is thus a high-performance system.

Masaki Murata, Masao Utiyama, Hitoshi Isahara
Multi-document Summarization Using a Clustering-Based Hybrid Strategy

In this paper we propose a clustering-based hybrid approach for multi-document summarization which integrates sentence clustering, local recommendation, and global search. For sentence clustering, we adopt a stability-based method which can determine the optimal number of clusters automatically. We weight sentences by the terms they contain for local sentence recommendation within each cluster. For global selection, we propose a global criterion to evaluate the overall quality of a summary. Thus the sentences in the final summary are determined not only by the configuration of individual clusters but also by the overall performance. This approach achieves top-level performance on the DUC04 corpus.

Yu Nie, Donghong Ji, Lingpeng Yang, Zhengyu Niu, Tingting He
A Web User Preference Perception System Based on Fuzzy Data Mining Method

In a competitive environment, providing suitable information and products to meet customer requirements and improve customer satisfaction is one key factor in measuring a company's competitiveness. In this paper, we propose a preference perception system that combines fuzzy set theory with data mining technology to detect the information preferences of each user in a web-based environment. An experiment was implemented to demonstrate the feasibility and effectiveness of the proposed system. It indicates that the proposed system can effectively perceive changes in users' information preferences in a Web environment.

Wei-Shen Tai, Chen-Tung Chen
An Analysis on Topic Features and Difficulties Based on Web Navigational Retrieval Experiments

We analyze the relationship between topic features and difficulty in Web navigational retrieval tasks based on experiments done on the NTCIR-5 Web test collection. Our analysis shows that the difficulty of a retrieval task is closely related to the specificity of the topic, and that topics of certain categories are more difficult than others. For example, a representative page of a company or an organization is on average easier to find than that of a person, a product, or an event. Our results show that adding metadata to a topic could potentially help search engines predict the difficulty of the task. Additionally, we show that the number of unique documents retrieved by different systems correlates weakly with the query's performance.

Masao Takaku, Keizo Oyama, Akiko Aizawa
Towards Automatic Domain Classification of Technical Terms: Estimating Domain Specificity of a Term Using the Web

This paper proposes a method for estimating the domain specificity of technical terms using the Web. The proposed method assumes that, for a certain technical domain, a list of known technical terms of the domain is given. Technical documents of the domain are collected through a Web search engine and are then used to generate a vector space model for the domain. The domain specificity of a target term is estimated according to the domain distribution of the target term's sample pages. Experimental evaluation shows that the proposed method achieves roughly 90% precision/recall.

Takehito Utsuro, Mitsuhiro Kida, Masatsugu Tonoike, Satoshi Sato
Evaluating Score Normalization Methods in Data Fusion

In data fusion, score normalization is a step to make scores, which are obtained from different component systems for all documents, comparable to each other. It is an indispensable step for effective data fusion algorithms such as CombSum and CombMNZ to combine them. In this paper, we evaluate four linear score normalization methods, namely the fitting method, Zero-one, Sum, and ZMUV, through extensive experiments. The experimental results show that the fitting method and Zero-one appear to be the two leading methods.

Shengli Wu, Fabio Crestani, Yaxin Bi
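
Three of the four linear normalizations compared in the paper (Zero-one, Sum, and ZMUV) are simple enough to sketch directly; the fitting method requires training data and is omitted. The raw runs below are toy scores, fused with CombSum after Zero-one normalization.

    # Minimal sketch of three linear score normalizations and CombSum fusion.
    import statistics

    def zero_one(scores):
        lo, hi = min(scores), max(scores)
        return [(s - lo) / (hi - lo) for s in scores] if hi > lo else [0.0] * len(scores)

    def sum_norm(scores):
        total = sum(scores)
        return [s / total for s in scores]

    def zmuv(scores):   # zero mean, unit variance
        mu, sigma = statistics.mean(scores), statistics.pstdev(scores)
        return [(s - mu) / sigma for s in scores]

    run_a = [4.2, 3.1, 0.5]   # raw scores of system A for docs d1..d3
    run_b = [0.9, 0.8, 0.1]   # raw scores of system B, on a different scale
    fused = [a + b for a, b in zip(zero_one(run_a), zero_one(run_b))]   # CombSum
    print(fused)              # ~[2.0, 1.58, 0.0]; d1 ranks first
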
WIDIT: Integrated Approach to HARD Topic Search

The Web Information Discovery Tool (WIDIT) Laboratory at the Indiana University School of Library and Information Science, whose basic approach is to combine multiple methods and leverage multiple sources of evidence, participated in the 2005 Text REtrieval Conference's HARD track (HARD-2005) to investigate methods of effectively dealing with HARD topics by exploring a variety of query expansion strategies, the results of which were combined via an automatic fusion optimization process. We hypothesized that the "difficulty" of topics is often due to the lack of appropriate query terms and/or misguided emphasis on non-pivotal query terms by the system. Thus, our first-tier solution was to devise a wide range of query expansion methods that can not only enrich the query with useful term additions but also identify important query terms. Our automatic query expansion included such techniques as noun phrase extraction, synonym identification, definition term extraction, keyword extraction by overlapping sliding window, and Web query expansion. The results of automatic expansion were used in soliciting user feedback, which was utilized in a post-retrieval reranking process. The paper describes our participation in HARD-2005 and is organized as follows: Section 2 gives an overview of the HARD track, Section 3 describes the WIDIT approach to HARD-2005, and Section 4 discusses the results and implications, followed by concluding remarks in Section 5.

Kiduk Yang, Ning Yu, Hui Zhang, Shahrier Akram, Ivan Record
Automatic Query Expansion Using Data Manifold

This paper proposes an automatic query expansion method that combines document re-ranking with standard Rocchio relevance feedback. The document re-ranking method ranks the top retrieved documents based on the intrinsic manifold structure collectively revealed by a large amount of data. This is done by using a semi-supervised learning algorithm to integrate pseudo-relevant documents with the documents to be re-ranked. Given an initial ranked list of retrieved documents, the document re-ranking approach picks a set of documents from the top ones (including the query itself) as pseudo-relevant documents. In this way, the intrinsic relationship of all the retrieved documents to be re-ranked with the pseudo-relevant documents (pseudo-irrelevant documents are missing) can be determined via a semi-supervised learning algorithm. Finally, all the retrieved documents can be re-ranked according to this relationship. Evaluation on benchmark corpora shows that the approach achieves much better performance than standard Rocchio relevance feedback and better performance than other related approaches.

Lingpeng Yang, Donghong Ji, Yu Nie, Tingting He
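
A minimal sketch of the standard Rocchio update used as the second stage, assuming the top re-ranked documents serve as the pseudo-relevant set; alpha and beta are conventional values, not the authors', and no negative term is used since no pseudo-irrelevant set is available.

    # Minimal sketch of the Rocchio update on term-weight vectors; toy data.
    import numpy as np

    def rocchio(query_vec, pseudo_rel_docs, alpha=1.0, beta=0.75):
        """query_vec: (v,) term weights; pseudo_rel_docs: (k, v) doc-term matrix."""
        return alpha * query_vec + beta * pseudo_rel_docs.mean(axis=0)

    q = np.array([1.0, 0.0, 0.0])           # query mentions term 0 only
    top_docs = np.array([[0.5, 0.4, 0.0],   # top documents after manifold-based
                         [0.6, 0.2, 0.1]])  # re-ranking (toy weights)
    print(rocchio(q, top_docs))             # terms 1 and 2 enter the expanded query
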
An Empirical Comparison of Translation Disambiguation Techniques for Chinese–English Cross-Language Information Retrieval

Disambiguation techniques are typically employed to reduce translation errors introduced during query translation in cross-lingual information retrieval. Previous work has used several techniques based on term similarity, term co-occurrence, and language modelling. However, the previous experiments were conducted on different data sets, and thus the relative merits of each technique are presently unclear. The goal of this work is to compare the effectiveness of these techniques on the same Chinese-English data sets. Our results show that despite the different underlying models and formulae used, the aggregated results are comparable. However, there is wide variation in the translation of individual queries, suggesting that there is scope for further improvement.

Ying Zhang, Phil Vines, Justin Zobel
Web Mining for Lexical Context-Specific Paraphrasing

In most applications of paraphrasing, contextual information should be considered, since a word may have different paraphrases in different contexts. This paper presents a method that automatically acquires lexical context-specific paraphrases from the web. The method includes two main stages: candidate paraphrase extraction and paraphrase validation. Evaluations were conducted on a news title corpus, on which the context-specific paraphrasing method was compared with a Chinese synonym thesaurus. Results show that the precision of our method is above 60% and the recall is above 55%, significantly outperforming the thesaurus.

Shiqi Zhao, Ting Liu, Xincheng Yuan, Sheng Li, Yu Zhang
Backmatter
Metadata
Title: Information Retrieval Technology
Edited by: Hwee Tou Ng, Mun-Kew Leong, Min-Yen Kan, Donghong Ji
Copyright Year: 2006
Publisher: Springer Berlin Heidelberg
Electronic ISBN: 978-3-540-46237-8
Print ISBN: 978-3-540-45780-0
DOI: https://doi.org/10.1007/11880592
