
2011 | Book

Information Retrieval Technology

7th Asia Information Retrieval Societies Conference, AIRS 2011, Dubai, United Arab Emirates, December 18-20, 2011. Proceedings

Edited by: Mohamed Vall Mohamed Salem, Khaled Shaalan, Farhad Oroumchian, Azadeh Shakery, Halim Khelalfa

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


About this book

This book constitutes the refereed proceedings of the 7th Asia Information Retrieval Societies Conference, AIRS 2011, held in Dubai, United Arab Emirates, in December 2011.

The 31 revised full papers and 25 revised poster papers presented were carefully reviewed and selected from 132 submissions. All current aspects of information retrieval - in theory and practice - are addressed; the papers are organized in topical sections on information retrieval models and theories; information retrieval applications and multimedia information retrieval; user study, information retrieval evaluation and interactive information retrieval; Web information retrieval, scalability and adversarial information retrieval; machine learning for information retrieval; natural language processing for information retrieval; and Arabic script text processing and retrieval.

Table of Contents

Frontmatter

Information Retrieval Models and Theories

Query-Dependent Rank Aggregation with Local Models

The technologies of learning to rank have been successfully used in information retrieval. General ranking approaches use all training queries to build a single ranking model and apply this model to all kinds of queries. Such a “global” ranking approach does not deal with the specific properties of individual queries. In this paper, we propose three query-dependent ranking approaches which combine the results of local models. We construct local models using clustering algorithms, represent queries in various ways (for example, by Kullback-Leibler divergence), and apply a ranking function to merge the results of the different local models. Experimental results show that our approaches are better than all rank-based aggregation approaches and some global models on LETOR4. In particular, we found that our approaches perform better on difficult queries.

Hsuan-Yu Lin, Chi-Hsin Yu, Hsin-Hsi Chen
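
To make the query-routing idea concrete, here is a minimal sketch (the unigram dictionaries, the epsilon smoothing, and the nearest-cluster rule are illustrative assumptions, not the authors' code): a query is represented as a unigram language model and assigned to the local model whose cluster language model is closest in Kullback-Leibler divergence.

```python
import math

def kl_divergence(p, q, eps=1e-10):
    """KL(p || q) between two unigram language models given as dicts; eps avoids log(0)."""
    vocab = set(p) | set(q)
    return sum(
        p[w] * math.log((p[w] + eps) / (q.get(w, 0.0) + eps))
        for w in vocab
        if p.get(w, 0.0) > 0.0
    )

def nearest_cluster(query_model, cluster_models):
    """Route a query to the local model whose language model is closest in KL."""
    return min(cluster_models, key=lambda c: kl_divergence(query_model, cluster_models[c]))

# Toy usage: models are unigram probability dictionaries (hypothetical values).
query = {"cheap": 0.5, "flights": 0.5}
clusters = {
    "travel":   {"cheap": 0.2, "flights": 0.3, "hotel": 0.5},
    "shopping": {"cheap": 0.6, "shoes": 0.4},
}
print(nearest_cluster(query, clusters))  # -> "travel" (covers both query terms)
```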
On Modeling Rank-Independent Risk in Estimating Probability of Relevance

Estimating the probability of relevance for a document is fundamental in information retrieval. From a theoretical point of view, risk exists in the estimation process, in the sense that the estimated probabilities may not precisely match the actual ones. The estimation risk is often considered to be rank-dependent. For example, the probability ranking principle assumes that ranking documents in order of decreasing probability of relevance optimizes rank effectiveness. This implies that a precise estimation can yield an optimal rank. However, an optimal (or even ideal) rank does not guarantee that the estimated probabilities are precise. This means that part of the estimation risk is rank-independent. It imposes practical risks in applications such as pseudo-relevance feedback, where differently estimated probabilities of relevance in the first-round retrieval make a difference even when two ranks are identical. In this paper, we explore the effect and the modeling of such rank-independent risk. A risk management method is proposed to adaptively adjust the rank-independent risk. Experimental results on several TREC collections demonstrate the effectiveness of the proposed models for both pseudo-relevance feedback and relevance feedback.

Peng Zhang, Dawei Song, Jun Wang, Xiaozhao Zhao, Yuexian Hou
Measuring the Ability of Score Distributions to Model Relevance

Modelling the score distribution of documents returned from any information retrieval (IR) system is of both theoretical and practical importance. The goal is to be able to infer, to some degree of confidence, which documents are relevant and which are non-relevant based on their scores.

In this paper, we show how the performance of mixtures of score distributions can be compared using inference of query performance as a measure of utility. We (1) outline methods which can directly calculate average precision from the parameters of a mixture distribution. We (2) empirically evaluate a number of mixtures for the task of inferring query performance, and show that the log-normal mixture can model more relevance information compared to other possible mixtures. Finally, we (3) perform an empirical analysis of the mixtures using the recall-fallout convexity hypothesis.

Ronan Cummins
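
To make the mixture idea concrete, here is an illustrative sketch (the parameter values, and the choice of log-normal components for both classes, are demonstration assumptions only): given a fitted two-component mixture over scores, Bayes' rule recovers the probability that a document with a given score is relevant, which is the kind of relevance information such a model encodes.

```python
import math

def lognormal_pdf(x, mu, sigma):
    """Density of a log-normal distribution at x > 0."""
    if x <= 0:
        return 0.0
    coeff = 1.0 / (x * sigma * math.sqrt(2 * math.pi))
    return coeff * math.exp(-((math.log(x) - mu) ** 2) / (2 * sigma ** 2))

def p_relevant_given_score(score, lam, rel=(1.0, 0.4), nonrel=(0.0, 0.6)):
    """Posterior P(relevant | score) under a two-component log-normal mixture.
    lam is the mixing weight (proportion of relevant documents);
    rel / nonrel are (mu, sigma) pairs -- illustrative values, not fitted."""
    pr = lam * lognormal_pdf(score, *rel)
    pn = (1 - lam) * lognormal_pdf(score, *nonrel)
    return pr / (pr + pn) if (pr + pn) > 0 else 0.0

# Higher scores should yield a higher posterior probability of relevance.
for s in (0.5, 1.5, 3.0, 5.0):
    print(s, round(p_relevant_given_score(s, lam=0.1), 3))
```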
Cross-Language Information Retrieval with Latent Topic Models Trained on a Comparable Corpus

In this paper we study cross-language information retrieval using a bilingual topic model trained on comparable corpora such as Wikipedia articles. The bilingual Latent Dirichlet Allocation model (BiLDA) creates an interlingual representation, which can be used as a translation resource in many different multilingual settings as comparable corpora are available for many language pairs. The probabilistic interlingual representation is incorporated in a statistical language model for information retrieval. Experiments performed on the English and Dutch test datasets of the CLEF 2001-2003 CLIR campaigns show the competitive performance of our approach compared to cross-language retrieval methods that rely on pre-existing translation dictionaries that are hand-built or constructed based on parallel corpora.

Ivan Vulić, Wim De Smet, Marie-Francine Moens
Construct Weak Ranking Functions for Learning Linear Ranking Function

Many learning to rank models, which apply machine learning techniques to fuse weak ranking functions and enhance ranking performance, have been proposed for web search. However, most of the existing approaches only apply the Min-Max normalization method to construct the weak ranking functions, without considering the differences among the ranking features. Ranking features, such as the content-based feature BM25 and the link-based feature PageRank, differ from each other in many aspects, and it is inappropriate to apply a uniform method to construct weak ranking functions from all of them. In this paper, comparing the three frequently used normalization methods (Min-Max, Log, and Arctan normalization), we analyze the differences among them when constructing weak ranking functions, and propose two normalization selection methods to decide which normalization should be used for a specific ranking feature. The experimental results show that the final ranking functions based on the normalization selection methods significantly outperform the original one.

Guichun Hua, Min Zhang, Yiqun Liu, Shaoping Ma, Hang Yin
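
The three normalizations compared above map a raw feature value into a bounded range; the formulas below follow their common textbook forms and may differ in detail from the authors' implementation.

```python
import math

def min_max(x, lo, hi):
    """Linear rescaling of x from [lo, hi] into [0, 1]."""
    return (x - lo) / (hi - lo) if hi > lo else 0.0

def log_norm(x, lo, hi):
    """Logarithmic normalization: compresses heavy-tailed features (e.g. PageRank)."""
    return math.log(1 + x - lo) / math.log(1 + hi - lo) if hi > lo else 0.0

def arctan_norm(x):
    """Arctan normalization: squashes an unbounded non-negative feature into [0, 1)."""
    return math.atan(x) * 2 / math.pi

# The three methods treat the same heavy-tailed values very differently.
scores = [0.1, 1.0, 10.0, 1000.0]
lo, hi = min(scores), max(scores)
for x in scores:
    print(x, round(min_max(x, lo, hi), 3), round(log_norm(x, lo, hi), 3), round(arctan_norm(x), 3))
```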
Is Simhash Achilles?

Simhash generates compact binary codes for the input data and thus improves search efficiency. Most recent works on Simhash are designed to speed up the search, generate high-quality descriptors, etc. However, few works discuss in what situations Simhash can be directly applied. This paper proposes a novel method to quantitatively analyze this question. Our method is based on Support Vector Data Description (SVDD), which tries to find a tight sphere that covers most of the points. Using the geometric relation between the unit sphere and the SVDD sphere, we give a quantitative analysis of the situations in which Simhash is feasible. We also extend the basic Simhash to handle the infeasible cases. To reduce the complexity, an approximation algorithm is proposed, which is easy to implement. We evaluate our method on synthetic data and a real-world image dataset. Most results show that our method outperforms the basic Simhash significantly.

Qixia Jiang, Yan Zhang, Liner Yang, Maosong Sun
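
For background, the basic Simhash analyzed above hashes a real-valued vector into b bits via random hyperplanes, so that similar vectors agree on most bits. A compact sketch of that baseline (not the paper's extension):

```python
import numpy as np

def simhash(vec, planes):
    """b-bit Simhash: the sign of the projection onto each random hyperplane."""
    return (planes @ vec >= 0).astype(np.uint8)

def hamming(a, b):
    """Number of differing bits between two codes."""
    return int(np.sum(a != b))

rng = np.random.default_rng(0)
b, dim = 64, 300
planes = rng.standard_normal((b, dim))  # one hyperplane per output bit

x = rng.standard_normal(dim)
y = x + 0.1 * rng.standard_normal(dim)   # a near-duplicate of x
z = rng.standard_normal(dim)             # an unrelated vector

print(hamming(simhash(x, planes), simhash(y, planes)))  # small
print(hamming(simhash(x, planes), simhash(z, planes)))  # near b/2
```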
XML Information Retrieval through Tree Edit Distance and Structural Summaries

Semi-structured Information Retrieval (SIR) allows the user to narrow a search down to the element level. As queries and XML documents can be seen as hierarchically nested elements, we consider that their structural proximity can be evaluated through the similarity of their trees. Our approach combines both content and structure scores, the latter being based on tree edit distance (the minimal cost of operations needed to turn one tree into another). We use the tree structure to propagate and combine both measures. Moreover, to overcome time and space complexity, we summarize the document tree structure. We experimented with various tree summary techniques as well as our original model using the SSCAS task of the INEX 2005 campaign. Results show that our approach outperforms state-of-the-art ones.

Cyril Laitang, Mohand Boughanem, Karen Pinel-Sauvagnat
An Empirical Study of SLDA for Information Retrieval

A common limitation of many language modeling approaches is that retrieval scores are mainly based on exact matching of terms in the queries and documents, ignoring the semantic relations among terms. Latent Dirichlet Allocation (LDA) is an approach that tries to capture the semantic dependencies among words. However, when used as a document representation, LDA has had no successful applications in information retrieval (IR). In this paper, we propose a single-document-based LDA (SLDA) document model for IR. The proposed work has been evaluated on four TREC collections, which shows that the SLDA document modeling method is comparable to state-of-the-art language modeling approaches, and it is a novel way to use the LDA model to improve retrieval performance.

Dashun Ma, Lan Rao, Ting Wang
Learning to Rank by Optimizing Expected Reciprocal Rank

Learning to rank is one of the hottest research areas in information retrieval, in which the listwise approach is an important research direction; listwise methods that directly optimize evaluation metrics have been used to optimize important ranking metrics such as MAP and NDCG. In this paper, the structural SVMs method is employed to optimize the Expected Reciprocal Rank (ERR) criterion; the resulting method is named SVMERR for short. It is compared with state-of-the-art algorithms. Experimental results show that SVMERR outperforms other methods on the OHSUMED and TD2003 datasets, which also indicates that optimizing the ERR criterion can improve ranking performance.

Ping Zhang, Hongfei Lin, Yuan Lin, Jiajin Wu
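
Expected Reciprocal Rank itself has a closed form: ERR = Σ_r (1/r) · R_r · Π_{i<r}(1 − R_i), where R_i = (2^{g_i} − 1)/2^{g_max} maps a graded relevance label g_i to a stopping probability. Below is a direct implementation of the metric (the gain mapping follows Chapelle et al.; the SVMERR optimization itself is not reproduced here):

```python
def err(grades, g_max=4):
    """Expected Reciprocal Rank for a ranked list of graded relevance labels."""
    p_continue = 1.0  # probability the user reaches the current rank
    total = 0.0
    for rank, g in enumerate(grades, start=1):
        r = (2 ** g - 1) / (2 ** g_max)  # probability the user stops at this result
        total += p_continue * r / rank
        p_continue *= (1 - r)
    return total

print(err([4, 0, 2, 1]))  # a highly relevant first result dominates the score
print(err([0, 1, 2, 4]))  # the same labels in a worse order score much lower
```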

Information Retrieval Applications and Multimedia Information Retrieval

Information Retrieval Strategies for Digitized Handwritten Medieval Documents

This paper describes and evaluates different IR models and search strategies for digitized manuscripts. Written during the thirteenth century, these manuscripts were digitized using an imperfect recognition system with a word error rate of around 6%. Having access to the internal representation during the recognition stage, we were able to produce four automatic transcriptions, each introducing some form of spelling correction as an attempt to improve retrieval effectiveness. We evaluated the retrieval effectiveness for each of these versions using three text representations combined with five IR models, three stemming strategies and two query formulations. We employed a manually-transcribed error-free version to define the ground truth. Based on our experiments, we conclude that taking account of the single best recognition word or all possible top-k recognition alternatives does not provide the best performance. Selecting all possible words each having a log-likelihood close to the best alternative yields the best text surrogate. Within this representation, different retrieval strategies tend to produce similar performance levels.

Nada Naji, Jacques Savoy
Query Phrase Expansion Using Wikipedia in Patent Class Search

Relevance feedback methods generally suffer from topic drift caused by word ambiguity and synonymous uses of words. As a way to alleviate this inherent problem, we propose a novel query phrase expansion approach utilizing semantic annotations in Wikipedia pages, trying to enrich queries with context-disambiguating phrases. Focusing on the patent domain, especially on patent search where patents are classified into a hierarchy of categories, we attempt to understand the roles of phrases and words in query expansion in determining the relevance of documents, and examine their contributions to alleviating the query drift problem. Our approach is compared against the relevance model, a state-of-the-art approach, to show its superiority in terms of MAP on all levels of the classification hierarchy.

Bashar Al-Shboul, Sung-Hyon Myaeng
Increasing Broadband Subscriptions for Telecom Carriers through Mobile Advertising

Mobile devices are popular, yet mobile broadband subscriptions amount to only 20 percent of mobile subscriptions, owing to the high fees required. On the other hand, US mobile ad spending will exceed US$1 billion in 2011 according to emarketer.com. It is therefore important for telecom carriers to increase broadband subscriptions by offering free or discounted fees funded through the deployment of a mobile advertising framework. The carrier runs an ad-agent platform to attract investment from advertisers, and subscribers read promotional advertisements to obtain discounted payments. While the advertisers pay a reasonable price for advertising, the resulting commercial activities bring revenues. As a result, this framework is a triple win for the carrier, the advertisers and the subscribers. We describe a framework for delivering appropriate ads at the ideal time, at the ideal place, to the ideal subscriber, addressing the three key issues of how and when to show the ads and which ads subscribers will potentially click.

Chia-Hui Chang, Kaun-Hua Huo
Query Recommendation by Modelling the Query-Flow Graph

Query recommendation has been widely applied in modern search engines to help users in their information seeking activities. Recently, the query-flow graph has shown its utility in query recommendation. However, there are two major problems in directly using the query-flow graph for recommendation. On one hand, due to the sparsity of the graph, one may not handle recommendation well for the many dangling queries in the graph. On the other hand, without addressing the ambiguous intents in such an aggregated graph, one may generate recommendations that either mix multiple intents together or are dominated by a certain intent. In this paper, we propose a novel mixture model that describes the generation of the query-flow graph. With this model, we can identify the hidden intents of queries from the graph. We then apply an intent-biased random walk over the graph for query recommendation. Empirical experiments are conducted on real-world query logs, and both the qualitative and quantitative results demonstrate the effectiveness of our approach.

Lu Bai, Jiafeng Guo, Xueqi Cheng
Ranking Content-Based Social Images Search Results with Social Tags

With the recent rapid growth of social image hosting websites, such as Flickr, it is easier to construct a large database with tagged images. Social tags have been proven to be effective for providing keyword-based image retrieval and widely used on these websites, but whether they are beneficial for improving content-based image retrieval has not been well investigated in previous work. In this paper, we investigate whether and how social tags can be used for improving content-based image search results. We propose an unsupervised approach for automatic ranking without user interactions. It propagates visual and textual information on an image-tag relationship graph with a mutual reinforcement process. We conduct experiments showing that our approach can successfully use social tags for ranking and improving content-based social image search results, and performs better than other approaches.

Jiyi Li, Qiang Ma, Yasuhito Asano, Masatoshi Yoshikawa

User Study, Information Retrieval Evaluation and Interactive Information Retrieval

Profiling a Non-medical Professional Searcher on a Medical Domain: What Do Search Patterns and Demographic Details Reveal?

Previous research is able to distinguish the search patterns of domain experts and non-domain experts. However, little is known about the finer details of non-domain expert searchers, especially when a non-domain expert performs a search on an expert-type domain. Do non-domain experts search similarly? What can we learn and infer from their search patterns? More importantly, can we identify the searcher behind the search? In this paper, we perform a study of non-domain experts' search behavior on an expert domain. Our results indicate that search patterns can be used to broadly classify a non-domain expert searcher.

Anushia Inthiran, Saadat M. Alhashmi, Pervaiz K. Ahmed
Prioritized Aggregation of Multiple Context Dimensions in Mobile IR

An interesting aspect emerging in mobile information retrieval is related to the several contextual features that can be considered as new dimensions in the relevance assessment process. In this paper, we propose a multidimensional ranking model based on the three dimensions of topic, interest, and location. The peculiarity of our multidimensional ranking lies in a “prioritized combination” of the considered criteria, using the “prioritized scoring” and “prioritized and” operators, which allow flexible personalization of search results according to users’ preferences. In order to evaluate the effectiveness of our model, we propose a simulation-based evaluation framework that investigates the integration of the contextual dimensions into the evaluation process. Extensive experimental results obtained using our simulation framework show the effectiveness of our multidimensional personalized ranking model.

Ourdia Bouidghaghen, Lynda Tamine-Lechani, Gabriella Pasi, Guillaume Cabanac, Mohand Boughanem, Célia da Costa Pereira
Searching for Islamic and Qur’anic Information on the Web: A Mixed-Methods Approach

This paper seeks to understand and describe web searching patterns for Islamic and Qur’anic information, an area receiving little attention in past research. A mixed-methods approach has been taken to data collection utilizing both quantitative and qualitative techniques. Query logs collected in 2006 from the Microsoft Live search engine were analysed for Islamic-related terms. Characteristics such as query frequency, term frequency, query length, and session length were derived from the data. To complement these quantitative data, interview data were collected from 25 users who had experienced searching for Islamic and/or Qur’anic materials on the web. The interviews gave a deeper understanding of aspects of information seeking including search processes, challenges and opinions on locating Islamic and Qur’anic information on the web.

Rita Wan-Chik, Paul Clough, Nigel Ford
Enriching Query Flow Graphs with Click Information

The increased availability of large amounts of data about user search behaviour in search engines has triggered a lot of research in recent years. This includes developing machine learning methods to build knowledge structures that can be exploited for tasks such as query recommendation. Query flow graphs are a successful example of these structures; they are generated from the sequences of queries typed in by users in search sessions. In this paper we propose to modify the query flow graph by incorporating clickthrough information from the search logs. Click information provides evidence of the success or failure of the search journey and can therefore be used to enrich the query flow graph and make it more accurate and useful for query recommendation. We propose a method of adjusting the weights on the edges of the query flow graph by incorporating the number of documents clicked after submitting a query.

We explore a number of weighting functions for the graph edges using click information. Applying an automated evaluation framework to assess query recommendations allows us to perform automatic and reproducible evaluation experiments. We demonstrate how our modified query flow graph outperforms the standard query flow graph. The experiments are conducted on the search logs of an academic organisation’s search engine and validated in a second experiment on the log files of another Web site.

M-Dyaa Albakour, Udo Kruschwitz, Ibrahim Adeyanju, Dawei Song, Maria Fasli, Anne De Roeck
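
A minimal sketch of the enrichment step (the exact weighting function is one of several the paper explores, so the form below is an assumption): transition counts between consecutive queries are scaled by the click evidence observed after the follow-up query, then normalized into transition probabilities.

```python
from collections import defaultdict

def build_click_weighted_graph(sessions):
    """sessions: list of [(query, n_clicked_docs), ...] in submission order.
    Edge weight = transition count scaled by clicks on the follow-up query."""
    weights = defaultdict(float)
    for session in sessions:
        for (q1, _), (q2, clicks2) in zip(session, session[1:]):
            weights[(q1, q2)] += 1.0 * (1 + clicks2)  # reward transitions that led to clicks
    # Normalize outgoing edge weights into transition probabilities.
    out_totals = defaultdict(float)
    for (q1, _), w in weights.items():
        out_totals[q1] += w
    return {(q1, q2): w / out_totals[q1] for (q1, q2), w in weights.items()}

# Toy log: the reformulation that attracted more clicks gets a stronger edge.
sessions = [
    [("jaguar", 0), ("jaguar car", 3)],
    [("jaguar", 0), ("jaguar animal", 1)],
    [("jaguar", 0), ("jaguar car", 2)],
]
for edge, p in sorted(build_click_weighted_graph(sessions).items()):
    print(edge, round(p, 3))
```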
Effect of Explicit Roles on Collaborative Search in Travel Planning Task

This paper presents a task-based user study carried out to investigate how explicit roles assigned to group members affect collaborative information seeking behaviour during a travel planning task. Twenty-four pairs participated in our study; half of them were given a specific instruction to divide the roles into a searcher and a writer, while the others were given no such instruction. The evaluation looked at the travel plans generated, search interaction logs, task perceptions, and dialogues between members. The results suggest that an explicit division of roles can have significant effects on a group’s knowledge building during the collaborative search task. The paper also discusses experimental designs of task-based collaborative search studies.

Marika Imazu, Shin’ichi Nakayama, Hideo Joho
A Web 2.0 Approach for Organizing Search Results Using Wikipedia

Most current search engines return a ranked list of results in response to the user’s query. This simple approach may require the user to go through a long list of results to find the documents related to their information need. A common alternative is to cluster the search results and allow the user to browse the clusters, but this imposes two challenges: ‘how to define the clusters’ and ‘how to label the clusters in an informative way’. In this study, we propose an approach which uses Wikipedia as the source of information to organize the search results and addresses these two challenges. In response to a query, our method extracts a hierarchy of categories from Wikipedia pages and trains classifiers using web pages related to these categories. The search results are organized in the extracted hierarchy using the learned classifiers. Experimental results confirm the effectiveness of the proposed approach.

Mohammadreza Darvish Morshedi Hosseini, Azadeh Shakery, Behzad Moshiri

Web Information Retrieval, Scalability and Adversarial Information Retrieval

Recommend at Opportune Moments

We propose an approach that adapts the existing item-based (movie-based) collaborative filtering algorithm, using the timestamps of ratings, to recommend movies to users at opportune moments. Over the last few years, researchers have focused recommendation problems mostly on rating scores: they analyzed users’ previous rating scores and predicted the unknown ones. However, we found that rating scores are not the only problem to be concerned about. When to recommend movies to users is also important for a recommender system, since users’ shopping habits vary from person to person. To recommend movies at opportune moments, we analyzed the rating distribution of each movie by timestamp and found that users tend to watch similar movies at similar moments. Several experiments have been conducted on the MovieLens data sets. The system is evaluated by different recommendation lists during a specific period of time, t_specific, and the experimental results show the usefulness of our system.

Chien-Chin Su, Pu-Jen Cheng
Emotion Tokens: Bridging the Gap among Multilingual Twitter Sentiment Analysis

Twitter is a microblogging service where worldwide users publish their feelings. However, sentiment analysis for Twitter messages (tweets) is regarded as a challenging problem because tweets are short and informal. In this paper, we focus on this problem through the analysis of emotion tokens, including emotion symbols (e.g. emoticons), irregular forms of words and combined punctuation. According to our observation of five million tweets, these emotion tokens are commonly used (0.47 emotion tokens per tweet). They directly express one’s emotion regardless of language, and hence become a useful signal for sentiment analysis on multilingual tweets. Firstly, emotion tokens are extracted automatically from tweets. Secondly, a graph propagation algorithm is proposed to label the tokens’ polarities. Finally, a multilingual sentiment analysis algorithm is introduced. Comparative evaluations are conducted against a semantic-lexicon-based approach and some state-of-the-art Twitter sentiment analysis Web services, on both English and non-English tweets. Experimental results show the effectiveness of the proposed algorithms.

Anqi Cui, Min Zhang, Yiqun Liu, Shaoping Ma
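
A rough sketch of the extraction idea (the patterns below are illustrative stand-ins; the paper derives its token classes from the observed tweets): regular expressions pick out emoticons, elongated words, and combined punctuation, all of which are largely language-independent.

```python
import re

# Three illustrative token classes: emoticons, elongated words, combined punctuation.
EMOTICON = re.compile(r"[:;=8][-o*']?[)(\]\[dDpP/\\|]")
ELONGATED = re.compile(r"\b\w*(\w)\1{2,}\w*\b")   # e.g. "soooo", "yaaay"
PUNCT_RUN = re.compile(r"[!?]{2,}")               # e.g. "!!!", "?!?"

def emotion_tokens(tweet):
    """Extract language-independent emotion tokens from a tweet."""
    tokens = []
    for pattern in (EMOTICON, ELONGATED, PUNCT_RUN):
        tokens.extend(m.group(0) for m in pattern.finditer(tweet))
    return tokens

print(emotion_tokens("I looove this :) sooo much!!!"))
# -> [':)', 'looove', 'sooo', '!!!']
```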
Identifying Popular Search Goals behind Search Queries to Improve Web Search Ranking

Web users usually have a certain search goal before they submit a search query. However, many laypersons cannot transform their search goals into suitable queries, so understanding the original search goal behind a query is very important for search engines. In the past decade, much research has focused on classifying the search goals behind a query into different search-goal categories. In fact, there may be more than one search goal behind a certain query. We thus propose a novel Popular-Search-Goal-based Search Model to effectively identify search goals using features extracted from search-result snippets and click-through data. Furthermore, we propose a Search-Goal-based Ranking Model which exploits the identified search goals to re-rank the search results. The experimental results show that our proposed model can effectively identify the search goals behind a search query (achieving a precision of 0.94) and enhance the search result ranking (achieving a precision of 0.72 for the top-1 returned snippet).

Wang Ting-Xuan, Lu Wen-Hsiang
A Novel Crawling Algorithm for Web Pages

The crawler is a main component of a search engine, responsible for discovering and downloading web pages. No search engine can cover the whole of the web, so it has to focus on the most valuable web pages. Several crawling algorithms, such as PageRank, OPIC and FICA, have been proposed, but they have low throughput. To overcome this problem, we propose a new crawling algorithm called FICA+, which is easy to implement. In FICA+, the importance of a page is determined based on the logarithmic distance and the weights of its incoming links. To evaluate FICA+ we use the web graph of the University of California, Berkeley. Experimental results show that our algorithm outperforms other crawling algorithms in discovering highly important pages.

Mohammad Amin Golshani, Vali Derhami, AliMohammad ZarehBidoki
Extraction of Web Texts Using Content-Density Distribution

We propose a method for grasping the content of each Web page and extracting a part of the Web page related to query keywords, in order to make more effective snippets of a Web search engine. We regard the content as a set of words in the text of a Web page, and we generate the content-density distribution by using both the position and the influence of the word. In our experiments, we found that the proposed method facilitated the recognition of the content of Web pages, as compared to conventional methods based on snippets.

Saori Kitahara, Koya Tamura, Kenji Hatano
A New Approach to Search Result Clustering and Labeling

Search engines present query results as a long ordered list of web snippets divided into several pages. Post-processing of retrieval results for easier access of desired information is an important research problem. In this paper, we present a novel search result clustering approach to split the long list of documents returned by search engines into meaningfully grouped and labeled clusters. Our method emphasizes clustering quality by using cover coefficient-based and sequential k-means clustering algorithms. A cluster labeling method based on term weighting is also introduced for reflecting cluster contents. In addition, we present a new metric that employs precision and recall to assess the success of cluster labeling. We adopt a comparative strategy to derive the relative performance of the proposed method with respect to two prominent search result clustering methods: Suffix Tree Clustering and Lingo. Experimental results in the publicly available AMBIENT and ODP-239 datasets show that our method can successfully achieve both clustering and labeling tasks.

Anil Turel, Fazli Can
Efficient Top-k Document Retrieval Using a Term-Document Binary Matrix

Current web search engines perform well for “navigational queries.” However, due to their use of simple conjunctive Boolean filters, such engines perform poorly for “informational queries.” Informational queries would be better handled by a web search engine using an information retrieval model along with a combination of enhancement techniques such as query expansion and relevance feedback, and the realization of such an engine requires a method to process the model efficiently. In this paper, we describe a novel extension of an existing top-k query processing technique. We add a simple data structure called a “term-document binary matrix,” resulting in more efficient evaluation of top-k queries even when the queries have been expanded. On the basis of an experimental evaluation using the TREC GOV2 data set and expanded versions of the evaluation queries attached to this data set, we show that the extended technique achieves significant performance gains over existing techniques.

Etsuro Fujita, Keizo Oyama
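
The gist of the data structure can be illustrated with a toy example (a sketch of the general idea, not the authors' implementation): a boolean term-document matrix lets the engine cheaply count, per document, how many of the expanded query's terms match, so that only documents passing a threshold reach the exact top-k scorer.

```python
import numpy as np

# Toy index: rows = terms, columns = documents (hypothetical data).
terms = {"ranking": 0, "retrieval": 1, "expansion": 2, "feedback": 3}
matrix = np.array([
    [1, 0, 1, 1, 0],   # ranking
    [1, 1, 0, 1, 0],   # retrieval
    [0, 1, 0, 1, 1],   # expansion
    [0, 0, 0, 1, 1],   # feedback
], dtype=bool)

def candidates(query_terms, min_match):
    """Documents containing at least min_match of the (expanded) query terms."""
    rows = [terms[t] for t in query_terms if t in terms]
    match_counts = matrix[rows].sum(axis=0)   # per-document match count
    return np.nonzero(match_counts >= min_match)[0]

# Expanded query: only documents matching >= 2 terms reach the exact scorer.
print(candidates(["ranking", "retrieval", "expansion"], min_match=2))  # -> [0 1 3]
```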

Machine Learning for Information Retrieval

Topic Analysis for Online Reviews with an Author-Experience-Object-Topic Model

In this paper, we propose a new probabilistic generative model for topic analysis of online reviews, called the Author-Experience-Object-Topic (AEOT) model. The model captures the relationships between authors, objects and reviews in order to improve the performance of topic analysis. As a general model, it can be transformed into six simpler models, and it produces topic-word, author-topic and object-topic distributions. Experimental results show that the model is suitable for topic analysis of online reviews and outperforms other existing methods.

Yong Zhang, Dong-Hong Ji, Ying Su, Po Hu
Predicting Query Performance Directly from Score Distributions

The task of predicting query performance has received much attention over the past decade. However, many of the frameworks and approaches to predicting query performance are more heuristic than principled. In this paper, we develop a principled framework, based on modelling the document score distribution, to predict query performance directly.

In particular, we (1) show how a standard performance measure (e.g. average precision) can be inferred from a document score distribution. We (2) develop techniques for query performance prediction (QPP) by automatically estimating the parameters of the document score distribution (i.e. mixture model) when relevance information is unknown. Therefore, the QPP approaches developed herein aim to estimate average precision directly. Finally, we (3) provide a detailed analysis of one of the QPP approaches that shows that only two parameters of the five-parameter mixture distribution are of practical importance.

Ronan Cummins
Wikipedia-Based Smoothing for Enhancing Text Clustering

Conventional algorithms for text clustering that are based on the bag-of-words model fail to fully capture the semantic relations between words. As a result, documents describing an identical topic may not be categorized into the same clusters if they use different sets of words. A generic solution for this issue is to utilize background knowledge to enrich the document contents. In this research, we adopt a language modeling approach to text clustering and propose to smooth the document language models using Wikipedia articles in order to enhance text clustering performance. The contents of Wikipedia articles as well as their assigned categories are used in three different ways to smooth the document language models, with the goal of enriching the document contents. Clustering is then performed on a document similarity graph constructed over the enhanced document collection. Experimental results confirm the effectiveness of the proposed methods.

Elahe Rahimtoroghi, Azadeh Shakery
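
One plausible form of such smoothing is a Jelinek-Mercer-style linear interpolation (a sketch under that assumption; the paper's three estimators may differ): the document language model is mixed with a model built from related Wikipedia articles and with the collection model before document similarities are computed.

```python
def smooth_with_wikipedia(doc_model, wiki_model, collection_model, alpha=0.5, beta=0.3):
    """Linear interpolation of document, Wikipedia, and collection language models.
    alpha + beta must be <= 1; the remaining mass goes to the collection model.
    All mixing weights here are illustrative assumptions."""
    vocab = set(doc_model) | set(wiki_model) | set(collection_model)
    return {
        w: alpha * doc_model.get(w, 0.0)
           + beta * wiki_model.get(w, 0.0)
           + (1 - alpha - beta) * collection_model.get(w, 0.0)
        for w in vocab
    }

# Toy models: smoothing pulls in "cat"/"habitat" mass from related Wikipedia text.
doc = {"jaguar": 0.6, "speed": 0.4}
wiki = {"jaguar": 0.3, "cat": 0.4, "habitat": 0.3}
coll = {"jaguar": 0.1, "speed": 0.2, "cat": 0.1, "habitat": 0.1, "car": 0.5}
smoothed = smooth_with_wikipedia(doc, wiki, coll)
print({w: round(p, 3) for w, p in sorted(smoothed.items())})
```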
ASVMFC: Adaptive Support Vector Machine Based Fuzzy Classifier

SVM (Support Vector Machine) and FNN (Fuzzy Neural Network) are popular techniques for pattern classification. SVM has excellent generalization performance, but this performance depends on appropriately determining its kernel function. FNN is equipped with human-like reasoning, but the learning algorithms used in most FNN classifiers only focus on minimizing empirical risk. In this paper, a new classifier called ASVMFC is offered that uses the capabilities of SVM and FNN together and does not have the mentioned disadvantages. In fact, ASVMFC is a fuzzy neural network whose parameters are adjusted using an SVM with an adaptive kernel function. ASVMFC uses a new clustering algorithm to make up its fuzzy rules. Moreover, an efficient sampling method is introduced in this paper that drastically reduces the number of training samples with only a very slight impact on the performance of ASVMFC. The experimental results illustrate that ASVMFC can achieve very good classification accuracy while generating only a few fuzzy rules.

Hamed Ganji, Shahram Khadivi
Ensemble Pruning for Text Categorization Based on Data Partitioning

Ensemble methods can improve effectiveness in text categorization, but their computation cost creates a need for pruning ensembles. In this work we study ensemble pruning based on data partitioning, using a rank-based pruning approach: base classifiers are ranked and pruned according to their accuracy on a separate validation set. We employ four data partitioning methods with four machine learning categorization algorithms. Our main aim is to examine ensemble pruning in text categorization. We conduct experiments on two text collections, Reuters-21578 and BilCat-TRT, and show that we can prune 90% of the ensemble members with almost no decrease in accuracy. We also demonstrate that ensemble pruning can increase the accuracy of traditional ensembling.

Cagri Toraman, Fazli Can
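
The rank-based pruning step itself is straightforward; a minimal sketch (the classifier and data representations are illustrative assumptions):

```python
def prune_ensemble(classifiers, validation_set, keep_fraction=0.1):
    """Rank base classifiers by validation accuracy and keep the top fraction."""
    def accuracy(clf):
        correct = sum(1 for x, y in validation_set if clf(x) == y)
        return correct / len(validation_set)
    ranked = sorted(classifiers, key=accuracy, reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]

def majority_vote(pruned, x):
    """Classify x by majority vote over the pruned ensemble."""
    votes = [clf(x) for clf in pruned]
    return max(set(votes), key=votes.count)

# Toy usage: classifiers are plain callables, label = clf(features).
clfs = [lambda x, t=t: int(x > t) for t in (0.2, 0.5, 0.9)]
val = [(0.1, 0), (0.3, 0), (0.6, 1), (0.8, 1)]
pruned = prune_ensemble(clfs, val, keep_fraction=0.4)  # keeps the best threshold
print(majority_vote(pruned, 0.7))  # -> 1
```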
Sentiment Analysis for Online Reviews Using an Author-Review-Object Model

In this paper, we propose a probabilistic generative model for online review sentiment analysis, called joint Author-Review-Object Model (ARO). The users, objects and reviews form a heterogeneous graph in online reviews. The ARO model focuses on utilizing the user-review-object graph to improve the traditional sentiment analysis. It detects the sentiment based on not only the review content but also the author and object information. Preliminary experimental results on three datasets show that the proposed model is an effective strategy for jointly considering the various factors for the sentiment analysis.

Yong Zhang, Dong-Hong Ji, Ying Su, Cheng Sun

Natural Language Processing for Information Retrieval

Semantic-Based Opinion Retrieval Using Predicate-Argument Structures and Subjective Adjectives

We present the results of our experiment on the use of predicate-argument structures containing subjective adjectives for semantic-based opinion retrieval. The approach exploits the grammatical tree derivation of sentences to expose the underlying meanings through the respective predicate-argument structures. The underlying meaning of each subjective sentence is then semantically compared with the underlying meaning of the query topic, given as a natural language sentence. Rather than using the frequency of opinion words or their proximity to query words, our solution is based on the frequency of semantically related subjective sentences. We formed a linear relevance model that uses explicit and implicit semantic similarities between the predicate-argument structures of subjective sentences and the given query topic. Thus, the technique ensures that the opinionated documents retrieved are not only subjective but also semantically relevant to the given query topic. Experimental results show that the technique improves the performance of the topical opinion retrieval task.

Sylvester Olubolu Orimaye, Saadat M. Alhashmi, Siew Eu-Gene
An Aspect-Driven Random Walk Model for Topic-Focused Multi-document Summarization

Recently, there has been increased interest in topic-focused multi-document summarization where the task is to produce automatic summaries in response to a given topic or specific information requested by the user. In this paper, we incorporate a deeper semantic analysis of the source documents to select important concepts by using a predefined list of important aspects that act as a guide for selecting the most relevant sentences into the summaries. We exploit these aspects and build a novel methodology for topic-focused multi-document summarization that operates on a Markov chain tuned to extract the most important sentences by following a random walk paradigm. Our evaluations suggest that the augmentation of important aspects with the random walk model can raise the summary quality over the random walk model up to 19.22%.

Yllias Chali, Sadid A. Hasan, Kaisar Imam
An Effective Approach for Topic-Specific Opinion Summarization

Topic-specific opinion summarization (TOS) plays an important role in helping users digest online opinions; it aims to extract a summary of the opinion expressions specified by a query, i.e. the topic-specific opinionated information (TOI). A fundamental problem in TOS is how to effectively represent the TOI of an opinion so that salient opinions can be summarized to meet the user’s preference. Existing approaches to TOS are either limited by the mismatch between topic-specific information and its corresponding opinionated information, or lack the ability to measure opinionated information associated with different topics, which in turn seriously affects performance. In this paper, we represent TOI by word pairs and propose a weighting scheme to measure word pairs. We then integrate word pairs into a random walk model for opinionated sentence ranking and adopt the MMR method for summarization. Experimental results show that salient opinion expressions are effectively weighted and that a significant improvement is achieved for TOS.

Binyang Li, Lanjun Zhou, Wei Gao, Kam-Fai Wong, Zhongyu Wei
A Model-Based EM Method for Topic Person Name Multi-polarization

In this paper, we propose an unsupervised approach for the multi-polarization of topic person names. We employ a model-based EM method to polarize individuals into positively correlated groups. In addition, we present off-topic block elimination and weighted correlation coefficient techniques to eliminate off-topic blocks and reduce the text sparseness problem, respectively. Our experimental results demonstrate that the proposed method can correctly identify the multi-polar person groups of topics.

Chien Chin Chen, Zhong-Yong Chen
Using Key Sentence to Improve Sentiment Classification

When predicting the polarity of a review, not all sentences are equally informative. In this paper, we divide a document into a key sentence and trivial sentences. The key sentence expresses the author’s overall view, while trivial sentences describe the details. To take full advantage of the differences and complementarity between the two kinds of sentences, we incorporate them into supervised and semi-supervised learning respectively. In supervised sentiment classification, a classifier combination approach is adopted; in semi-supervised sentiment classification, a co-training algorithm is proposed. Experiments carried out on eight domains show that our approach performs better than the baseline method and that the key sentence extraction is effective.

Zheng Lin, Songbo Tan, Xueqi Cheng
Using Concept-Level Random Walk Model and Global Inference Algorithm for Answer Summarization

Community Question Answering (cQA) archives contain rich sources of knowledge on extensive topics, in which the quality of submitted answers is uneven, ranging from excellent detailed answers to completely unrelated content. We propose a framework to generate complete, relevant, and trustworthy answer summaries. The framework casts answer summarization as a maximum coverage problem with a knapsack constraint at the concept level. A global inference algorithm is employed to extract sentences according to the saliency scores of concepts. The saliency score of each concept is assigned through a two-layer graph-based random walk model incorporating user social features and the text content of answers. The experiments are carried out on a data set from Yahoo! Answers. The results show that our method generates satisfying summaries and is superior to state-of-the-art approaches in performance.

Xiaoying Liu, Zhoujun Li, Xiaojian Zhao, Zhenggan Zhou
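
Maximum coverage with a knapsack (length) constraint is NP-hard, so it is commonly approximated greedily; the sketch below shows that generic step only (the concept saliency scores would come from the paper's two-layer random walk, which is not reproduced here).

```python
def greedy_summary(sentences, concept_scores, budget):
    """Greedily pick sentences maximizing newly covered concept weight per unit length.
    sentences: list of (text, set_of_concepts, length); budget: max total length."""
    covered, chosen, used = set(), [], 0
    while True:
        def gain(s):
            _, concepts, length = s
            new = sum(concept_scores.get(c, 0.0) for c in concepts - covered)
            return new / length if length else 0.0
        pool = [s for s in sentences if s not in chosen and used + s[2] <= budget]
        best = max(pool, key=gain, default=None)
        if best is None or gain(best) <= 0:
            break
        chosen.append(best)
        covered |= best[1]   # these concepts no longer contribute new gain
        used += best[2]
    return [text for text, _, _ in chosen]

# Toy concepts and scores (hypothetical values).
scores = {"install": 0.9, "configure": 0.7, "uninstall": 0.2}
sents = [
    ("Install it via pip.", {"install"}, 5),
    ("Install, then configure the path.", {"install", "configure"}, 8),
    ("Uninstall with pip as well.", {"uninstall"}, 6),
]
print(greedy_summary(sents, scores, budget=12))
```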
Acquisition of Know-How Information from Web

A variety of know-how, such as recipes and solutions to problems, is stored on the Web. However, it is not easy to find specific know-how information appropriately. If know-how could be appropriately detected, it would be much easier for us to know how to tackle unforeseen situations such as accidents and disasters. This paper proposes a promising method for acquiring know-how information from the Web. First, we extract passages containing at least one target object and then extract candidates for know-how from them. Then, passages containing know-how are discriminated from non-know-how information by considering each object and its typical usage.

Shunsuke Kozawa, Kiyotaka Uchimoto, Shigeki Matsubara
Topic Based Creation of a Persian-English Comparable Corpus

One of the most important issues in cross-language information retrieval (CLIR) is where to obtain the translation knowledge. Multilingual corpora are valuable resources for this purpose, but few studies have been done on constructing multilingual corpora for the Persian language. In this study, we propose a method to construct a Persian-English comparable corpus from two independent news collections, based on date and topic criteria. Unlike most existing methods, which use publication dates as the main basis for aligning documents, we also consider date-independent alignments: alignments based only on topics and concept similarities. In order to avoid low-quality alignments, we cluster the collections based on their topics prior to alignment, which allows us to align similar documents whose publication dates are distant. Evaluation results show the high quality of the constructed corpus and the possibility of extracting high-quality association knowledge from it for the task of CLIR.

Zahra Rahimi, Azadeh Shakery
A Web Knowledge Based Approach for Complex Question Answering

Current research on question answering concerns questions more complex than factoid ones. Although complex questions have been investigated by many researchers, how to acquire accurate answers remains a core problem for complex QA. In this paper, we propose an approach that estimates similarity with a topic model. After summarizing relevant texts from web knowledge bases, an answer sentence acquisition model based on Probabilistic Latent Semantic Analysis is introduced to seek sentences whose topics are similar to those in the definition set. Then, an answer ranking model is employed to select sentences that are both statistically and semantically similar between the retrieved sentences and the sentences in the relevant text set. Finally, sentences are ranked as answer candidates according to their scores. Experiments show that our approach achieves an increase of 5.19% in F-score over the baseline system.

Han Ren, Donghong Ji, Chong Teng, Jing Wan
Learning to Extract Coherent Keyphrases from Online News

Keyphrases extracted from news articles can be used to concisely represent the main content of news events. In this paper, we first present several criteria for high-quality news keyphrases. Then, in order to integrate those criteria into the keyphrase extraction task, we propose a novel formulation which converts the task into a learning to rank problem. Our approach involves two phases: selecting candidate keyphrases, and ranking all possible sub-permutations among the candidates. Three kinds of feature sets (lexical, locality, and coherence features) are introduced to rank the candidates, and the best sub-permutation then provides the keyphrases. The proposed method is evaluated on a multi-news collection, and experimental results verify that it is effective at extracting coherent news keyphrases.

Zhuoye Ding, Qi Zhang, Xuanjing Huang
Maintaining Passage Retrieval Information Need Using Analogical Reasoning in a Question Answering Task

In this paper we study whether a question and its answer can be related using analogical reasoning over various kinds of textual occurrences in a question answering (QA) task. We argue that, in a QA passage retrieval context, low-cost language features can contribute a positive influence to the representation of an information need that also appears in other passages with analogical features. We attempt to leverage this through query expansion and query stopwords-exchange strategies among analogical question-answer pairs, which are modeled by a Bayesian Analogical Reasoning framework. Our study using the ResPubliQA 2009 and 2010 datasets shows that the predicted analogical relation between question-answer pairs can be used to maintain the information need of the QA passage retrieval task, but performs poorly in determining the question type. Our best accuracy score was achieved by using the ‘bigram occurrences with stemming and TF-IDF weighting, completed with named entities’ feature set for the query expansion approach, and the ‘bigram occurrences with stemming and TF-IDF weighting’ feature set for the stopwords-exchange approach.

Hapnes Toba, Mirna Adriani, Ruli Manurung
Improving Document Summarization by Incorporating Social Contextual Information

We propose a collaborative approach to improve document summarization by incorporating social contextual information into the sentence ranking process. Both the relationships between sentences from document context and the preference information from user context are investigated in the approach. We validate our method on a social tagging dataset and experimentally demonstrate that by incorporating social contextual information it obtains significant improvement over several baseline methods.

Po Hu, Donghong Ji, Cheng Sun, Chong Teng, Yong Zhang
Automatic Classification of Link Polarity in Blog Entries

In this paper, we propose a method for classifying an author’s sentiment toward a linked blog (we call this sentiment the link polarity), as a first step toward finding authoritative blogs in the blogosphere. Generally, blogs that are linked positively from many other blogs are considered more reliable. In citing a blog entry, there are passages where the author describes his/her sentiments about the linked blog (which we call citing areas). We extract citing areas from Japanese blog entries automatically, and then classify the link polarity using the information in the citing areas. To investigate the effectiveness of our method, we conducted experiments. For the classification of link polarity, we obtained higher precision and recall than baseline methods. For the extraction of citing areas, we obtained the same precision and recall as manual extraction. From these experimental results, we confirmed the effectiveness of our methods.

Aya Ishino, Hidetsugu Nanba, Toshiyuki Takezawa
Feasibility Study for Procedural Knowledge Extraction in Biomedical Documents

We propose a method for extracting procedural knowledge, rather than declarative knowledge, from scientific documents, utilizing machine learning with deep language processing features, and we show how to model such knowledge. We present the representation of procedural knowledge in PubMed abstracts and provide experiments that are quite promising: 82%, 63%, 73%, and 70% performance for purpose/solution extraction (the two components of the procedural knowledge model), process entity identification, entity association, and relation identification between processes, respectively, even though we applied strict guidelines in evaluating the performance.

Sa-kwang Song, Yun-soo Choi, Heung-seon Oh, Sung-Hyon Myaeng, Sung-Pil Choi, Hong-Woo Chun, Chang-Hoo Jeong, Won-Kyung Sung

Arabic Script Text Processing and Retrieval

Small-Word Pronunciation Modeling for Arabic Speech Recognition: A Data-Driven Approach

Incorrect recognition of adjacent small words is considered one of the obstacles to improving the performance of automatic continuous speech recognition systems. The pronunciation variation in the phonemes of adjacent words introduces ambiguity into the triphones of the acoustic model and adds confusion to the speech recognition decoder. Small words are more likely to be affected by this ambiguity than longer words. In this paper, we present a data-driven approach to modeling the small-words problem. The proposed method identifies adjacent small words in the corpus transcription to generate compound words. The unique compound words are then added to the expanded pronunciation dictionary, as well as to the language model as new sentences. Results show a significant improvement of 2.16% in word error rate over the baseline on a speech corpus of Modern Standard Arabic broadcast news.

Dia AbuZeina, Wasfi Al-khatib, Moustafa Elshafei
The SALAH Project: Segmentation and Linguistic Analysis of ḥadīṯ Arabic Texts

A model for the unsupervised segmentation and linguistic analysis of Arabic texts of Prophetic tradition (ḥadīṯs), SALAH, is proposed. The model automatically segments each text unit into a transmitter chain (isnād) and a text content (matn), and further analyses each segment according to two distinct pipelines: a set of regular expressions chunks transmitter chains into a graph labeled with the relations between transmitters, while a tailored, augmented version of the AraMorph morphological analyzer (RAM) analyzes and annotates the text content lexically and morphologically. A graph of relations among transmitters and a lemmatized text corpus, both in XML format, are the final output of the system, which can further feed the automatic generation of concordances of the texts with variable-sized windows. The model's results can be useful for a variety of purposes, including retrieving information from ḥadīṯ texts, verifying the relations between transmitters, finding variant readings, and supplying lexical information to specialized dictionaries.

Marco Boella, Francesca Romana Romani, Anjela Al-Raies, Cristina Solimando, Giuliano Lancioni
Exploring Clustering for Multi-document Arabic Summarisation

In this paper we explore clustering for multi-document Arabic summarisation. For our evaluation we use an Arabic version of the DUC-2002 dataset that we previously generated using Google Translate. We explore how clustering (at the sentence level) can be applied to multi-document summarisation as well as for redundancy elimination within this process. We use different parameter settings including the cluster size and the selection model applied in the extractive summarisation process. The automatically generated summaries are evaluated using the ROUGE metric, as well as precision and recall. The results we achieve are compared with the top five systems in the DUC-2002 multi-document summarisation task.

Mahmoud El-Haj, Udo Kruschwitz, Chris Fox
ZamAn and Raqm: Extracting Temporal and Numerical Expressions in Arabic

In this paper we investigate the automatic identification of Arabic temporal and numerical expressions. The objectives of this paper are (1) to describe ZamAn, a machine learning method we have developed to label Arabic temporals, processing the functional dash-tag -TMP used in the Arabic treebank to mark a temporal modifier which represents a reference to a point in time or a span of time, and (2) to present Raqm, a machine learning method applied to identify different forms of numerical expressions in order to normalise them into digits.

We present a series of experiments evaluating how well ZamAn (resp. Raqm) copes with the enriched Arabic data, achieving state-of-the-art results of an F1-measure of 88.5% (resp. 96%) for bracketing and 73.1% (resp. 94.4%) for detection.

Iman Saleh, Lamia Tounsi, Josef van Genabith
Extracting Parallel Paragraphs and Sentences from English-Persian Translated Documents

The task of sentence and paragraph alignment is essential for preparing the parallel texts needed in applications such as machine translation. The lack of sufficient linguistic data for under-resourced languages like Persian is a challenging issue. In this paper, we propose a hybrid sentence and paragraph alignment model for Persian-English parallel documents based on simple linguistic features as well as length similarity between sentences and paragraphs of the source and target languages. We apply a small bilingual dictionary of Persian-English nouns, punctuation marks, and length similarity as alignment metrics. We combine these features in a linear model and use a genetic algorithm to learn the linear equation weights. Evaluation results show that the extracted features improve on the baseline model, which is length-based only.

Mohammad Sadegh Rasooli, Omid Kashefi, Behrouz Minaei-Bidgoli
Effect of ISRI Stemming on Similarity Measure for Arabic Document Clustering

Arabic document clustering has increasingly become an important task for obtaining good results in unsupervised learning. This paper evaluates the impact of five measures (cosine similarity, Jaccard coefficient, Pearson correlation, Euclidean distance and averaged Kullback-Leibler divergence) on document clustering of an Arabic dataset, under two types of pre-processing: morphology-based stemming with the Information Science Research Institute (ISRI) stemmer, which is equivalent to a root-based stemmer combined with a light stemmer, and no stemming (no morphology). Stemming is a computational process used to reduce words to their stems; for classification, it is categorised as a recall-enhancing or precision-enhancing component. It is concluded that ISRI stemming proves better than no stemming across the five similarity/distance measures for document clustering.

Qusay Walid Bsoul, Masnizah Mohd
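
For reference, here are the five measures in their usual vector-space forms (standard formulations, not taken from the paper; in particular, "averaged KL divergence" is rendered as the symmetrized divergence against the average distribution, which may differ from the authors' exact averaging):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def jaccard(a, b):
    """Extended Jaccard coefficient on weighted term vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sum(x * x for x in a) + sum(y * y for y in b) - dot)

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb) if sa and sb else 0.0

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def avg_kl(a, b, eps=1e-10):
    """Averaged KL divergence between two term distributions (via the mean distribution)."""
    m = [(x + y) / 2 for x, y in zip(a, b)]
    kl = lambda p, q: sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)
    return (kl(a, m) + kl(b, m)) / 2

# Toy term distributions for two documents.
d1, d2 = [0.5, 0.3, 0.2, 0.0], [0.4, 0.4, 0.1, 0.1]
for f in (cosine, jaccard, pearson, euclidean, avg_kl):
    print(f.__name__, round(f(d1, d2), 4))
```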
A Semi-supervised Approach for Key-Synset Extraction to Be Used in Word Sense Disambiguation

Nowadays, although much research is being done in the field of word sense disambiguation for languages like English, other languages like Persian still have much work remaining. There are some difficulties along the way that might have made the field less attractive to researchers. For example, the Persian WordNet, FarsNet, is newly developed and no sense-tagged corpus has yet been built on it. We therefore propose a semi-supervised approach for extending FarsNet with some new relations and then use it for WSD. A method to extract semantic keywords or key concepts from textual documents is also used. As the key concepts are extracted by exploiting FarsNet, we call them key-synsets. In fact, the key-synsets of a document are those synsets which are semantically related to the main subjects of that document. This method is exploited to improve the precision of the proposed WSD. Although our approach is tested on Persian, it can easily be adapted for other languages such as English.

Maryam Haghollahi, Mehrnoush Shamsfard
Mapping FarsNet to Suggested Upper Merged Ontology

FarsNet is a lexical ontology for the Persian language. SUMO is an important upper-level ontology that contains global knowledge. Mapping the lexical knowledge of FarsNet to the general knowledge of SUMO will be beneficial for Persian. Producing a mapping of FarsNet to SUMO began after the development of the first phase of FarsNet. Since we had mappings to WordNet for some FarsNet synsets, the mapping of FarsNet to SUMO was bootstrapped from the FarsNet-to-WordNet mapping. Obviously, there are some gaps between Persian and English, such as lexical gaps, so mapped SUMO concepts need not be obtained only through the WordNet mapping directly. Therefore, to cover the gaps, we take advantage of the hierarchy relations in FarsNet; hence the bias of our mapping toward English WordNet is reduced. In this paper we describe our semi-automatic mapping methodology.

Aynaz Taheri, Mehrnoush Shamsfard
Topic Detection and Multi-word Terms Extraction for Arabic Unvowelized Documents

This paper focuses on topic detection (TD) for Arabic unvowelized documents. Our topic detection system was implemented using two different metrics: an adapted TF-IDF and the Jaccard indicator. The experiments were conducted while studying the impact of working with stems or roots of words, and with all words or nouns only. To enhance the TD system, we developed a multi-word term (MWT) extraction prototype to generate MWT vocabularies. To the best of our knowledge, MWT vocabularies have never been used for topic detection in Arabic documents. In this paper we investigate the impact of such use on the quality of topic detection. We used the standard measures of recall, precision and F-measure to evaluate the performance of the realized systems on Wattan, an Arabic newspaper corpus.

Rim Koulali, Abdelouafi Meziane
Backmatter
Metadata
Title
Information Retrieval Technology
Edited by
Mohamed Vall Mohamed Salem
Khaled Shaalan
Farhad Oroumchian
Azadeh Shakery
Halim Khelalfa
Copyright year
2011
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-25631-8
Print ISBN
978-3-642-25630-1
DOI
https://doi.org/10.1007/978-3-642-25631-8
