2016 | Book

Information Retrieval Technology

12th Asia Information Retrieval Societies Conference, AIRS 2016, Beijing, China, November 30 – December 2, 2016, Proceedings

Edited by: Shaoping Ma, Ji-Rong Wen, Yiqun Liu, Zhicheng Dou, Min Zhang, Yi Chang, Xin Zhao

Publisher: Springer International Publishing

Book series: Lecture Notes in Computer Science

About this book

This book constitutes the refereed proceedings of the 12th Asia Information Retrieval Societies Conference, AIRS 2016, held in Beijing, China, in November/December 2016.

The 21 full papers presented together with 11 short papers were carefully reviewed and selected from 57 submissions. The final programme of AIRS 2016 is divided into the following tracks: IR models and theories; machine learning and data mining for IR; IR applications and user modeling; personalization and recommendation; and IR evaluation.

Table of Contents

Frontmatter

IR Models and Theories

Frontmatter
Modeling Relevance as a Function of Retrieval Rank

Batched evaluations in IR experiments are commonly built using relevance judgments formed over a sampled pool of documents. However, judgment coverage tends to be incomplete relative to the metrics being used to compute effectiveness, since collection size often makes it financially impractical to judge every document. As a result, a considerable body of work has arisen exploring the question of how to fairly compare systems in the face of unjudged documents. Here we consider the same problem from another perspective, and investigate the relationship between relevance likelihood and retrieval rank, seeking to identify plausible methods for estimating document relevance and hence computing an inferred gain. A range of models are fitted against two typical TREC datasets, and evaluated both in terms of their goodness of fit relative to the full set of known relevance judgments, and also in terms of their predictive ability when shallower initial pools are presumed, and extrapolated metric scores are computed based on models developed from those shallow pools.

Xiaolu Lu, Alistair Moffat, J. Shane Culpepper
The Effect of Score Standardisation on Topic Set Size Design

Given a topic-by-run score matrix from past data, topic set size design methods can help test collection builders determine the number of topics to create for a new test collection from a statistical viewpoint. In this study, we apply a recently-proposed score standardisation method called std-AB to score matrices before applying topic set size design, and demonstrate its advantages. For topic set size design, std-AB suppresses score variances and thereby enables test collection builders to consider realistic choices of topic set sizes, and to handle unnormalised measures in the same way as normalised measures. In addition, even discrete measures that clearly violate normality assumptions look more continuous after applying std-AB, which may make them more suitable for statistically motivated topic set size design. Our experiments cover a variety of tasks and evaluation measures from NTCIR-12.

Tetsuya Sakai
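
The abstract above does not spell out how std-AB works. As a rough illustration of what per-topic score standardisation can look like, the Python sketch below z-scores one topic's scores across runs and then applies a linear map A*z + B clipped to [0, 1]. The function name, the constants A = 0.15 and B = 0.5, the use of the sample standard deviation, and the toy scores are all assumptions made for illustration; they are not taken from the abstract.

```python
import statistics

def standardise_topic(scores, a=0.15, b=0.5):
    """Standardise one topic's scores across runs (hypothetical helper).

    Each score is z-scored against the topic mean and sample standard
    deviation, mapped linearly to a*z + b, and clipped to [0, 1].
    The constants a and b are illustrative assumptions.
    """
    if len(scores) < 2:
        # Not enough runs to estimate a spread; fall back to the midpoint.
        return [b for _ in scores]
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)
    if sd == 0:
        # All runs scored identically on this topic.
        return [b for _ in scores]
    return [min(1.0, max(0.0, a * (s - mean) / sd + b)) for s in scores]

# Toy scores of three runs on a single topic (made-up values).
toy_scores = [0.2, 0.4, 0.6]
standardised = standardise_topic(toy_scores)
```

Because every topic ends up centred on the same value with the same spread, an easy topic and a hard topic contribute on a comparable scale, which is the general intuition behind the variance reduction the abstract mentions.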
Incorporating Semantic Knowledge into Latent Matching Model in Search

The relevance between a query and a document in search can be represented as matching degree between the two objects. Latent space models have been proven to be effective for the task, which are often trained with click-through data. One technical challenge with the approach is that it is hard to train a model for tail queries and tail documents for which there are not enough clicks. In this paper, we propose to address the challenge by learning a latent matching model, using not only click-through data but also semantic knowledge. The semantic knowledge can be categories of queries and documents as well as synonyms of words, manually or automatically created. Specifically, we incorporate semantic knowledge into the objective function by including regularization terms. We develop two methods to solve the learning task on the basis of coordinate descent and gradient descent respectively, which can be employed in different settings. Experimental results on two datasets from an app search engine demonstrate that our model can make effective use of semantic knowledge, and thus can significantly enhance the accuracies of latent matching models, particularly for tail queries.

Shuxin Wang, Xin Jiang, Hang Li, Jun Xu, Bin Wang
Keyqueries for Clustering and Labeling

In this paper we revisit the document clustering problem from an information retrieval perspective. The idea is to use queries as features in the clustering process that finally also serve as descriptive cluster labels “for free.” Our novel perspective includes query constraints for clustering and cluster labeling that ensure consistency with a keyword-based reference search engine. Our approach combines different methods in a three-step pipeline. Overall, a query-constrained variant of k-means using noun phrase queries against an ESA-based search engine performs best. In the evaluation, we introduce a soft clustering measure as well as a freely available extended version of the Ambient dataset. We compare our approach to two often-used baselines, descriptive k-means and k-means plus $\chi^2$. While the derived clusters are of comparable high quality, the evaluation of the corresponding cluster labels reveals a great diversity in the explanatory power. In a user study with 49 participants, the labels generated by our approach are of significantly higher discriminative power, leading to an increased human separability of the computed clusters.

Tim Gollub, Matthias Busse, Benno Stein, Matthias Hagen
A Comparative Study of Answer-Contained Snippets and Traditional Snippets

Almost every text search engine uses snippets to help users quickly assess the relevance of retrieved items in the ranked list. Although answer-contained snippets can help to improve the effectiveness of search intuitively, quantitative study of such intuition remains untouched. In this paper, we first propose a simple answer-contained snippet method for community-based Question and Answer (cQA) search, and then compare our method with the state-of-the-art traditional snippet algorithms. The experimental results show that the answer-contained snippet method significantly outperforms the state-of-the-art traditional methods, considering relevance judgements and information satisfaction evaluations.

Xian-Ling Mao, Dan Wang, Yi-Jing Hao, Wenqing Yuan, Heyan Huang
Local Community Detection via Edge Weighting

Local community detection aims at discovering a community from a seed node by maximizing a given goodness metric. This problem has attracted a lot of attention, and various goodness metrics have been proposed in recent years. However, most existing approaches are based on the assumption that either nodes or edges in a network have equal weight. In fact, using the weights of both nodes and edges in a network can somewhat enhance the algorithmic accuracy. In this paper, we propose a novel approach for local community detection via edge weighting. In detail, we first design a new node similarity measure with full consideration of adjacent nodes’ weights. We next develop an edge weighting method based on this similarity measure. Then, we define a new goodness metric to quantify the quality of a local community by integrating the edge weights. In our algorithm, we discover a local community by giving priority to the shell node that has maximal similarity with the current local community. We evaluate the proposed algorithm on both synthetic and real-world networks. The results of our experiment demonstrate that our algorithm is highly effective at local community detection compared to related algorithms.

Weiji Zhao, Fengbin Zhang, Jinglian Liu

Machine Learning and Data Mining for IR

Frontmatter
Learning a Semantic Space of Web Search via Session Data

In Web search, a user first comes up with an information need and issues an initial query. Then some retrieved URLs are clicked and other queries are issued if he/she is not satisfied. We advocate that Web search is governed by a hidden semantic space, and each involved element such as query and URL has its projection, i.e., as a vector, in this space. Each of above actions in the search procedure, i.e. issuing queries or clicking URLs, is an interaction result of those elements in the space. In this paper, we aim at uncovering such a semantic space of Web search that uniformly captures the hidden semantics of search queries, URLs and other elements. We propose session2vec and session2vec+ models to learn vectors in the space with search session data, where a search session is regarded as an instantiation of an information need and keeps the interaction information of queries and URLs. Vector learning is done on a large query log from a search engine, and the efficacy of learnt vectors is examined in a few tasks.

Lidong Bing, Zheng-Yu Niu, Wai Lam, Haifeng Wang
TLINE: Scalable Transductive Network Embedding

Network embedding is a classical task which aims to project a network into a low-dimensional space. Currently, most existing embedding methods are unsupervised algorithms, which ignore the useful label information. In this paper, we propose TLINE, a semi-supervised extension of LINE algorithm. TLINE is a transductive network embedding method, which optimizes the loss function of LINE to preserve both local and global network structure information, and applies SVM to maximize the margin between the labeled nodes of different classes. By applying the edge-sampling and the negative sampling techniques in the optimizing process, the computational complexity of TLINE is reduced. Thus TLINE can handle the large-scale network. To evaluate the performance in node classification task, we test our methods on two real world network datasets, which are Citeseer and DBLP. The experimental result indicates that TLINE outperforms state-of-the-art baselines and is suitable for large-scale networks.

Xia Zhang, Weizheng Chen, Hongfei Yan
Detecting Synonymous Predicates from Online Encyclopedia with Rich Features

The integration of Linked Open Data faces great challenges on the semantic level, despite unified data models. Inappropriate use of ontology concepts, namely predicates, impedes knowledge discovery. Although predicate unification is one of the most crucial steps when building a structured knowledge base, little effort has been devoted to it. In this paper, we propose a supervised approach to detect synonymous predicates. Our detection focuses on feature selection and the analysis of feature effectiveness. We not only leverage different resources such as Wikipedia and Freebase, but also use different word embeddings to represent predicates. The experimental results indicate that wikitext defined by Wikipedia and predicate surface form are the most useful features.

Zhe Han, Yansong Feng, Dongyan Zhao

IR Applications and User Modeling

Frontmatter
Patent Retrieval Based on Multiple Information Resources

Query expansion methods have been proven effective in improving the average performance of patent retrieval, and most query expansion methods use a single source of information for query expansion term selection. In this paper, we propose a method which exploits external resources for improving patent retrieval. The Google search engine and the Derwent World Patents Index were used as external resources to enhance the performance of query expansion methods. LambdaRank was employed to improve patent retrieval performance by combining different query expansion methods with different text field weighting strategies of different resources. Experiments on TREC data sets showed that our combination of multiple information sources for query formulation was more effective than using any single source to improve patent retrieval performance.

Kan Xu, Hongfei Lin, Yuan Lin, Bo Xu, Liang Yang, Shaowu Zhang
Simulating Ideal and Average Users

We propose a framework for deterministic simulation of user behavior that allows analyzing the cost-gain-based performance on single result lists or whole search sessions. The ideal user representing optimal behavior (i.e., most gain with lowest effort) is contrasted with more “average” users that employ the spreading activation model from cognitive theory. On TREC Session Track data, the ideal user achieves about double the gain of real users at the same costs, while the average gain of our different simulated users correlates well with the session-DCG metric, which is another argument for that metric in session-based evaluation.

Matthias Hagen, Maximilian Michel, Benno Stein
Constraining Word Embeddings by Prior Knowledge – Application to Medical Information Retrieval

Word embedding has been used in many NLP tasks and has shown some capability to capture semantic features. It has also been used in several recent studies in IR. However, word embeddings trained in an unsupervised manner may fail to capture some of the semantic relations in a specific area (e.g. healthcare). In this paper, we leverage the existing knowledge (word relations) in the medical domain to constrain word embeddings using the principle that related words should have similar embeddings. The resulting constrained word embeddings are used to rerank documents, showing superior effectiveness to unsupervised word embeddings.

Xiaojie Liu, Jian-Yun Nie, Alessandro Sordoni
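
The principle stated in the abstract above, that related words should have similar embeddings, can be pictured as a penalty term on the distance between the vectors of related word pairs. The Python sketch below is only one possible reading: the squared-distance penalty, the function names, the step size, and the toy vectors are assumptions, and the abstract does not say which objective the authors actually use.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def relation_penalty(embeddings, related_pairs, weight=1.0):
    """Sum of squared distances over related word pairs (hypothetical constraint term)."""
    return weight * sum(
        squared_distance(embeddings[a], embeddings[b]) for a, b in related_pairs
    )

def pull_together(embeddings, related_pairs, step=0.1):
    """One gradient-descent step on the penalty alone, returning new vectors."""
    updated = {word: list(vec) for word, vec in embeddings.items()}
    for a, b in related_pairs:
        for i, (x, y) in enumerate(zip(embeddings[a], embeddings[b])):
            grad = 2 * (x - y)           # derivative of (x - y)^2 with respect to x
            updated[a][i] -= step * grad
            updated[b][i] += step * grad  # derivative with respect to y has the opposite sign
    return updated

# Toy two-dimensional embeddings and one made-up medical relation.
toy_embeddings = {
    "heart attack": [1.0, 0.0],
    "myocardial infarction": [0.0, 1.0],
    "aspirin": [0.5, 0.5],
}
toy_pairs = [("heart attack", "myocardial infarction")]

penalty_before = relation_penalty(toy_embeddings, toy_pairs)
penalty_after = relation_penalty(pull_together(toy_embeddings, toy_pairs), toy_pairs)
```

In a full model this penalty would be added to the usual unsupervised training objective, so that the vectors stay useful for prediction while related terms drift closer together.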

Personalization and Recommendation

Frontmatter
Use of Microblog Behavior Data in a Language Modeling Framework to Enhance Web Search Personalization

Diversity in users’ information needs has been effectively dealt with through personalized Web search systems whereby a user’s interests and preferences are taken into account within the retrieval model. A significant component of any Web search personalization model is the means with which to model a user’s interests and preferences to build what is termed as a user profile. This work explores the use of the Twitter microblog network as a source of user profile construction for Web search personalization. We propose a statistical language modeling approach taking into account various aspects of a user’s behavior on the Twitter network (such as Twitterers followed, mentioned and retweeted). The model also incorporates network and topical similarity measures which enables the model to be a better representation of the user’s profile. The richness of the Web search personalization model leads to significant performance improvements in retrieval accuracy.

Arjumand Younus
A Joint Framework for Collaborative Filtering and Metric Learning

We have developed a framework for jointly conducting collaborative filtering and distance metric learning based on regularized singular value decomposition (RSVD), which discovers the user matrix and item matrix in the low rank space. Our approach is able to solve RSVD and simultaneously learn the parameters of Mahalanobis distance considering the ratings given by similar users and dissimilar users. One characteristic of our approach is that the learned model can be effectively applied to rating prediction and other relevant applications such as trust prediction, resulting in a solution which is coherent and optimal to both tasks. Another characteristic is that social community information and similarity information can be easily considered in our framework. We have conducted extensive experiments on rating prediction using real-world datasets to evaluate our framework. We have also compared our framework with other existing works to illustrate the effectiveness. Experimental results show that our framework achieves a promising prediction performance and outperforms the existing works.

Tak-Lam Wong, Wai Lam, Haoran Xie, Fu Lee Wang
Scrutinizing Mobile App Recommendation: Identifying Important App-Related Indicators

Among several traditional and novel mobile app recommender techniques that utilize a diverse set of app-related features (such as an app’s Twitter followers, various version instances, etc.), which app-related features are the most important indicators for app recommendation? In this paper, we develop a hybrid app recommender framework that integrates a variety of app-related features and recommendation techniques, and then identify the most important indicators for the app recommendation task. Our results reveal an interesting correlation with data from third-party app analytics companies; and suggest that, in the context of mobile app recommendation, more focus could be placed in user and trend analysis via social networks.

Jovian Lin, Kazunari Sugiyama, Min-Yen Kan, Tat-Seng Chua
User Model Enrichment for Venue Recommendation

An important task in recommender systems is suggesting relevant venues in a city to a user. These suggestions are usually created by exploiting the user’s history of preferences, which are, for example, collected in previously visited cities. In this paper, we first introduce a user model based on venues’ categories and their descriptive keywords extracted from Foursquare tips. Then, we propose an enriched user model which leverages the users’ reviews from Yelp. Our participation in the TREC 2015 Contextual Suggestion track confirmed that our model outperforms other approaches by a significant margin.

Mohammad Aliannejadi, Ida Mele, Fabio Crestani
Learning Distributed Representations for Recommender Systems with a Network Embedding Approach

In this paper, we present a novel perspective to address recommendation tasks by utilizing network representation learning techniques. Our idea is based on the observation that the input of typical recommendation tasks can be formulated as graphs. Thus, we propose to use the k-partite adoption graph to characterize various kinds of information in recommendation tasks. Once the historical adoption records have been transformed into a graph, we can apply the network embedding approach to learn vertex embeddings on the k-partite adoption network. Embeddings for different kinds of information are projected into the same latent space, where we can easily measure the relatedness between multiple vertices on the graph using some similarity measurements. In this way, the recommendation task is cast as a similarity evaluation process using embedding vectors. The proposed approach is both general and scalable. To evaluate the effectiveness of the proposed approach, we conduct extensive experiments on two different recommendation tasks using real-world datasets. The experimental results show the superiority of our approach. To the best of our knowledge, it is the first time that a network representation learning approach has been applied to recommendation tasks.

Wayne Xin Zhao, Jin Huang, Ji-Rong Wen
Factorizing Sequential and Historical Purchase Data for Basket Recommendation

Basket recommendation is an important task in market basket analysis. Existing work on this problem can be summarized into two paradigms. One is the item-centric paradigm, where sequential patterns are mined from users’ transactional data and leveraged for prediction. However, these approaches usually suffer from the data sparseness problem. The other is the user-centric paradigm, where collaborative filtering techniques have been applied on users’ historical data. However, these methods ignore the sequential behaviors of users, which are often crucial for basket recommendation. In this paper, we introduce a hybrid method, namely the Co-Factorization model over Sequential and Historical purchase data (CFSH for short) for basket recommendation. Compared with existing methods, our approach enjoys the following merits: (1) By mining and factorizing global sequential patterns, we can avoid the sparseness problem in traditional item-centric methods; (2) By factorizing item-item and user-item matrices simultaneously, we can exploit both sequential and historical behaviors to learn user and item representations better; (3) Experimental results on three real-world transaction datasets demonstrated the effectiveness of our approach as compared with the existing methods.

Pengfei Wang, Jiafeng Guo, Yanyan Lan, Jun Xu, Xueqi Cheng

IR Evaluation

Frontmatter
Search Success Evaluation with Translation Model

Evaluation plays an essential role in Information Retrieval (IR) research. Existing Web search evaluation methodologies usually come in two forms: offline and online methods. The benchmarks generated by offline methods (e.g. Cranfield-like ones) can be easily reused. However, the evaluation metrics in these methods are usually based on various user behavior assumptions (e.g. the Cascade assumption) and may not accord well with actual user behaviors. Online methods, in contrast, can capture users’ actual preferences well, but the results are not usually reusable. In this paper, we focus on the evaluation problem where users are using search engines to finish complex tasks. These tasks usually involve multiple queries in a single search session and pose challenges to both offline and online evaluation methodologies. To tackle this problem, we propose a search success evaluation framework based on a machine translation model. In this framework, we formulate the search success evaluation problem as a machine translation evaluation problem: the ideal search outcome (i.e. the information necessary to finish the task) is considered as the reference, while the search outcome from individual users (i.e. content that is perceived by users) is considered as the translation. Thus, we adopt BLEU, a long-standing machine translation evaluation metric, to evaluate the success of searchers. This framework avoids the introduction of possibly unreliable behavior assumptions and is reusable as well. We also tried a number of automatic methods which aim to minimize assessors’ efforts based on search interaction behavior such as eye-tracking and click-through. Experimental results indicate that the proposed evaluation method correlates well with explicit feedback on search satisfaction from search users. It is also suitable for search success evaluation when there is a need for quick or frequent evaluations.

Cheng Luo, Yiqun Liu, Min Zhang, Shaoping Ma
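
To make the framing in the abstract above concrete, the Python sketch below scores a user's perceived content (the "translation") against the ideal search outcome (the "reference") with a simplified, single-reference BLEU: clipped n-gram precisions up to bigrams, their geometric mean, and a brevity penalty. The whitespace tokenisation, the choice of max_n = 2, the absence of smoothing, and the toy sentences are assumptions; the paper's actual configuration is not given in the abstract.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n):
    """Share of candidate n-grams found in the reference, with counts clipped."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total

def bleu(candidate, reference, max_n=2):
    """Simplified single-reference BLEU without smoothing."""
    precisions = [clipped_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty: punish outcomes shorter than the reference.
    if len(candidate) >= len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(candidate))
    return brevity * math.exp(log_mean)

# Toy example: ideal outcome versus what one user actually took away.
ideal_outcome = "the capital of france is paris".split()
user_outcome = "paris is the capital".split()

full_score = bleu(ideal_outcome, ideal_outcome)
partial_score = bleu(user_outcome, ideal_outcome)
```

A user who recovers everything in the ideal outcome scores 1, while a user who captures only part of it, or captures it in fragments, scores lower, which is the sense in which the metric stands in for search success.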
Evaluating the Social Acceptability of Voice Based Smartwatch Search

There has been a recent increase in the number of wearable (e.g. smartwatch, interactive glasses, etc.) devices available. Coupled with this there has been a surge in the number of searches that occur on mobile devices. Given these trends it is inevitable that search will become a part of wearable interaction. Given the form factor and display capabilities of wearables this will probably require a different type of search interaction to what is currently used in mobile search. This paper presents the results of a user study focusing on users’ perceptions of the use of smartwatches for search. We pay particular attention to social acceptability of different search scenarios, focussing on input method, device form and information need. Our findings indicate that audience and location heavily influence whether people will perform a voice based search. The results will help search system developers to support search on smartwatches.

Christos Efthymiou, Martin Halvey
How Precise Does Document Scoring Need to Be?

We explore the implications of tied scores arising in the document similarity scoring regimes that are used when queries are processed in a retrieval engine. Our investigation has two parts: first, we evaluate past TREC runs to determine the prevalence and impact of tied scores, to understand the alternative treatments that might be used to handle them; and second, we explore the implications of what might be thought of as “deliberate” tied scores, in order to allow for faster search. In the first part of our investigation we show that while tied scores had the potential to be disruptive to TREC evaluations, in practice their effect was relatively minor. The second part of our exploration helps understand why that was so, and shows that quite marked levels of score rounding can be tolerated, without greatly affecting the ability to compare between systems. The latter finding offers the potential for approximate scoring regimes that provide faster query processing with little or no loss of effectiveness.

Ziying Yang, Alistair Moffat, Andrew Turpin
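
The "deliberate" ties described in the abstract above arise when document scores are stored or compared at reduced precision. The Python sketch below rounds a handful of scores and counts how many documents end up sharing a score with another document. The helper names and the score values are hypothetical and are not drawn from any TREC run.

```python
from collections import Counter

def round_scores(scores, decimals):
    """Round each document score to a fixed number of decimal places."""
    return [round(s, decimals) for s in scores]

def count_tied_documents(scores):
    """Number of documents whose score is shared with at least one other document."""
    frequency = Counter(scores)
    return sum(n for n in frequency.values() if n > 1)

# Made-up similarity scores for six documents returned for one query.
toy_scores = [12.3456, 12.3449, 11.9001, 11.8999, 9.5, 7.3]

exact_ties = count_tied_documents(toy_scores)
coarse_ties = count_tied_documents(round_scores(toy_scores, 1))
```

With full precision none of these toy documents tie; after rounding to one decimal place, four of the six fall into tied pairs. Whether such ties change a system comparison then depends on how the evaluation breaks them, which is the question the paper studies.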

Short Paper

Frontmatter
Noise Correction in Pairwise Document Preferences for Learning to Rank

This paper proposes a way of correcting noise in the training data for Learning to Rank. It is natural to assume that some level of noise might seep in during the process of producing query-document relevance labels by human evaluators. These relevance labels, which act as gold standard training data for Learning to Rank, can adversely affect the efficiency of the learning algorithm if they contain errors. Hence, an automated way of reducing noise can be of great advantage. The focus in this paper is on noise correction for pairwise document preferences which are used for pairwise Learning to Rank algorithms. The approach relies on representing pairwise document preferences in an intermediate feature space on which an ensemble learning based approach is applied to identify and correct the errors. Up to 90% of the errors in the pairwise preferences could be corrected at statistically significant levels by using this approach, which is robust enough to operate even at high levels of noise.

Harsh Trivedi, Prasenjit Majumder
Table Topic Models for Hidden Unit Estimation

We propose a method to estimate hidden units of numbers written in tables. We focus on Wikipedia tables and propose an algorithm to estimate which units are appropriate for a given cell that has a number but no unit words. We try to estimate such hidden units using surrounding contexts such as a cell in the first row. To improve the performance, we propose the table topic model that can model tables and surrounding sentences simultaneously.

Minoru Yoshida, Kazuyuki Matsumoto, Kenji Kita
Query Subtopic Mining Exploiting Word Embedding for Search Result Diversification

Understanding the users’ search intents through mining query subtopics is a challenging task and a prerequisite step for search diversification. This paper proposes mining query subtopics by exploiting word embedding and a short-text similarity measure. We extract candidate subtopics from multiple sources and introduce a new way of ranking based on a new novelty estimation that faithfully represents the possible search intents of the query. To estimate the subtopic relevance, we introduce new semantic features based on word embedding and bipartite graph based ranking. To estimate the novelty of a subtopic, we propose a method that combines the contextual and categorical similarities. Experimental results on NTCIR subtopic mining datasets show that our proposed approach outperforms the baselines, known previous methods, and the official participants of the subtopic mining tasks.

Md Zia Ullah, Md Shajalal, Abu Nowshed Chy, Masaki Aono
Assessing the Authors of Online Books in Digital Libraries Using Users Affinity

The quality of information generated by crowd-sourcing platforms is a major concern. Incomplete or inaccurate user-generated data prevent truly comprehensive analysis and might lead to inaccurate reports and forecasts. In this paper, we address the problem of assessing the authors of user-generated published books in digital libraries. We propose to model the platform using a heterogeneous graph representation and to exploit both the users’ interests and the natural inter-user affinities to infer the authors of unlabelled books. We formalize the task as an optimization problem and integrate in the objective a consistency prior associated with the networked users in order to capture the neighbors’ interests. Experiments conducted over the Babelio platform (http://babelio.com/), a French crowd-sourcing website for book lovers, achieved successful results and confirm the interest of considering an affinity-based regularization term.

B. de La Robertie
Reformulate or Quit: Predicting User Abandonment in Ideal Sessions

We present a comparison of different types of features for predicting session abandonment. We show that under ideal conditions for identifying topical sessions, the best features are those related to user actions and document relevance, while features related to query/document similarity actually hurt abandonment prediction.

Mustafa Zengin, Ben Carterette
Learning to Rank with Likelihood Loss Functions

According to a given query in training set, the documents can be grouped based on their relevance judgments. If the group with higher relevance labels is in front of the one with lower relevance judgments, the ranking performance of ranking model could be perfect. Inspired by this idea, we propose a novel machine learning framework for ranking, which depends on two new samples. The first sample is one-group constructed of one document with higher relevance judgment and a group of documents with lower relevance judgment; the second sample is group-group constructed of a group of documents with higher relevance judgment and a group of documents with lower relevance judgment. We also develop a novel preference-weighted loss function for multiple relevance judgment data sets. Finally, we optimize the group ranking approaches by optimizing initial ranking list for likelihood loss function. Experimental results show that our approaches are effective in improving ranking performance.

Yuan Lin, Liang Yang, Bo Xu, Hongfei Lin, Kan Xu
Learning to Improve Affinity Ranking for Diversity Search

Search diversification plays an important role in modern search engines, especially when user-issued queries are ambiguous and the top ranked results are redundant. Some diversity search approaches have been proposed for reducing the information redundancy of the retrieved results, but they do not consider topic coverage maximization. To solve this problem, the Affinity ranking model has been developed, aiming at maximizing the topic coverage while reducing the information redundancy. However, the original model does not involve a learning algorithm for parameter tuning, which limits the performance optimization. In order to further improve the diversity performance of the Affinity ranking model, inspired by its ranking principle, we propose a learning approach based on the learning-to-rank framework. Our learning model not only considers topic coverage maximization and redundancy reduction by formalizing a series of features, but also optimizes the diversity metric by extending a well-known learning-to-rank algorithm, LambdaMART. Comparative experiments have been conducted on TREC diversity tracks, which show the effectiveness of our model.

Yue Wu, Jingfei Li, Peng Zhang, Dawei Song
An In-Depth Study of Implicit Search Result Diversification

In this paper, we present a novel Integer Linear Programming formulation (termed ILP4ID) for implicit search result diversification (SRD). The advantage is that the exact solution can be achieved, which enables us to investigate to what extent using the greedy strategy affects the performance of implicit SRD. Specifically, a series of experiments are conducted to empirically compare the state-of-the-art methods with the proposed approach. The experimental results show that: (1) The factors, such as different initial runs and the number of input documents, greatly affect the performance of diversification models. (2) ILP4ID can achieve substantially improved performance over the state-of-the-art methods in terms of standard diversity metrics.

Hai-Tao Yu, Adam Jatowt, Roi Blanco, Hideo Joho, Joemon Jose, Long Chen, Fajie Yuan
Predicting Information Diffusion in Social Networks with Users’ Social Roles and Topic Interests

In this paper, we propose an approach, Role and Topic aware Independent Cascade (RTIC), to uncover information diffusion in social networks, which extracts the opinion leaders and structural hole spanners and analyzes the users’ interests on specific topics. Experiments conducted on three real datasets show that our approach achieves substantial improvement with only limited features compared with previous methods.

Xiaoxuan Ren, Yan Zhang
When MetaMap Meets Social Media in Healthcare: Are the Word Labels Correct?

Health forums have gained attention from researchers for studying various topics on healthcare. In many of these studies, identifying biomedical words by using MetaMap is often a pre-processing step. MetaMap is a popular tool for recognizing Unified Medical Language System (UMLS) concepts in free text. However, MetaMap favors identifying terminology used by professionals rather than the lay terms used by common users. The word labels given by MetaMap on social media may not be accurate, and may adversely affect subsequent studies. In this study, we manually annotate the correctness of medical words extracted by MetaMap from 100 posts in HealthBoards and get a precision of 43.75%. We argue that directly applying MetaMap on social media data in healthcare may not be a good choice for identifying the medical words.

Hongkui Tu, Zongyang Ma, Aixin Sun, Xiaodong Wang
Evaluation with Confusable Ground Truth

Subjective judgment with human rating has been an important way of constructing ground truth for evaluation in research areas including information retrieval. Researchers aggregate the ratings of an instance into a single score by statistical measures or label aggregation methods to evaluate the proposed approaches and baselines. However, the rating distributions of instances are diverse even if the aggregated scores are the same. We define the term confusability, which represents how confusable the reviewers are on the instances. In an exploratory study, we find that confusability has a prominent influence on the evaluation results. We thus propose a novel evaluation solution with several effective confusability measures and confusability-aware evaluation methods. They can be used as a supplement to existing rating aggregation methods and evaluation methods.

Jiyi Li, Masatoshi Yoshikawa
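
The observation in the abstract above, that instances with the same aggregated score can have very different rating distributions, is easy to reproduce. The Python sketch below uses the Shannon entropy of the rating distribution as a stand-in for confusability. That choice, the function names, and the toy ratings are assumptions for illustration; the abstract does not name the measures the authors propose.

```python
import math
from collections import Counter

def mean_rating(ratings):
    """Aggregate an instance's ratings into a single score by averaging."""
    return sum(ratings) / len(ratings)

def rating_entropy(ratings):
    """Shannon entropy (in bits) of the rating distribution (hypothetical confusability proxy)."""
    counts = Counter(ratings)
    total = len(ratings)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Two made-up instances rated by four reviewers on a 1-5 scale.
agreed_ratings = [3, 3, 3, 3]
split_ratings = [1, 5, 1, 5]

same_mean = mean_rating(agreed_ratings) == mean_rating(split_ratings)
agreed_entropy = rating_entropy(agreed_ratings)
split_entropy = rating_entropy(split_ratings)
```

Both toy instances average to 3, yet the reviewers fully agree on the first and are evenly split on the second, so an evaluation that looks only at the mean treats them as identical ground truth.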
Backmatter
Metadata
Title
Information Retrieval Technology
Edited by
Shaoping Ma
Ji-Rong Wen
Yiqun Liu
Zhicheng Dou
Min Zhang
Yi Chang
Xin Zhao
Copyright year
2016
Electronic ISBN
978-3-319-48051-0
Print ISBN
978-3-319-48050-3
DOI
https://doi.org/10.1007/978-3-319-48051-0