
2017 | Book

Information Retrieval

23rd China conference, CCIR 2017, Shanghai, China, July 13–14, 2017, Proceedings


About this book

This book constitutes the refereed proceedings of the 23rd China Conference on Information Retrieval, CCIR 2017, held in Shanghai, China, in July 2017. The 21 full papers presented were carefully reviewed and selected from 41 submissions. The papers are organized in topical sections: recommendation; understanding users; NLP for IR; IR and applications; query processing and analysis.

Table of Contents

Frontmatter

Recommendation

Frontmatter
Neural or Statistical: An Empirical Study on Language Models for Chinese Input Recommendation on Mobile
Abstract
Chinese input recommendation plays an important role in reducing the human cost of typing Chinese words, especially in mobile applications. The fundamental problem is to predict the conditional probability of the next word given the sequence of previous words. Therefore, statistical language models, i.e., n-gram based models, have been used extensively for this task in real applications. However, extremely varied typing behaviors usually lead to a serious sparsity problem, under which even n-gram models with smoothing fail. A reasonable approach to this problem is to use recently proposed neural models, such as the probabilistic neural language model, recurrent neural networks and word2vec, which can leverage semantically similar words when estimating the probability. However, there is no conclusion on which of the two approaches works better in real applications. In this paper, we conduct an extensive empirical study to show the differences between statistical and neural language models. The experimental results show that the two approaches have individual advantages, and that a hybrid approach brings a significant improvement.
Hainan Zhang, Yanyan Lan, Jiafeng Guo, Jun Xu, Xueqi Cheng
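As a rough illustration of the next-word prediction problem described in this abstract, the sketch below estimates P(next | previous) with an add-k smoothed bigram model and blends it with a second probability by linear interpolation. The function names, the add-k smoothing and the interpolation weight `lam` are assumptions made for illustration; the abstract does not say how the paper's hybrid approach is built.

```python
from collections import Counter

def train_bigram(sentences):
    """Count bigram contexts and bigrams from tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        for prev, nxt in zip(tokens, tokens[1:]):
            unigrams[prev] += 1            # times `prev` occurs as a context
            bigrams[(prev, nxt)] += 1      # times `nxt` follows `prev`
    return unigrams, bigrams

def bigram_prob(prev, nxt, unigrams, bigrams, vocab_size, k=1.0):
    """Add-k smoothed estimate of P(nxt | prev)."""
    return (bigrams[(prev, nxt)] + k) / (unigrams[prev] + k * vocab_size)

def hybrid_prob(p_statistical, p_neural, lam=0.5):
    """Linear interpolation of two next-word probabilities (illustrative only)."""
    return lam * p_statistical + (1.0 - lam) * p_neural
```

Here `p_neural` stands in for the output of any neural language model. Smoothing keeps unseen bigrams from receiving zero probability, which is the sparsity issue the abstract refers to.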
Dynamic-K Recommendation with Personalized Decision Boundary
Abstract
In this paper, we investigate the recommendation task in the most common scenario with implicit feedback (e.g., clicks, purchases). State-of-the-art methods in this direction usually cast the problem as learning a personalized ranking over a set of items (e.g., webpages, products). The top-N results are then provided to users as recommendations, where N is usually a fixed number pre-defined by the system according to some heuristic criteria (e.g., page size, screen size). One major assumption underlies this fixed-number recommendation scheme, i.e., that there are always sufficient items relevant to users’ preferences. Unfortunately, this assumption may not always hold in real-world scenarios. In some applications, there might be very few candidate items to recommend, and some users may have very high relevance requirements. In such cases, even the top-1 ranked item may not be relevant to a user’s preference. Therefore, we argue that it is critical to provide a dynamic-K recommendation, where K varies with the candidate item set and the target user. We formulate this dynamic-K recommendation task as a joint learning problem with both ranking and classification objectives. The ranking objective is the same as in existing methods, i.e., to create a ranked list of items according to users’ interests. The classification objective is unique to this work and aims to learn a personalized decision boundary that differentiates relevant items from irrelevant ones. Based on these ideas, we extend two state-of-the-art ranking-based recommendation methods, i.e., BPRMF and HRM, to the corresponding dynamic-K versions, namely DK-BPRMF and DK-HRM. Our experimental results on two datasets show that the dynamic-K models are more effective than the original fixed-N recommendation methods.
Yan Gao, Jiafeng Guo, Yanyan Lan, Huaming Liao
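The dynamic-K idea in this abstract can be pictured at prediction time as follows: rank the candidates by score, then keep only those above a per-user decision boundary, so the list length varies by user and candidate set. This is a minimal sketch that assumes scores and a per-user threshold are already learned. The function name `dynamic_k_recommend` and the optional cap `max_n` are hypothetical, and the sketch does not reproduce the DK-BPRMF or DK-HRM training objectives.

```python
def dynamic_k_recommend(scores, threshold, max_n=None):
    """Return items ranked by score, keeping only those above the user's threshold.

    scores:    dict mapping item id -> predicted preference score for one user
    threshold: that user's decision boundary (assumed to be learned elsewhere)
    max_n:     optional upper bound on list length, e.g. a page size
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    kept = [item for item, score in ranked if score > threshold]
    return kept if max_n is None else kept[:max_n]
```

With a strict threshold the function can return an empty list, which corresponds to the case in the abstract where even the top-1 item is not relevant.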
The Impact of Profile Coherence on Recommendation Performance for Shared Accounts on Smart TVs
Abstract
Most recommendation algorithms assume that an account represents a single user, and capture a user’s interest from what he/she has preferred. However, in some applications, e.g., video recommendation on smart TVs, an account is often shared by multiple users who tend to have disparate interests. This poses great challenges for delivering personalized recommendations. In this paper, we propose the concept of profile coherence to measure the coherence of an account’s interests, which in our implementation is computed as the average similarity between items in the account profile. Furthermore, we evaluate the impact of profile coherence on the quality of recommendation lists generated for coherent and incoherent accounts by different variants of item-based collaborative filtering. Experiments conducted on a large-scale watch log from smart TVs confirm that profile coherence indeed impacts the quality of recommendation lists in various aspects: accuracy, diversity and popularity.
Tao Lian, Zhengxian Li, Zhumin Chen, Jun Ma
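The abstract defines profile coherence as the average similarity between items in an account profile. A minimal sketch of that computation follows. It assumes each item is a numeric feature vector and uses cosine similarity, neither of which the abstract specifies, so the item representation and the function names are illustrative only.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def profile_coherence(item_vectors):
    """Average pairwise similarity over all distinct item pairs in a profile."""
    pairs = list(combinations(item_vectors, 2))
    if not pairs:
        return 0.0  # a profile with fewer than two items has no pairs to compare
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)
```

Under this definition, a shared account whose members watch unrelated content would tend to score lower than a single-user account with focused interests.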
Academic Access Data Analysis for Literature Recommendation
Abstract
Academic reading plays an important role in researchers’ daily life. To alleviate the burden of seeking relevant literature in rapidly growing academic repositories, different kinds of recommender systems have been introduced in recent years. However, most existing work has focused on adopting traditional recommendation techniques, like content-based filtering or collaborative filtering, in the literature recommendation scenario. Little work has yet been done on analyzing academic reading behaviors to understand the reading patterns and information needs of real-world academic users, which would be a foundation for improving existing recommender systems or designing new ones. In this paper, we aim to tackle this problem by carrying out an empirical analysis of large-scale academic access data, which can be viewed as a proxy for academic reading behaviors. We conduct global, group-based and sequence-based analyses to address the following questions: (1) Are there any regularities in users’ academic reading behaviors? (2) Do users with different levels of activeness exhibit different information needs? (3) How can one’s future demands be correlated with his/her historical behaviors? By answering these questions, we not only unveil useful patterns and strategies for literature recommendation, but also identify some challenging problems for future development.
Yixing Fan, Jiafeng Guo, Yanyan Lan, Jun Xu, Xueqi Cheng

Understanding Users

Frontmatter
Incorporating Position Bias into Click-Through Bipartite Graph
Abstract
The click-through bipartite graph has been regarded as an effective method in search user behavior analysis. In most existing studies on bipartite graph construction, user clicks are treated as equally important. However, given the position bias in user click-through behavior, clicks on results at different ranking positions should be treated differently. In this work, we choose a classical click-through bipartite graph model, the label propagation model, and evaluate the performance improvement obtained by considering the effect of position bias. We propose three hypotheses to explain the influence of position bias and modify the formulas of the label propagation algorithm accordingly. We use AUC as the evaluation metric, which expresses the effectiveness of spam URL identification by the label propagation algorithm and its improved variants. The experimental results demonstrate that the proposed methods work better than the baseline method.
Rongjie Cai, Cheng Luo, Yiqun Liu, Shaoping Ma, Min Zhang
A Study of User Image Search Behavior Based on Log Analysis
Abstract
The study of user behavior in Web search helps to understand users’ search intents and improve the ranking quality of search results. To better understand users’ Web image search behavior in a practical environment, we investigate user behavior by analyzing a query log collected over one week from a popular image search engine in China. We focus on individual query analyses, temporal distribution, click-through behavior on search engine result pages (SERPs), and behavior on preview pages. Compared to general Web search, image search users usually submit shorter query strings, and their selections of query terms are more diverse. We find that there is a huge difference among users in image search click-through behavior. Users are also more likely to perform exploratory search than in general Web search. These findings may provide insights into users’ behavior in the context of image search and may benefit multiple aspects of image search, such as UI design, effectiveness evaluation and ranking algorithms.
Zhijing Wu, Xiaohui Xie, Yiqun Liu, Min Zhang, Shaoping Ma
User Preference Prediction in Mobile Search
Abstract
As search requests from mobile devices grow very quickly, mobile search evaluation has become one of the central concerns in mobile search studies. Beyond the traditional Cranfield paradigm, side-by-side user preference between two ranked lists does not rely on user behavior assumptions and has been shown to produce more accurate results than traditional evaluation methods based on “query-document” relevance. On the other hand, result list preference judgements have a very high annotation cost. Previous studies have attempted to assist human judges by automatically predicting preference. However, whether these models are effective in the mobile search environment is still under investigation. In this paper, we propose a machine learning model to predict user preference automatically in the mobile search environment. We find that relevance features can predict user preference very well, so we compare the agreement of evaluation metrics with side-by-side user preferences on our dataset. Inspired by this agreement comparison method, we propose new relevance features to build models. Experimental results show that our proposed model can predict user preference very effectively.
Mengyang Liu, Cheng Luo, Yiqun Liu, Min Zhang, Shaoping Ma
How Users Select Query Suggestions Under Different Satisfaction States?
Abstract
Query suggestion (or recommendation) has become an important technique in commercial search engines (e.g., Google, Bing and Baidu) for improving users’ search experience. Most existing studies on query suggestion focus on formalizing various query suggestion models, while neglecting to investigate how users select query suggestions under different satisfaction states. Specifically, although a number of effective query suggestion models have been proposed, some basic problems have not been well investigated, for example: (i) how important the query suggestion feature is to users with respect to different queries; (ii) how a user’s satisfaction with the current search results influences the selection of query suggestions. In this paper, we conduct an extensive user study with a search engine interface in order to investigate the above problems. Through the user study, we obtain a series of insightful findings which may benefit the design of future search engines and query suggestion models.
Zhenguo Shang, Jingfei Li, Peng Zhang, Dawei Song, Benyou Wang

NLP for IR

Frontmatter
Tripartite-Replicated Softmax Model for Document Representations
Abstract
Text mining tasks based on machine learning require inputs to be represented as fixed-length vectors, and effective vectors of words, phrases, sentences and even documents may greatly improve the performance of these tasks. Recently, distributed word representations based on neural networks have proven powerful in many tasks by encoding abundant semantic and linguistic information. However, document representation remains a great challenge because of the complex semantic structures in different documents. To meet this challenge, we propose two novel tripartite graphical models for document representations by incorporating word representations into the Replicated Softmax model, and we name the models the Tripartite-Replicated Softmax model (TRPS) and the directed Tripartite-Replicated Softmax model (d-TRPS), respectively. We also introduce some optimization strategies for training the proposed models to learn better document representations. The proposed models can capture linear relationships among words and latent semantic information within documents simultaneously, thus learning both linear and nonlinear document representations. We examine the learned document representations in a document classification task and a document retrieval task. Experimental results show that the representations learned by our models outperform those of state-of-the-art models on both tasks.
Bo Xu, Hongfei Lin, Lin Wang, Yuan Lin, Kan Xu, Xiaocong Wei, Dong Huang
A Normalized Framework Based on Multiple Relationships for Document Re-ranking
Abstract
Document re-ranking has been widely adopted in Information Retrieval as a way of improving the precision of top documents based on the first-round retrieval results. Some methods use semi-supervised learning on graphs constructed from similarities between documents. However, most of them only consider relationships between documents. In this paper, we propose an approach that takes into consideration the relationships between documents, between words in documents, and between documents and words. We develop a novel generative model which integrates a neural language model with a latent semantic model, and then incorporate the relationships between documents and words into a normalized framework to re-rank documents based on the initial retrieval results. Experimental results show that the method achieves significant improvements over other baseline methods.
Wenyu Zhao, Dong Zhou
Constructing Semantic Hierarchies via Fusion Learning Architecture
Abstract
Semantic hierarchy construction aims to build structures of concepts linked by hypernym-hyponym (“is-a”) relations. A major challenge for this task is the automatic discovery of such relations. We propose a fusion learning architecture based on word embeddings for constructing semantic hierarchies, composed of a discriminative generative fusion architecture and a very simple auxiliary lexical structure rule. It achieves an F1-score of 74.20% with a precision of 91.60%, outperforming the state-of-the-art methods on a manually labeled test dataset. Furthermore, combining our method with manually built hierarchies improves the F1-score to 82.01%. In addition, the fusion learning architecture is language-independent.
Tianwen Jiang, Ming Liu, Bing Qin, Ting Liu
Jointly Learning Bilingual Sentiment and Semantic Representations for Cross-Language Sentiment Classification
Abstract
Cross-language sentiment classification (CLSC) aims at leveraging the semantic and sentiment knowledge in a resource-abundant language (source language) for sentiment classification in a resource-scarce language (target language). This paper proposes an approach to jointly learn bilingual semantic and sentiment representations (BSSR) for English-Chinese CLSC. First, two neural networks are adopted to learn sentence-level sentiment representations in the English and Chinese views respectively; these are attached to all word semantic representations in the corresponding sentence so that each word is expressed in its sentiment context. Then, another two neural networks, one per view, are designed to jointly learn the BSSR of the document from word representations concatenated with their sentence-level sentiment representations. The proposed approach can capture rich sentiment and semantic information in the BSSR learning process. Experiments on the NLP&CC 2013 CLSC dataset show that our approach is competitive with state-of-the-art results.
Huiwei Zhou, Yunlong Yang, Zhuang Liu, Yingyu Lin, Pengfei Zhu, Degen Huang
Stacked Learning for Implicit Discourse Relation Recognition
Abstract
Existing discourse relation recognition systems have distinctive advantages, such as superior classification models, reliable feature selection, or rich training data. This suggests that it is feasible to make the systems collaborate with each other within a uniform framework. In this paper, we propose a stacked learning based collaborative approach. Through two-level learning, it exploits the confidence of different systems in determining discourse relations. Experiments on PDTB show that our method yields promising improvement.
Yang Xu, Huibin Ruan, Yu Hong

IR and Applications

Frontmatter
Network Structural Balance Analysis for Sina Microblog Based on Particle Swarm Optimization Algorithm
Abstract
Research on the structural balance of networks is of great importance for both theory and practical applications, and has received extensive attention from scholars in diverse fields in recent years. The computation and transformation of structural balance primarily aim at calculating the cost of converting an unbalanced network into a balanced one. In this paper, we propose an efficient method to study the structural balance of the microblog network. First, we model the structural balance of a social network as a mathematical optimization problem. Second, we design an energy function that incorporates structural balance theory. Finally, since the standard particle swarm optimization algorithm cannot deal with discrete problems, we redefine the velocity and position update rules of particles from a discrete perspective to solve the modeled optimization problem. Experiments on real data sets demonstrate that our method is efficient.
Xia Fu, Yajun Du, Yongtao Ye
Vision Saliency Feature Extraction Based on Multi-scale Tensor Region Covariance
Abstract
When image saliency features are extracted using region covariance, the low-level higher-order data are handled by vectorization. However, the structure of the data (color, intensity, direction) may be lost in the process, leading to a poorer representation and overall performance degradation. In this paper we introduce an approach for sparse representation of region covariance that preserves the inherent structure of the image. The approach first computes the low-level image data (color, intensity, direction), then uses a multi-scale transform to extract multi-scale features for constructing the tensor space, and finally extracts the low-level image features from the region covariance by tensor sparse coding. We compare the experimental results with those of commonly used feature extraction algorithms. The results show that the proposed algorithm comes closer to the actual object boundary and achieves better results.
Shimin Wang, Mingwen Wang, Jihua Ye, Anquan Jie
Combining Large-Scale Unlabeled Corpus and Lexicon for Chinese Polysemous Word Similarity Computation
Abstract
Word embeddings have achieved outstanding performance in word similarity measurement. However, most prior works focus on building models with one embedding per word, neglecting the fact that a word can have multiple senses. This paper proposes two sense embedding learning methods for Chinese polysemous words, based on a large-scale unlabeled corpus and a lexicon respectively. The corpus-based method labels the senses of polysemous words by clustering their contexts with tf-idf weights, and uses HowNet to initialize the number of senses instead of simply assuming a fixed number for each polysemous word. The lexicon-based method extends AutoExtend to Tongyici Cilin with some related lexicon constraints for sense embedding learning. Furthermore, the two methods are combined for Chinese polysemous word similarity computation. Experiments on the Chinese Polysemous Word Similarity Dataset show the effectiveness and complementarity of our two sense embedding learning methods. The final Spearman rank correlation coefficient reaches 0.582, which outperforms the state of the art on the evaluation dataset.
Huiwei Zhou, Chen Jia, Yunlong Yang, Shixian Ning, Yingyu Lin, Degen Huang
Latent Dirichlet Allocation Based Image Retrieval
Abstract
In recent years, the Bag-of-Visual-Word (BoVW) model has been widely used in computer vision. However, BoVW ignores not only spatial information but also semantic information between visual words. In this study, a latent Dirichlet allocation (LDA) based model is proposed to obtain the semantic relations of visual words. Because the LDA-based topic model used alone usually degrades performance, a visual language model (VLM) is combined linearly with the LDA-based topic model to represent each image. On our dataset, the proposed approach is compared with state-of-the-art approaches (such as BoVW, LLC, SPM and VLM). Experimental results indicate that the proposed approach outperforms the original BoVW, LLC, SPM and VLM.
Jing Hao, Hongxi Wei

Query Processing and Analysis

Frontmatter
Leveraging External Knowledge to Enhance Query Model for Event Query
Abstract
Retrieval based on event queries has recently become one of the most popular applications in the information retrieval domain; its goal is to retrieve event-related documents for a given query about a specific event. However, conventional retrieval methods usually perform poorly on this kind of task. To enhance the query model and improve retrieval effectiveness for event queries, this paper presents an adaptive learning approach for the PLSA model. By leveraging knowledge of known coarse-grained events from an external resource, the new approach can adaptively adjust the topic generative process of the PLSA model on pseudo-relevance feedback documents and learn an accurate language model for a particular topic, i.e., the target event, which can be used to update the representation of the user’s intention and ultimately improve the retrieval results. Experimental results on standard TREC collections show that the proposed approach consistently outperforms the state-of-the-art methods.
Wang Pengming, Li Peng, Li Rui, Wang Bin
Musical Query-by-Semantic-Description Based on Convolutional Neural Network
Abstract
We present a new music retrieval system based on query by semantic description (QBSD), in which a novel song can be used as a query and transformed into a semantic vector by a convolutional neural network. The method is based on supervised multi-class labeling (SML), in which a song is annotated with semantically meaningful tags and relevant songs are retrieved from a semantically annotated database. The CAL500 data set is used in the experiments, and a deep learning model is learned for each tag in the semantic space. To improve the annotation quality, a loss function adjustment algorithm and the SMOTE algorithm are employed. The experimental results show that this model can retrieve songs with high semantic similarity and provides a more natural way to retrieve music.
Jing Qin, Hongfei Lin, Dongyu Zhang, Shaowu Zhang, Xiaocong Wei
A Feature Extraction and Expansion-Based Approach for Question Target Identification and Classification
Abstract
Detecting question target words in user questions is a crucial step in question target classification, as these words can precisely reflect users’ potential needs. In this paper we propose a concise approach named QTF_EE to identify question target words, extract question target features and expand the features for question target classification. On two publicly available datasets labeled with 50 answer types, we compare the QTF_EE approach with 12 conventional classification methods, such as bag-of-words and Random Forest, as baselines. The results show that the QTF_EE approach outperforms the baselines and improves question target classification performance to an accuracy of 87.4%, demonstrating its effectiveness in question target identification.
Wenxiu Xie, Dongfa Gao, Tianyong Hao
Combine Non-text Features with Deep Learning Structures Based on Attention-LSTM for Answer Selection
Abstract
Because of the lexical gap between questions and answer candidates, methods that use only word features cannot solve the Answer Selection (AS) problem well. In this paper, we apply an LSTM model with attention to extract the latent semantic information of sentences and propose a method for learning non-text features. In addition, we propose an index to evaluate the sorting ability of models that have the same accuracy value. Our model achieves better accuracy and F1 than other known models, and its ranking results, including MAP, AvgRec and MRR, trail only those of the KeLP system and the Beihang MSRA system in SemEval-2017 Task 3 Subtask A.
Chang’e Jia, Chengjie Sun, Bingquan Liu, Lei Lin
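Two of the ranking measures named in this abstract, MRR and MAP, can be sketched from their textbook definitions as below. Each query is represented as a list of binary relevance labels in ranked order. The function names are hypothetical, and the official SemEval scorer may differ in details such as rank cut-offs, so this should be read as an illustration and not as the evaluation script used by the authors.

```python
def reciprocal_rank(labels):
    """1 / rank of the first relevant answer, or 0.0 if none is relevant."""
    for rank, relevant in enumerate(labels, start=1):
        if relevant:
            return 1.0 / rank
    return 0.0

def average_precision(labels):
    """Mean of precision@rank taken at each relevant position in the ranking."""
    hits, total = 0, 0.0
    for rank, relevant in enumerate(labels, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def mean_over_queries(metric, ranked_label_lists):
    """Average a per-query metric over all queries (gives MRR or MAP)."""
    return sum(metric(labels) for labels in ranked_label_lists) / len(ranked_label_lists)
```

Two models with the same classification accuracy can order the candidate answers differently, which is why ranking measures like these are reported alongside accuracy and F1.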
Backmatter
Metadata
Title
Information Retrieval
Edited by
Jirong Wen
Jianyun Nie
Tong Ruan
Prof. Yiqun Liu
Tieyun Qian
Copyright Year
2017
Electronic ISBN
978-3-319-68699-8
Print ISBN
978-3-319-68698-1
DOI
https://doi.org/10.1007/978-3-319-68699-8
