
2015 | Book

Natural Language Processing and Chinese Computing

4th CCF Conference, NLPCC 2015, Nanchang, China, October 9-13, 2015, Proceedings


About this book

This book constitutes the refereed proceedings of the 4th CCF Conference on Natural Language Processing and Chinese Computing, NLPCC 2015, held in Nanchang, China, in October 2015.

The 35 revised full papers presented together with 22 short papers were carefully reviewed and selected from 238 submissions. The papers are organized in topical sections on fundamentals on language computing; applications on language computing; NLP for search technology and ads; web mining; knowledge acquisition and information extraction.

Table of Contents

Frontmatter

Fundamentals on Language Computing

Frontmatter
A Maximum Entropy Approach to Discourse Coherence Modeling

This paper introduces a maximum entropy method for Discourse Coherence Modeling (DCM). Unlike the state-of-the-art supervised entity-grid model and the unsupervised cohesion-driven model, our proposed model takes only lexical features as input, which significantly increases training and decoding speed. We conduct an evaluation on two publicly available benchmark data sets via sentence ordering tasks, and the results confirm the effectiveness of our maximum entropy based approach to DCM.
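A minimal sketch of the recipe described here, treating coherence scoring as maximum entropy (i.e., multinomial logistic regression) classification over purely lexical features; the boundary-word feature template and toy data are illustrative assumptions, not the authors' exact setup:

```python
# Illustrative sketch only: a maximum entropy (logistic regression) coherence
# scorer over simple lexical features. Feature template and data are hypothetical.
from itertools import permutations
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def doc_to_features(sentences):
    # Lexical features: word pairs straddling adjacent sentence boundaries.
    pairs = []
    for prev, cur in zip(sentences, sentences[1:]):
        pairs += [f"{w1}_{w2}" for w1 in prev.split()[-2:] for w2 in cur.split()[:2]]
    return " ".join(pairs)

# Toy training data: the original order is "coherent" (1), a shuffle is not (0).
orig = ["the cat sat on the mat", "it purred happily", "then it fell asleep"]
docs = [doc_to_features(orig), doc_to_features(orig[::-1])]
labels = [1, 0]

vec = CountVectorizer(token_pattern=r"\S+")
X = vec.fit_transform(docs)
clf = LogisticRegression().fit(X, labels)  # MaxEnt = multinomial logistic regression

# Sentence ordering: pick the permutation the model scores as most coherent.
best = max(permutations(orig),
           key=lambda p: clf.predict_proba(vec.transform([doc_to_features(list(p))]))[0][1])
print(best)
```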

Rui Lin, Muyun Yang, Shujie Liu, Sheng Li, Tiejun Zhao
Transition-Based Dependency Parsing with Long Distance Collocations

Long distance dependency relations are one of the main challenges for state-of-the-art transition-based dependency parsing algorithms. In this paper, we propose a method to improve the performance of transition-based parsing with long distance collocations. With these long distance collocations, our method provides an approximate global view of the entire sentence, somewhat similar to top-down parsing. To further improve the accuracy of decisions, we extend the set of parsing actions with two more fine-grained actions based on the types of arcs. Experimental results show that our method improves parsing performance effectively, especially for long sentences.

Chenxi Zhu, Xipeng Qiu, Xuanjing Huang
Recurrent Neural Networks with External Memory for Spoken Language Understanding

Recurrent Neural Networks (RNNs) have become increasingly popular for the task of language understanding. In this task, a semantic tagger is deployed to associate a semantic label with each word in an input sequence. The success of RNNs may be attributed to their ability to memorise long-term dependencies that relate the current-time semantic label prediction to observations many time instances away. However, the memory capacity of simple RNNs is limited because of the gradient vanishing and exploding problem. We propose to use an external memory to improve the memorisation capability of RNNs. Experiments on the ATIS dataset demonstrate that the proposed model achieves state-of-the-art results. Detailed analysis may provide insights for future research.
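The following numpy sketch illustrates the general mechanism of an attention-based read from an external memory feeding back into a recurrent update; the shapes, addressing scheme and update rule are assumptions for illustration, not the paper's exact architecture:

```python
# Minimal sketch: augmenting a recurrent state with a soft (attention-based)
# read from an external memory. All dimensions and weights are toy placeholders.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, n_slots = 8, 16
memory = rng.normal(size=(n_slots, d))        # external memory M
W_h, W_x, W_m = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

def step(h, x):
    # Content-based addressing: attend over memory slots with the hidden state.
    weights = softmax(memory @ h)             # (n_slots,)
    read = weights @ memory                   # soft read vector, shape (d,)
    # Recurrent update now conditions on the memory read as well.
    return np.tanh(W_h @ h + W_x @ x + W_m @ read)

h = np.zeros(d)
for x in rng.normal(size=(5, d)):             # a toy 5-step input sequence
    h = step(h, x)
print(h.round(3))
```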

Baolin Peng, Kaisheng Yao, Li Jing, Kam-Fai Wong
Improving Chinese Dependency Parsing with Lexical Semantic Features

Lexical semantic information plays an important role in supervised dependency parsing. In this paper, we add lexical semantic features to the feature set of a parser, obtaining improvements on the Penn Chinese Treebank. We extract semantic categories of words from HowNet and use them as the semantic information of words. Moreover, we investigate a method to compute semantic similarity between Chinese compound words, obtaining semantic information for words not recorded in HowNet. Our experiments show that unlabeled attachment scores can increase by 1.29%.
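As a rough illustration of how HowNet-style semantic categories might back off to a compound's components when the compound itself is unrecorded, here is a hedged sketch; the sememe entries are invented placeholders, and the paper's actual similarity measure may differ:

```python
# Hedged sketch: word similarity from HowNet-style sememe sets, backing off
# to character-level sememes for unrecorded compounds. Data is hypothetical.
def sememe_sim(sememes_a, sememes_b):
    # Jaccard overlap between two sememe sets.
    if not sememes_a or not sememes_b:
        return 0.0
    return len(sememes_a & sememes_b) / len(sememes_a | sememes_b)

hownet = {  # hypothetical sememe inventory, not real HowNet data
    "电脑": {"computer", "artifact"},
    "计算机": {"computer", "artifact", "machine"},
}

def compound_sim(word_a, word_b, lexicon):
    # For a compound word missing from the lexicon, back off to its characters.
    sem_a = lexicon.get(word_a, set().union(*(lexicon.get(c, set()) for c in word_a)))
    sem_b = lexicon.get(word_b, set().union(*(lexicon.get(c, set()) for c in word_b)))
    return sememe_sim(sem_a, sem_b)

print(compound_sim("电脑", "计算机", hownet))  # -> 0.666...
```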

Lvexing Zheng, Houfeng Wang, Xueqiang Lv

Machine Translation and Multi-Lingual Information Access

Frontmatter
Entity Translation with Collective Inference in Knowledge Graph

Nowadays the knowledge base (KB) is viewed as one of the important infrastructures for many web search applications and NLP tasks. However, in practice the availability of KB data varies from language to language, which greatly limits the potential usage of knowledge bases. In this paper, we propose a novel method to construct or enrich a knowledge base by entity translation with the help of another KB compiled in a different language. In our work, we concentrate on two key tasks: 1) collecting translation candidates with as good coverage as possible from various sources such as the web or lexicons; 2) building an effective disambiguation algorithm based on a collective inference approach over the knowledge graph to find the correct translation for entities in the source knowledge base. We conduct experiments on the movie domain of our in-house knowledge base from English to Chinese, and the results show the proposed method can achieve very high translation precision compared with classical translation methods, and significantly increase the volume of the Chinese knowledge base in this domain.
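The collective inference step can be pictured as score propagation over a graph of translation candidates; the sketch below is a generic power-iteration style illustration with hypothetical scores and adjacency, not the paper's algorithm:

```python
# Toy sketch of collective inference over a candidate graph: candidates that
# are consistent with confident neighbours get reinforced by propagation.
import numpy as np

candidates = ["候选A", "候选B", "候选C"]          # hypothetical candidate translations
prior = np.array([0.5, 0.3, 0.2])               # scores from translation sources
# adjacency: how strongly candidates support each other via KB relations
A = np.array([[0.0, 0.8, 0.1],
              [0.8, 0.0, 0.1],
              [0.1, 0.1, 0.0]])

score = prior.copy()
alpha = 0.5                                      # balance prior vs. propagation
for _ in range(20):                              # power-iteration style updates
    score = (1 - alpha) * prior + alpha * A @ score
    score /= score.sum()                         # keep scores normalised

best = candidates[int(np.argmax(score))]
print(dict(zip(candidates, score.round(3))), "->", best)
```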

Qinglin Li, Shujie Liu, Rui Lin, Mu Li, Ming Zhou
Stochastic Language Generation Using Situated PCFGs

This paper presents a purely data-driven approach for generating natural language (NL) expressions from their corresponding semantic representations. Our aim is to exploit a parsing paradigm for the natural language generation (NLG) task, which first encodes semantic representations with a situated probabilistic context-free grammar (PCFG), then decodes and yields natural sentences at the leaves of the optimal parsing tree. We deployed our system in two different domains, one being response generation for a Chinese spoken dialogue system, the other instruction generation for a virtual environment in English, obtaining results comparable to state-of-the-art systems both in terms of BLEU scores and human evaluation.

Caixia Yuan, Xiaojie Wang, Ziming Zhong

Machine Learning for NLP

Frontmatter
Clustering Sentiment Phrases in Product Reviews by Constrained Co-clustering

Clustering sentiment phrases in product reviews makes it convenient to get the most important information about a product directly from thousands of reviews. There are mainly two components in a sentiment phrase: the aspect word and the opinion word. We need to cluster these two parts simultaneously. Although several methods have been proposed to cluster words or phrases, limited work has been done on clustering two-dimensional sentiment phrases. In this paper, we apply a two-sided hidden Markov random field (HMRF) model to this task. We use the approach of constrained co-clustering with some prior knowledge, in a semi-supervised setting. Experimental results on sentiment phrases extracted from about 0.7 million mobile phone reviews show that this method is promising for the task and that our method outperforms baselines remarkably.

Yujie Cao, Minlie Huang, Xiaoyan Zhu
A Cross-Domain Sentiment Classification Method Based on Extraction of Key Sentiment Sentence

Cross-domain sentiment analysis focuses on problems where the source domain and the target domain are different. However, traditional sentiment classification approaches usually perform poorly on cross-domain problems. This paper therefore proposes a cross-domain sentiment classification method based on the extraction of key sentiment sentences. Firstly, based on the observation that not every part of a document is equally informative for inferring the sentiment orientation of the whole document, the concept of the key sentiment sentence is defined. Secondly, taking advantage of three properties (sentiment purity, the keyword property and the position property), we construct heuristic rules and combine them with machine learning to extract key sentiment sentences. The data is then divided into key and detail views, and integrating the two views effectively improves performance. Finally, experimental results show the superiority of our proposed method.

Shaowu Zhang, Huali Liu, Liang Yang, Hongfei Lin
Convolutional Neural Networks for Correcting English Article Errors

In this paper, convolutional neural networks are employed for English article error correction. Instead of employing features relying on human ingenuity and prior natural language processing knowledge, the words surrounding the article are taken as features. Our approach can be trained both on an error-annotated corpus and on a corpus without error annotation. Experiments are conducted on the CoNLL-2013 data set. Our approach achieves 38.10% in F1 and outperforms the best system (33.40%) that participated in the task. Experimental results demonstrate the effectiveness of our proposed approach.

Chengjie Sun, Xiaoqiang Jin, Lei Lin, Yuming Zhao, Xiaolong Wang

NLP for Social Media

Frontmatter
Automatic Detection of Rumor on Social Network

The rumor detection problem on social networks has attracted considerable attention in recent years. Most previous works focused on detecting rumors via shallow features of messages, including content and blogger features, but such shallow features cannot distinguish between rumor messages and normal messages in many cases. Therefore, in this paper we propose an automatic rumor detection method based on the combination of newly proposed implicit features and shallow features of the messages. The proposed implicit features include popularity orientation, internal and external consistency, sentiment polarity and opinion of comments, social influence, opinion retweet influence, and match degree of messages. Experiments illustrate that our rumor detection method obtains significant improvement compared with state-of-the-art approaches. The proposed implicit features are effective for rumor detection on social networks.

Qiao Zhang, Shuiyuan Zhang, Jian Dong, Jinhua Xiong, Xueqi Cheng
Multimodal Learning Based Approaches for Link Prediction in Social Networks

The link prediction problem in social networks is to estimate the value of the link that represents a relationship between social members. Researchers have proposed several methods for link prediction, and a number of features have been used. Most of these models are learned considering only features from one kind of data. In this paper, considering data from both the link network structure and user comments, each of which can imply the link value, we propose multimodal learning based approaches to predict link values. Experimental results on datasets from typical social networks show that our model can learn the joint representation of these data properly, and that the MDBN method outperforms other state-of-the-art link prediction methods.

Feng Liu, Bingquan Liu, Chengjie Sun, Ming Liu, Xiaolong Wang
Sentiment Analysis Based on User Tags for Traditional Chinese Medicine in Weibo

With Western culture and science widely accepted in China, Traditional Chinese Medicine (TCM) has become a controversial issue, so it is important to study the public's sentiment and opinions on TCM. The rapid development of online social networks, such as Twitter, makes it convenient and efficient to sample hundreds of millions of people for the aforementioned sentiment study. To the best of our knowledge, the present work is the first attempt to apply sentiment analysis to the field of TCM on Sina Weibo (a Twitter-like microblogging service in China). In our work, we first collected tweets on TCM topics from Sina Weibo and labelled the tweets as supporting or opposing TCM automatically based on user tags. Then, a Support Vector Machine classifier was built to predict the sentiment of TCM tweets without tags. Finally, we present a method to adjust the classifier results. The F-measure attained by our method is 97%.

Junhui Shen, Peiyan Zhu, Rui Fan, Wei Tan, Xueyan Zhan
Predicting User Mention Behavior in Social Networks

Mention is an important interactive behavior used to explicitly refer target users to specific information in social networks. Understanding user mention behavior can provide important insights into questions of human social behavior and improve the design of social network platforms. However, most previous works mainly focus on mentioning for the effect of information diffusion; few consider the problem of mention behavior prediction. In this paper, we propose an intuitive approach to predict user mention behavior using a link prediction method. Specifically, we first formulate the user mention prediction problem as a classification task, and then extract new features including semantic interest match, social tie, mention momentum and interaction strength to improve prediction performance. To evaluate the proposed approach, we conduct extensive experiments on a Twitter dataset. The experimental results clearly show that our approach yields a 15% increase in precision compared with the best baseline method.

Bo Jiang, Ying Sha, Lihong Wang
Convolutional Neural Networks for Multimedia Sentiment Analysis

Recently, user generated multimedia contents (e.g. text, image, speech and video) on social media are increasingly used to share experiences and emotions; for example, a tweet usually contains both text and images. Compared to analyzing the sentiment of texts and images separately, the combination of text and image may reveal tweet sentiment more adequately. Motivated by this rationale, we propose a method based on convolutional neural networks (CNN) for multimedia (tweets consisting of text and image) sentiment analysis. Two individual CNN architectures are used for learning textual features and visual features, which are then combined as the input of another CNN architecture that exploits the internal relations between text and image. Experimental results on two real-world datasets demonstrate that the proposed method achieves effective performance on multimedia sentiment analysis by capturing the combined information of texts and images.

Guoyong Cai, Binbin Xia

Applications on Language Computing

Frontmatter
An Adaptive Approach to Extract Characters from Digital Ink Text in Chinese Based on Extracted Errors

Extracting characters from digital ink text is an essential step toward more reliable text recognition and a prerequisite for structured editing. The casualness and diversity of handwriting input result in unsatisfactory accuracy of extracted characters, and reprocessing the initially extracted characters based on context yields considerable improvement. Therefore, this paper proposes an approach to adaptively extract characters from digital ink text in Chinese based on extraction errors. The approach first classifies the extraction errors in the primary extraction; according to the different types of errors, it then applies different operations. Experimental data show that the approach is effective.

Hao Bai
Context-Dependent Metaphor Interpretation Based on Semantic Relatedness

Previous work on metaphor interpretation mostly focused on single-word verbal metaphors and ignored the influence of contextual information, leading to some limitations (e.g., ignoring the polysemy of metaphors). In this paper, we propose aspect-based semantic relatedness and present a novel metaphor interpretation method based on semantic relatedness for context-dependent nominal metaphors. First, we obtain the possible comprehension aspects according to the properties of the source domain. Then, combined with contextual information, we calculate the degree of relatedness between the target and source domains from different aspects. Finally, we select the aspect which maximizes the relatedness between the target and source domains as the comprehension aspect, and the metaphor explanation is formed with the corresponding property of the source domain. The results show that our method has higher accuracy. In particular, when information about the target domain is insufficient in the corpus, our method still exhibits good performance.

Chang Su, Shuman Huang, Yijiang Chen
Context Vector Model for Document Representation: A Computational Study

To tackle the sparse data problem of the bag-of-words model for document representation, the Context Vector Model (CVM) has been proposed to enrich a document with the relatedness of all the words in a corpus to the document. The essence of CVM is the combination of word vectors, so the representation method for words is essential to CVM. A computational study is performed in this paper to compare the effects of newly proposed word representation methods embedded in CVM. The experimental results demonstrate that some of the newly proposed word representation methods significantly improve the performance of CVM, because they better estimate the relatedness between words.
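A minimal sketch of the CVM idea as summarized above: represent a document by the relatedness of every vocabulary word to it, computed from word vectors. The embeddings here are random stand-ins, and the relatedness measure (cosine to the document centroid) is an assumption:

```python
# Sketch: Context Vector Model style document representation from word vectors.
import numpy as np

rng = np.random.default_rng(1)
vocab = ["data", "model", "corpus", "banana"]
emb = {w: rng.normal(size=4) for w in vocab}     # stand-in word vectors

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cvm(doc_words):
    centroid = np.mean([emb[w] for w in doc_words], axis=0)
    # Document = vector of each vocabulary word's relatedness to the document.
    return np.array([cosine(emb[w], centroid) for w in vocab])

print(cvm(["data", "corpus"]).round(3))
```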

Yang Wei, Jinmao Wei, Hengpeng Xu

NLP for Search Technology and Ads

Frontmatter
Refine Search Results Based on Desktop Context

During a search task, a user's search intention may be inaccurate. Even with a clear information need, the search query may not precisely describe the user's need, and the user cannot possibly browse all the returned results. Thus, a selective and valuable returned search list is quite important for a search system. In fact, many reliable and highly relevant personal documents exist on a user's personal computer. Based on these desktop documents, it is relatively easy to understand the user's current knowledge level about the present search subject, which is useful for predicting the user's need. An approach is proposed to exploit the potential of desktop context to refine the returned search list. Firstly, to attain a comprehensive long-term user model, the operational history and a series of time-related information are analyzed to obtain the degree of attention a user paid to a document, and keywords and user tags are used to understand the content. Secondly, the working scenario is regarded as the most valuable information for constructing a short-term user model, which directly suggests what exactly a user is working on. Experimental results show that desktop context can effectively help refine the returned search results, and that only an effective combination of the long-term and short-term user models offers more relevant items to satisfy the user.

Xiaoyun Li, Ying Yu, Chunping Ouyang
Incorporating Semantic Knowledge with MRF Term Dependency Model in Medical Document Retrieval

Term dependency models are generally better than bag-of-words models, because complete concepts are often represented by multiple terms. However, without semantic knowledge, such models may introduce many false dependencies among terms, especially when the document collection is small and homogeneous (e.g., newswire documents, medical documents). The main contribution of this work is to incorporate semantic knowledge into term dependency models, so that more accurate dependency relations are assigned to terms in the query. In this paper, experiments are made on the CLEF 2013 eHealth Lab medical information retrieval data set, and the baseline term dependency model is the popular MRF (Markov Random Field) model [1], which has proved better than traditional independence models in general domain search. Experimental results show that, in medical document retrieval, the full dependency MRF model is worse than the independence model, but it can be significantly improved by incorporating semantic knowledge.
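For reference, the MRF retrieval model referred to here ranks documents by a weighted sum of single-term, ordered-window and unordered-window feature functions; the notation below follows Metzler and Croft's standard sequential-dependence formulation rather than anything specific to this paper:

```latex
% Standard MRF ranking function: f_T scores single query terms, f_O scores
% contiguous query term pairs in ordered windows, and f_U scores query term
% pairs co-occurring in unordered windows of the document.
P(D \mid Q) \stackrel{\text{rank}}{=}
    \lambda_T \sum_{q_i \in Q} f_T(q_i, D)
  + \lambda_O \sum_{q_i, q_{i+1} \in Q} f_O(q_i, q_{i+1}, D)
  + \lambda_U \sum_{q_i, q_j \in Q} f_U(q_i, q_j, D)
```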

Zhongda Xie, Yunqing Xia, Qiang Zhou
A Full-Text Retrieval Algorithm for Encrypted Data in Cloud Storage Applications

Nowadays, more and more Internet users use cloud storage services to store their personal data, especially as mobile devices with limited storage capacity become widespread. With cloud storage services, users can access their personal data at any time and anywhere without storing the data locally. However, the cloud storage service provider is not completely trusted, so the first concern in using cloud storage services is data security. A straightforward method to address the security problem is to encrypt the data before uploading it to the cloud server. Encryption keeps the data secret from the cloud server, but the cloud server then cannot manipulate the encrypted data, which greatly undermines the advantage of cloud storage. For example, a user encrypts his personal data before uploading it to the cloud; when he wants to access some of the data, he has to download all the data and decrypt it. Obviously, this service mode incurs huge communication and computation overheads. Several related works have been proposed to enable search over encrypted data, but all of them only support encrypted keyword search. In this paper, we propose a new full-text retrieval algorithm over encrypted data for the cloud storage scenario, in which all the words in a document are extracted and built into a privacy-preserving full-text retrieval index. Based on this index, the cloud server can execute full-text retrieval over large-scale encrypted documents. Numerical analysis and experimental results further validate the high efficiency and scalability of the proposed algorithm.
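One simple way to picture a privacy-preserving full-text index of this kind is an inverted index keyed by keyed hashes (HMACs) of words, so the server can match query trapdoors without seeing plaintext terms; the sketch below is a toy simplification, not the paper's actual scheme, and not production-grade cryptography:

```python
# Toy sketch: an encrypted inverted index built from HMACs of words.
import hmac, hashlib
from collections import defaultdict

KEY = b"user-secret-key"                      # held by the data owner only

def trapdoor(word: str) -> str:
    return hmac.new(KEY, word.encode(), hashlib.sha256).hexdigest()

# Client side: build an inverted index over all words per document.
index = defaultdict(set)
docs = {"doc1": "cloud storage keeps data remote",
        "doc2": "encrypted search over cloud data"}
for doc_id, text in docs.items():
    for word in text.split():
        index[trapdoor(word)].add(doc_id)     # server stores only hashes

# Server side: full-text lookup given a trapdoor, without learning the word.
query = trapdoor("cloud")
print(sorted(index.get(query, set())))        # -> ['doc1', 'doc2']
```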

Wei Song, Yihui Cui, Zhiyong Peng
How Different Features Contribute to the Session Search?

Session search aims to improve ranking effectiveness by incorporating user interaction information, including short-term interactions within one session and global interactions from other sessions (or other users). While various session search models have been developed and a large number of interaction features have been used, there is a lack of systematic investigation into how different features influence session search. In this paper, we propose to classify typical interaction features into four categories (current query, current session, query change, and collective intelligence). Their impact on session search performance is investigated through a systematic empirical study under the widely used learning-to-rank framework. One of our key findings, different from what has been reported in the literature, is that features based on the current query and collective intelligence have a more positive influence than features based on query change and the current session. This provides insights for the development of future session search techniques.

Jingfei Li, Dawei Song, Peng Zhang, Yuexian Hou

Web Mining

Frontmatter
Beyond Your Interests: Exploring the Information Behind User Tags

Tags have been used in different social media, such as Delicious, Flickr, LinkedIn and Weibo. In previous work, considerable efforts have been made to make use of tags without identifying their different types. In this study, we argue that tags in a user profile indicate three different types of information: the basics (age, status, locality, etc.), interests and specialty of a person. Based on this novel user tag taxonomy, we propose a tag classification approach in Weibo to build a clearer image of user profiles, which makes use of three categories of features: general statistics features (including user links with followers and followings), content features and syntax features. Furthermore, different from many previous studies on tags which concentrate on user specialties, such as expert finding, we find that valuable information can be discovered from the basics and interests user tags. We show some interesting findings in two scenarios, user profiling for people from different generations and area profiling with mass appeal, through large-scale tag clustering and mining over 6 million distinct tags from 13 million users in Weibo data.

Weizhi Ma, Min Zhang, Yiqun Liu, Shaoping Ma, Lingfeng Chen
Nonparametric Symmetric Correspondence Topic Models for Multilingual Text Analysis

Topic models aim to analyze collections of documents and have been widely used in the fields of machine learning and natural language processing. Recently, researchers proposed topic models for multilingual parallel or comparable documents; the symmetric correspondence Latent Dirichlet Allocation (SymCorrLDA) is one such model. Despite its advantages over some other existing multilingual topic models, this model is a classic Bayesian parametric model and thus cannot overcome the shortcomings of Bayesian parametric models; for example, the number of topics must be specified in advance. Motivated by this, we extend the model and propose a Bayesian nonparametric model (NPSymCorrLDA). Experiments on Chinese-English datasets extracted from Wikipedia (https://zh.wikipedia.org/) show significant improvement over SymCorrLDA.

Rui Cai, Miaohong Chen, Houfeng Wang

Knowledge Acquisition and Information Extraction

Frontmatter
Mining RDF from Tables in Chinese Encyclopedias

Understanding web tables has recently attracted a number of studies. However, many works focus on tables in English, because they usually need the help of knowledge bases, and the existing knowledge bases such as DBpedia, YAGO, Freebase and Probase mainly contain knowledge in English.

In this paper, we focus on RDF triple extraction from tables in Chinese encyclopedias. Firstly, we construct a Chinese knowledge base through taxonomy mining and class attribute mining. Then, with the help of our knowledge base, we extract triples from tables through column scoring, table classification and RDF extraction. In our experiments, we applied our approach to 6,618,544 articles from Hudong Baike containing 764,292 tables, and extracted about 1,053,407 unique and new RDF triples with an estimated accuracy of 90.2%, which outperforms other similar works.

Weiming Lu, Zhenyu Zhang, Renjie Lou, Hao Dai, Shansong Yang, Baogang Wei
Taxonomy Induction from Chinese Encyclopedias by Combinatorial Optimization

Taxonomy is an important component of knowledge bases, and Chinese taxonomy construction is an urgent, meaningful but challenging task. In this paper, we propose a taxonomy induction approach from a Chinese encyclopedia using combinatorial optimization. First, subclass-of relations are derived by validating the relation between two categories. Then, integer programming optimizations are applied to find instance-of relations from encyclopedia articles by considering the constraints among categories. The experimental results show that our approach can construct a practicable taxonomy from Chinese encyclopedias.

Weiming Lu, Renjie Lou, Hao Dai, Zhenyu Zhang, Shansong Yang, Baogang Wei
Recognition of Person Relation Indicated by Predicates

This paper focuses on recognizing person relations indicated by predicates in large-scale free text. To determine whether a sentence contains a potential relation between persons, we cast the problem as a classification task. A Dynamic Convolutional Neural Network (DCNN) is improved for this task: it uses frame convolution to make use of more features efficiently. Experimental results on Chinese person relation recognition show that the proposed model is superior to the original DCNN and several strong baseline models. We also explore employing large-scale unlabeled data to achieve further improvements.

Zhongping Liang, Caixia Yuan, Bing Leng, Xiaojie Wang
Target Detection and Knowledge Learning for Domain Restricted Question Answering

Frequently Asked Questions (FAQ) answering in restricted domains has attracted increasing attention in various areas. FAQ answering is the task of automatically responding to users' typical questions within a specific domain. Most researchers use NLP parsers to analyze users' intentions and employ ontologies to enrich the domain knowledge. However, syntax analysis performs poorly on short and informal FAQ questions, and external ontology knowledge bases in specific domains are usually unavailable and expensive to construct manually. In this research, we propose a semi-automatic domain-restricted FAQ answering framework, SDFA, which does not rely on any external resources. SDFA detects the targets of questions to assist both fast domain knowledge learning and answer retrieval. The proposed framework has been successfully applied in a real project in the banking domain. Extensive experiments on two large datasets demonstrate the effectiveness and efficiency of the approach.

Mengdi Zhang, Tao Huang, Yixin Cao, Lei Hou

Short Papers

Frontmatter
An Improved Algorithm of Logical Structure Reconstruction for Re-flowable Document Understanding

The basic idea of re-flowable document understanding and automatic typesetting is to generate logical documents by judging the hierarchical relationships of physical units and logical tags based on the identification of logical paragraph tags in the re-flowable document. To overcome the shortcomings of conventional logical structure reconstruction methods, a novel logical structure reconstruction method for re-flowable documents based on a directed graph is proposed in this paper. The method extracts the logical structure from the template document and then utilizes the directed graph's single-source shortest path algorithm to filter out redundant logical tags, thus solving the problem of logical structure reconstruction. Experimental results show that the algorithm can effectively improve the accuracy of logical structure recognition.

Lin Zhao, Ning Li, Xin Peng, Qi Liang
Mongolian Inflection Suffix Processing in NLP: A Case Study

Inflection suffixes are an important morphological characteristic of Mongolian words, since the suffixes express abundant syntactic and semantic meanings. To provide an informative introduction, this paper presents a case study on inflection suffix processing. Through three Mongolian NLP tasks, we disclose the following: (1) views of inflection suffixes in NLP tasks, (2) inflection suffix processing methods, (3) the effects of inflection suffixes on system performance, and (4) some suffix-related conclusions.

Xiangdong Su, Guanglai Gao, Yupeng Jiang, Jing Wu, Feilong Bao
Resolving Coordinate Structures for Chinese Constituent Parsing

Coordinate structures are linguistic structures consisting of two or more conjuncts, which usually compose into a larger constituent as a whole unit. However, the boundary of each conjunct is difficult to identify, which makes it difficult to parse the whole coordinate structure and larger structures. In labeled data, such as the Penn Chinese Treebank (CTB), coordinate structures are not labeled explicitly, which makes the problem more complicated. In this paper, we treat resolving coordinate structures as an independent sub-problem of parsing. We first define coordinate structures explicitly and design rules to extract them from labeled CTB data. Then a specifically designed grammar is proposed for automatic parsing of coordinate structures. We propose two groups of new features to better model coordinate structures in a shift-reduce parsing framework. Our approach achieves a 15% improvement in F1 score on resolving coordinate structures.

Yichu Zhou, Shujian Huang, Xinyu Dai, Jiajun Chen
P-Trie Tree: A Novel Tree Structure for Storing Polysemantic Data

A trie is an ordered tree data structure used to store a dynamic set or associative array where the keys are usually strings. It makes the search and update of words more efficient and is widely used in the construction of English dictionaries for the storage of English vocabulary. In big data applications, efficiency determines the availability and usability of a system. In this paper, I introduce the p-trie tree, a novel trie structure that can store polysemantic data not limited to English strings. I apply the p-trie to the storage of Japanese vocabulary and evaluate its performance through experiments.
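The exact p-trie layout is the author's; as background, a plain trie whose terminal nodes hold a list of senses already captures the "multiple meanings per key" behaviour, as in this sketch:

```python
# Sketch: a trie storing multiple meanings per key, for unicode keys.
class TrieNode:
    def __init__(self):
        self.children = {}      # char -> TrieNode
        self.meanings = []      # all senses stored under this key

class PolyTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, key, meaning):
        node = self.root
        for ch in key:          # works for any unicode string, not just ASCII
            node = node.children.setdefault(ch, TrieNode())
        node.meanings.append(meaning)

    def lookup(self, key):
        node = self.root
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return []
        return node.meanings

t = PolyTrie()
t.insert("かみ", "paper")       # Japanese homophones share one key
t.insert("かみ", "god")
t.insert("かみ", "hair")
print(t.lookup("かみ"))         # -> ['paper', 'god', 'hair']
```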

Xin Zhou
Research on the Extraction of Wikipedia-Based Chinese-Khmer Named Entity Equivalents

Named entity equivalents play a significant role in cross-language information processing. However, limited by corpus resources, few in-depth studies have been made on the extraction of bilingual Chinese-Khmer named entity equivalents. In view of this, this paper proposes a Wikipedia-based approach that utilizes the internal web links in Wikipedia and computes feature similarity to extract bilingual Chinese-Khmer named entity equivalents. The experimental results show that good performance is achieved when the entity equivalents are acquired through the internal web links in Wikipedia, with an F value of up to 90.67%. The results are also quite favorable when the equivalents are acquired through the computation of feature similarity, showing that the proposed method is effective.

Qing Xia, Xin Yan, Zhengtao Yu, Shengxiang Gao
Bilingual Lexicon Extraction with Temporal Distributed Word Representation from Comparable Corpora

Distributed word representation has been found to be highly effective for extracting a bilingual lexicon from comparable corpora via a simple linear transformation. However, polysemous words often vary their meanings at different time points in the corresponding corpora. A single word representation learned from the whole corpora cannot express the temporal change of word meaning very well. This paper proposes a simple solution which exploits temporal distributed word representations for polysemous words. The experimental results confirm that the proposed solution offers better performance on the English-to-Chinese bilingual lexicon extraction task.
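The linear-transformation baseline this work refines can be sketched in a few lines: learn a mapping W from seed translation pairs by least squares, then translate by nearest neighbour in the target space. The temporal refinement would learn embeddings, and hence mappings, per time slice; the embeddings below are random stand-ins:

```python
# Sketch of the standard linear bilingual mapping (seed dictionary -> W).
import numpy as np

rng = np.random.default_rng(0)
d = 10
# Seed dictionary: rows of X (source vectors) align with rows of Y (target).
X = rng.normal(size=(50, d))
W_true = rng.normal(size=(d, d))
Y = X @ W_true + 0.01 * rng.normal(size=(50, d))

# Least-squares fit of the mapping.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

def translate(src_vec, target_vecs):
    # Nearest target embedding to the mapped source vector (cosine).
    mapped = src_vec @ W
    sims = target_vecs @ mapped / (
        np.linalg.norm(target_vecs, axis=1) * np.linalg.norm(mapped))
    return int(np.argmax(sims))

print(translate(X[0], Y))  # -> 0: the seed pair maps back to itself
```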

Chunyue Zhang, Tiejun Zhao
Bilingually-Constrained Recursive Neural Networks with Syntactic Constraints for Hierarchical Translation Model

Hierarchical phrase-based translation models have advanced statistical machine translation (SMT). Because such models can better leverage syntactic information, two types of methods (leveraging source parsing and leveraging shallow parsing) have been applied to introduce syntactic constraints into translation models. In this paper, we propose a bilingually-constrained recursive neural network (BC-RNN) model to combine the merits of these two types of methods. First we perform supervised learning on a manually parsed corpus using the standard recursive neural network (RNN) model. Then we employ unsupervised bilingually-constrained tuning to improve the accuracy of the standard RNN model. Leveraging the BC-RNN model, we introduce both source parsing and shallow parsing information into a hierarchical phrase-based translation model. The evaluation demonstrates that our proposed method outperforms other state-of-the-art statistical machine translation methods on the National Institute of Standards and Technology 2008 (NIST 2008) Chinese-English machine translation test data.

Wei Chen, Bo Xu
Document-Level Machine Translation Evaluation Metrics Enhanced with Simplified Lexical Chain

Document-level Machine Translation (MT) has been drawing more and more attention due to its potential for resolving sentence-level ambiguities and inconsistencies with the benefit of wide-range context. However, the lack of simple yet effective evaluation metrics largely impedes the development of such document-level MT systems. This paper proposes to improve traditional MT evaluation metrics with simplified lexical chains, modeling document-level phenomena from the perspective of text cohesion. Experiments show the effectiveness of this method for evaluating document-level translation quality and its potential for integration with traditional MT evaluation metrics to achieve higher correlation with human judgments.

Zhengxian Gong, Guodong Zhou
Cross-Lingual Tense Tagging Based on Markov Tree Tagging Model

In this paper, we transform the issue of Chinese-English tense conversion into the issue of tagging a Chinese tense tree. We then propose the Markov Tree Tagging Model to tag nodes of the untagged tense tree with English tenses. Experimental results show that the method is much better than linear-chain CRF tagging for this task.

Yijiang Chen, Tingting Zhu, Chang Su, Xiaodong Shi
Building a Large-Scale Cross-Lingual Knowledge Base from Heterogeneous Online Wikis

Cross-lingual knowledge bases are very important for global knowledge sharing. However, there are few Chinese-English knowledge bases due to the following reasons: 1) the scarcity of Chinese knowledge in existing cross-lingual knowledge bases; 2) the limited number of cross-lingual links; 3) incorrect relationships in the semantic taxonomy. In this paper, a large-scale cross-lingual knowledge base (named XLORE) is built to address the above problems. In particular, XLORE integrates four online wikis, including English Wikipedia, Chinese Wikipedia, Baidu Baike and Hudong Baike, to balance the knowledge volume in different languages, employs a link-discovery method to augment the cross-lingual links, and introduces a pruning approach to refine the taxonomy. In total, XLORE harvests 663,740 classes, 56,449 properties, and 10,856,042 instances, among which 507,042 entities are cross-lingually linked. Finally, we provide an online cross-lingual knowledge base system supporting two ways to access XLORE: a search engine and a SPARQL endpoint.

Mingyang Li, Yao Shi, Zhigang Wang, Yongbin Liu
Refining Kazakh Word Alignment Using Simulation Modeling Methods for Statistical Machine Translation

Word alignment plays an important role in the training of statistical machine translation systems. We present a technique to refine word alignments at the phrase level after collecting sentences from Kazakh-English parallel corpora. The estimation technique extracts phrase pairs from the word alignment and then incorporates them into the translation system for further steps. Although it is a very important step in the training procedure, the word alignment process often raises practical concerns for agglutinative languages. We consider an approach which is a step towards an improved statistical translation model that incorporates morphological information and has better translation performance. Our goal is to present a statistical model of the morphology-dependent procedure, which was evaluated on the Kazakh-English language pair and obtained an improved BLEU score over state-of-the-art models.

Amandyk Kartbayev
A Local Method for Canonical Correlation Analysis

Canonical Correlation Analysis (CCA) is a standard statistical technique for finding linear projections of two arbitrary vectors that are maximally correlated. In complex situations, the linearity of CCA is not applicable. In this paper, we propose a novel local method for CCA to handle non-linear situations. We aim to find a series of local linear projections instead of a single global one. We evaluate the performance of our method and CCA on two real-world datasets. Our experiments show that the local method outperforms original CCA on several realistic cross-modal multimedia retrieval tasks.
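Under one plausible reading of "a series of local linear projections", the paired data are first partitioned and a separate CCA is fitted per partition; the k-means routing in this sketch is our assumption, not necessarily the paper's procedure:

```python
# Sketch: "local" CCA = cluster the data, fit one linear CCA per cluster.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                 # view 1 (e.g. image features)
Y = np.tanh(X @ rng.normal(size=(5, 4)))      # view 2, non-linearly related

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
local_models = {}
for c in range(3):
    mask = km.labels_ == c
    local_models[c] = CCA(n_components=2).fit(X[mask], Y[mask])

def project(x, y):
    # Route a pair to its local model, then apply that model's projections.
    c = int(km.predict(x.reshape(1, -1))[0])
    return local_models[c].transform(x.reshape(1, -1), y.reshape(1, -1))

u, v = project(X[0], Y[0])
print(u.round(3), v.round(3))
```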

Tengju Ye, Zhipeng Xie, Ang Li
Learning to Rank Microblog Posts for Real-Time Ad-Hoc Search

Microblogging websites have emerged as centers of information production and diffusion, where people can get useful information from other users' microblog posts. In the era of Big Data, we are overwhelmed by the large number of microblog posts. To make good use of these informative data, an effective search tool specialized for microblog posts is required. However, microblog search is not trivial for the following reasons: 1) microblog posts are noisy and time-sensitive, rendering general information retrieval models ineffective; 2) conventional IR models are not designed to consider microblog-specific features. In this paper, we propose to utilize learning-to-rank models for microblog search. We combine content-based, microblog-specific and temporal features into learning-to-rank models, which are found to model microblog posts effectively. To study the performance of learning-to-rank models, we evaluate our models on the tweet data sets provided by the TREC 2011 and TREC 2012 microblog tracks, in comparison with three state-of-the-art information retrieval baselines: the vector space model, the language model, and the BM25 model. Extensive experimental studies demonstrate the effectiveness of learning-to-rank models and the usefulness of integrating microblog-specific and temporal information for the microblog search task.

Jing Li, Zhongyu Wei, Hao Wei, Kangfei Zhao, Junwen Chen, Kam-Fai Wong
Fuzzy-Rough Set Based Multi-labeled Emotion Intensity Analysis for Sentence, Paragraph and Document

Most existing sentiment analysis methods focus on single-label classification, which means only an exclusive sentiment orientation (negative, positive or neutral) or emotion state (joy, hate, love, sorrow, anxiety, surprise, anger, or expectation) is considered for the given text. However, multiple emotions with different intensities may coexist in one document, one paragraph or even one sentence. In this paper, we propose a fuzzy-rough set based approach to detect multi-labeled emotions and calculate their corresponding intensities in social media text. Using the proposed fuzzy-rough set method, we can simultaneously model multiple emotions and their intensities with sentiment words for a sentence, a paragraph, or a document. Experiments on a well-known blog emotion corpus show that our proposed multi-labeled emotion intensity analysis algorithm outperforms baseline methods by a large margin.

Chu Wang, Shi Feng, Daling Wang, Yifei Zhang
What Causes Different Emotion Distributions of a Hot Event? A Deep Event-Emotion Analysis System on Microblogs

Current online public opinion analysis systems can discover many hot events and present the public emotion distribution for each event, which is useful for governments and companies. However, the public emotion distributions are just a shallow analysis of the hot events; more and more people want to know the hidden causation behind the emotion distributions. Thus, this paper presents a deep Event-Emotion analysis system on Microblogs to reveal what causes the different emotions of a hot event. We use several related sub-events to describe a hot event from different perspectives; these sub-events, combined with their different emotion distributions, can explain the overall emotion distribution of the hot event. Experiments on 15 hot events show that this idea is effective for exploiting emotion causation and can help people better understand the evolution of a hot event. Furthermore, the system also tracks the volume trends and emotion trends of the hot event, and presents a deep analysis based on user profiles.

Yanyan Zhao, Bing Qin, Zhenjiang Dong, Hong Chen, Ting Liu
Deceptive Opinion Spam Detection Using Deep Level Linguistic Features

This paper focuses on improving a specific opinion spam detection task: detecting deceptive spam. In addition to traditional word form and other shallow syntactic features, we introduce two types of deep-level linguistic features. The first type is derived from a shallow discourse parser trained on the Penn Discourse Treebank (PDTB), which can capture inter-sentence information. The second type is based on the relationship between sentiment analysis and spam detection. Experimental results on the benchmark dataset demonstrate that both of the proposed deep features achieve improved performance over the baseline.

Changge Chen, Hai Zhao, Yang Yang
Multi-sentence Question Segmentation and Compression for Question Answering

We present a multi-sentence question segmentation strategy for community question answering services to alleviate the complexity of long questions. We develop a complete scheme for complex-question segmentation, including a question detector to extract question sentences, a question compression process to remove duplicate information, and a graph model to segment multi-sentence questions. In the graph model, we train an SVM classifier to compute the initial weights, and we calculate the authority of each vertex to guide the propagation. The experimental results show that our method achieves a good balance between completeness and redundancy of information, and significantly outperforms state-of-the-art methods.

Yixiu Wang, Yunfang Wu, Xueqiang Lv
A User-Oriented Special Topic Generation System for Digital Newspaper

With the advent of digital newspapers, user-oriented special topic generation has become extremely important to satisfy users' requirements both functionally and emotionally. We propose a practical automatic special topic generation system for digital newspapers based on users' interests. Firstly, the system extracts a subject heading vector for the topic of interest by filtering out function words, localizing Latent Dirichlet Allocation (LDA) and training the LDA model. Secondly, it removes semantically repetitive vector components by constructing a synonym word map. Lastly, it organizes and refines the special topic according to the similarity between candidate news and the topic, and the density of topic-related terms. The experimental results show that the system is simple to operate and highly accurate, and it is stable enough to be applied to user-oriented special topic generation in practical applications.

Xi Xu, Mao Ye, Zhi Tang, Jian-Bo Xu, Liang-Cai Gao

Shared Task (Long Papers)

Frontmatter
Exploiting Heterogeneous Annotations for Weibo Word Segmentation and POS Tagging

This paper describes our system designed for the NLPCC 2015 shared task on Chinese word segmentation (WS) and POS tagging for Weibo text. We treat WS and POS tagging as two separate tasks and use a cascaded approach. Our major focus is how to effectively exploit multiple heterogeneous data to boost performance of statistical models. This work considers three sets of heterogeneous data, i.e., Weibo (WB, 10K sentences), Penn Chinese Treebank 7.0 (CTB7, 50K), and People's Daily (PD, 280K). For WS, we adopt the recently proposed coupled sequence labeling to combine WB, CTB7, and PD, boosting the F1 score from 93.76% (baseline model trained on only WB) to 95.58% (+1.82%). For POS tagging, we adopt an ensemble approach combining coupled sequence labeling and the guide-feature based method, since the three datasets follow three different annotation standards. First, we convert PD into the annotation style of CTB7 based on coupled sequence labeling, denoted by PD^CTB. Then, we merge CTB7 and PD^CTB to train a POS tagger, denoted by Tag_{CTB7+PD^CTB}, which is further used to produce guide features on WB. Finally, the tagging F1 score is improved from 87.93% to 88.99% (+1.06%).

Jiayuan Chao, Zhenghua Li, Wenliang Chen, Min Zhang
Entity Recognition and Linking in Chinese Search Queries

For the task of Entity Recognition and Linking in Chinese Search Queries at NLP&CC 2015, this paper proposes solutions to entity recognition, entity linking and entity disambiguation. A dictionary, an online knowledge base and SWJTU Chinese word segmentation are used in entity recognition. A synonym thesaurus, Wikipedia redirects and the combination of an improved PED (Pinyin Edit Distance) algorithm and LCS (Longest Common Subsequence) are applied in entity linking. The methods of suffix supplementation and link value computation based on online encyclopedias are adopted in entity disambiguation. The experimental results indicate that the proposed solutions are effective for short queries with insufficient context.
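The two string-matching ingredients named here are classic dynamic programs; a hedged sketch follows, noting that the paper's PED runs over pinyin transcriptions rather than raw strings and that its exact weighting is not given (the combination below is hypothetical):

```python
# Sketch: edit distance + LCS, the two ingredients of the linking similarity.
def edit_distance(a, b):
    # Standard Levenshtein DP; the paper's PED variant would run this over
    # pinyin transcriptions rather than raw strings.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def lcs_len(a, b):
    # Longest common subsequence length via DP.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def link_score(mention, entity, w=0.5):
    # Hypothetical combination: high LCS overlap and low edit distance both help.
    norm = max(len(mention), len(entity))
    return (w * lcs_len(mention, entity) / norm
            + (1 - w) * (1 - edit_distance(mention, entity) / norm))

print(link_score("北京大学", "北大"))  # -> 0.5
```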

Jinwei Yuan, Yan Yang, Zhen Jia, Hongfeng Yin, Junfu Huang, Jie Zhu
BosonNLP: An Ensemble Approach for Word Segmentation and POS Tagging

Chinese word segmentation and POS tagging are arguably the most fundamental tasks in Chinese natural language processing. In this paper, we present an ensemble approach for segmentation and POS tagging, combining both discriminative and generative methods to get the best of both worlds. Our approach achieved F1 scores of 96.65% and 91.55% for segmentation and tagging respectively in NLPCC 2015 Shared Task 1, obtaining first place in both tasks.

Kerui Min, Chenggang Ma, Tianmei Zhao, Haiyan Li
Research on Open Domain Question Answering System

For the open domain question answering system evaluation task at the fourth CCF Conference on Natural Language Processing and Chinese Computing (NLPCC 2015), we propose an automatic question answering solution that can answer natural language questions. Firstly, the SPE (Subject Predicate Extraction) algorithm is presented to find answers from the knowledge base, and then the WKE (Web Knowledge Extraction) algorithm is used to extract answers from search engine query results. The experimental data provided in the evaluation task include the knowledge base and questions in natural language. The evaluation results show an MRR of 0.5670, accuracy of 0.5700, and average F1 of 0.5240, indicating that the proposed method is feasible for open domain question answering.

Zhonglin Ye, Zheng Jia, Yan Yang, Junfu Huang, Hongfeng Yin
Overview of the NLPCC 2015 Shared Task: Chinese Word Segmentation and POS Tagging for Micro-blog Texts

In this paper, we give an overview of the shared task at the 4th CCF Conference on Natural Language Processing & Chinese Computing (NLPCC 2015): Chinese word segmentation and part-of-speech (POS) tagging for micro-blog texts. Different from the popular newswire datasets, the dataset of this shared task consists of relatively informal micro-texts. The shared task has two sub-tasks: (1) individual Chinese word segmentation and (2) joint Chinese word segmentation and POS tagging. Each subtask has three tracks to distinguish systems with different resources. We first introduce the dataset and task, then characterize the different approaches of the participating systems, report the test results, and provide an overview analysis of these results. An online system is available for open registration and evaluation at http://nlp.fudan.edu.cn/nlpcc2015.

Xipeng Qiu, Peng Qian, Liusong Yin, Shiyu Wu, Xuanjing Huang
Overview of the NLPCC 2015 Shared Task: Entity Recognition and Linking in Search Queries

This paper provides an overview of the Shared Task at the 4th CCF Conference on Natural Language Processing and Chinese Computing (NLPCC 2015): Entity Recognition and Linking in Search Queries, where participating systems are required to recognize entity mentions in short Chinese search queries and further link them to a given structured knowledge base. In this paper, we introduce how the task is defined, how we collected the datasets, and finally we report the evaluation results with a brief analysis.

Yansong Feng, Zhe Han, Kun Zhang
Overview of the NLPCC 2015 Shared Task: Weibo-Oriented Chinese News Summarization

The Weibo-oriented Chinese news summarization task aims to automatically generate a short summary for a given Chinese news article; the short summary is used for news release and propagation on Sina Weibo. The length of the short summary is at most 140 Chinese characters. The task can be considered a special case of single document summarization. In this paper, we introduce the evaluation dataset, the participating teams and the evaluation results. The dataset has been released publicly.

Xiaojun Wan, Jianmin Zhang, Shiyang Wen, Jiwei Tan
Overview of the NLPCC 2015 Shared Task: Open Domain QA

In this paper, we give an overview of the open domain Question Answering (open domain QA) shared task in NLPCC 2015. We first review the background of QA and then describe this year's open domain QA shared task, including the construction of the benchmark datasets, the auxiliary dataset, and the evaluation metrics. The evaluation results of submissions from participating teams are presented in the experimental part, together with a brief introduction to the techniques used in each participating team's QA system.

Nan Duan

Shared Task (Short Papers)

Frontmatter
Word Segmentation of Micro Blogs with Bagging

This paper describes the model we designed for the Chinese word segmentation task of NLPCC 2015. We first apply a word-based perceptron algorithm to build the base segmenter. Then we use a bootstrap aggregating (bagging) model, which improves the segmentation results consistently on the closed, semi-open and open tracks. Considering the characteristics of Weibo text, we also perform rule-based adaptation before decoding. Finally, our model achieves an F-score of 95.12% on the closed track, 95.3% on the semi-open track and 96.09% on the open track.
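Schematically, bagging a tagger means training several base models on bootstrap resamples and voting per decision; the sketch below uses a generic perceptron over toy character features rather than the paper's word-based perceptron, so treat it as an illustration of bagging only:

```python
# Sketch: bagging for sequence labelling via per-character majority voting.
import numpy as np
from sklearn.linear_model import Perceptron

# Toy character-level training data: features stand in for char/context ids;
# labels: 1 = word-begin (B), 0 = word-continue (I). All values are synthetic.
rng = np.random.default_rng(0)
X = rng.integers(0, 50, size=(200, 3)).astype(float)
y = rng.integers(0, 2, size=200)

models = []
for b in range(5):                            # 5 bootstrap bags
    idx = rng.integers(0, len(X), size=len(X))
    models.append(Perceptron(random_state=b).fit(X[idx], y[idx]))

def vote(x):
    # Majority vote over the bagged perceptrons for one character.
    preds = [m.predict(x.reshape(1, -1))[0] for m in models]
    return int(round(np.mean(preds)))

print([vote(x) for x in X[:8]])
```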

Zhenting Yu, Xin-Yu Dai, Si Shen, Shujian Huang, Jiajun Chen
Weibo-Oriented Chinese News Summarization via Multi-feature Combination

The past several years have witnessed the rapid development of social media services, and UGCs (User Generated Contents) have increased dramatically, such as tweets on Twitter and posts on Sina Weibo. In this paper, we describe our system for the NLPCC 2015 Weibo-oriented Chinese news summarization task. Our model is based on multi-feature combination to automatically generate a summary for a given news article. In our system, we mainly utilize four kinds of features to compute the significance score of a sentence: term frequency, sentence position, sentence length and the similarity between the sentence and the news article title. The summary sentences are then chosen according to the significance score of each sentence in the news article. The evaluation results on the Weibo news document sets show that our system is efficient for Weibo-oriented Chinese news summarization and outperforms all the other systems.
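The scoring rule described here translates almost directly into code; the weights and normalizations below are illustrative placeholders, since the exact values are not given in the abstract:

```python
# Sketch: sentence significance = weighted sum of the four described features.
from collections import Counter

def score_sentence(sent, pos, n_sents, title, doc_tf, w=(0.4, 0.3, 0.1, 0.2)):
    words = sent.split()
    tf = sum(doc_tf[t] for t in words) / max(len(words), 1)   # term frequency
    position = 1.0 - pos / n_sents                            # earlier is better
    length = min(len(words) / 20.0, 1.0)                      # prefer fuller sentences
    title_words = set(title.split())
    title_sim = len(title_words & set(words)) / max(len(title_words), 1)
    return w[0] * tf + w[1] * position + w[2] * length + w[3] * title_sim

title = "storm hits coastal city"
sents = ["a severe storm hits the coastal city overnight",
         "officials had met last week",
         "the storm caused damage across the city"]
doc_tf = Counter(w for s in sents for w in s.split())

ranked = sorted(range(len(sents)),
                key=lambda i: score_sentence(sents[i], i, len(sents), title, doc_tf),
                reverse=True)
print([sents[i] for i in ranked[:2]])   # top-2 sentences as the summary
```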

Maofu Liu, Limin Wang, Liqiang Nie
Linking Entities in Chinese Queries to Knowledge Graph

This paper presents our approach for the NLPCC 2015 shared task, Entity Recognition and Linking in Chinese Search Queries. The proposed approach takes a query as input and generates ranked mention-entity links as results. It combines several different metrics to evaluate the probability of each entity link, including entity relatedness in the given knowledge graph and document similarity between the query and the virtual document of the entity in the knowledge graph. In the evaluation, our approach achieves 33.2% precision and 65.2% recall, ranking 6th among the 14 teams according to average F1-measure.

Jun Li, Jinxian Pan, Chen Ye, Yong Huang, Danlu Wen, Zhichun Wang
A Hybrid Re-ranking Method for Entity Recognition and Linking in Search Queries

In this paper, we construct an entity recognition and linking system using Chinese Wikipedia and a knowledge base. We utilize refined filter rules in the entity recognition module, and then generate candidate entities via a search engine and attributes in Wikipedia article pages. In the entity linking module, we propose a hybrid entity re-ranking method combining three features: textual and semantic match degree, the similarity between the candidate entity and the entity mention, and entity frequency. Finally, we obtain the linking results from each entity's final score. In the task of entity recognition and linking in search queries at NLPCC 2015, the Average-F1 value of this method reached 61.1% on the 3,849-query test dataset, ranking second among fourteen teams.

Gongbo Tang, Yuting Guo, Dong Yu, Endong Xun
Backmatter
Metadata
Title
Natural Language Processing and Chinese Computing
Edited by
Juanzi Li
Heng Ji
Dongyan Zhao
Yansong Feng
Copyright year
2015
Electronic ISBN
978-3-319-25207-0
Print ISBN
978-3-319-25206-3
DOI
https://doi.org/10.1007/978-3-319-25207-0