About this book

This book constitutes the refereed proceedings of the Third CCF Conference, NLPCC 2014, held in Shenzhen, China, in December 2014. The 35 revised full papers presented together with 8 short papers were carefully reviewed and selected from 110 English submissions. The papers are organized in topical sections on fundamentals on language computing; applications on language computing; machine translation and multi-lingual information access; machine learning for NLP; NLP for social media; NLP for search technology and ads; question answering and user interaction; web mining and information extraction.

Table of contents

Frontmatter

Long Papers

Fundamentals on Language Computing

A Global Generative Model for Chinese Semantic Role Labeling

The predicate and its semantic roles compose a unified entity that conveys the semantics of a given sentence. In the standard pipeline of current approaches to semantic role labeling (SRL), features are extracted for each candidate argument of a given predicate, and a classifier then assigns the roles. However, this process ignores the integrity of the predicate and its semantic roles as a whole. To address this problem, we present a global generative model in which a novel concept called Predicate-Arguments-Coalition (PAC) is proposed to encode the relations among individual arguments. Owing to PAC, our model can effectively mine the inherent properties of predicates and obtain a globally consistent solution for SRL. We conduct experiments on a standard benchmark, the Chinese PropBank. Experimental results on a single syntactic tree show that our model outperforms the state-of-the-art methods.

Haitong Yang, Chengqing Zong

Chinese Comma Disambiguation on K-best Parse Trees

Chinese comma disambiguation plays a key role in many natural language processing (NLP) tasks. This paper proposes a joint approach that combines K-best parse trees for Chinese comma disambiguation in order to reduce the dependence on syntactic parsing. Experimental results on a Chinese comma corpus show that the proposed approach significantly outperforms the baseline system. To the best of our knowledge, this is the first work to improve the performance of Chinese comma disambiguation using K-best parse trees. Moreover, we release a Chinese comma corpus which adds a layer of annotation to the manually parsed sentences in the CTB (Chinese Treebank) 6.0 corpus.

Fang Kong, Guodong Zhou

Event Schema Induction Based on Relational Co-occurrence over Multiple Documents

An event schema, which comprises a set of related events and participants, is of great importance to information extraction (IE), and inducing event schemas is a prerequisite for IE and natural language generation. Event schemas and slots are usually designed manually for traditional IE tasks. Methods for inducing event schemas automatically have been proposed recently. One of the fundamental assumptions in event schema induction is that related events tend to appear together to describe a scenario in natural-language discourse; however, previous work focused only on co-occurrence within a single document. We find that the co-occurrence of semantically typed relational tuples over multiple documents is helpful for constructing event schemas. We exploit this co-occurrence by locating the key tuple and counting relational tuples, and build a co-occurrence graph which takes into account co-occurrence information over multiple documents. Experiments show that co-occurrence information over multiple documents can help to combine similar elements of event schemas as well as to alleviate incoherence problems.

Tingsong Jiang, Lei Sha, Zhifang Sui

Negation and Speculation Target Identification

Negation and speculation are common in natural language text. Many applications, such as biomedical text mining and clinical information extraction, seek to distinguish positive/factual objects from negative/speculative ones (i.e., to determine what is negated or speculated) in biomedical texts. This paper proposes a novel task, called negation and speculation target identification, to identify the target of a negative or speculative expression. For this purpose, a new layer of the target information is incorporated over the BioScope corpus and a machine learning algorithm is proposed to automatically identify this new information. Evaluation justifies the effectiveness of our proposed approach on negation and speculation target identification in biomedical texts.

Bowei Zou, Guodong Zhou, Qiaoming Zhu

Applications on Language Computing

An Adjective-Based Embodied Knowledge Net

Motivated by findings about the embodiment of language comprehension and by difficulties in existing models of metaphor processing, this paper presents an adjective-based embodied cognitive net, which models the comprehension of knowledge from a novel perspective. Unlike the traditional approach that takes concepts as the core of knowledge comprehension, this paper views emotions as the core and the motive power by which human beings come to know the world. The paper claims that the adjective, rather than the concept, is the carrier of emotion. By nature, when people encounter a new thing, what first comes to mind is the original description (usually adjectives), and only then the concept. Thus, this paper constructs a net based on adjectives, ordered from concrete to abstract according to embodiment. In this knowledge net, nouns are included as attachments to construct a mapping between adjectives and concepts. In particular, this paper assigns embodied emotion to adjectives in order to handle emotion inference and metaphor emotion analysis in future work.

Chang Su, Jia Tian, Yijiang Chen

A Method of Density Analysis for Chinese Characters

Density analysis plays an important role in font design and recognition. This paper presents a method of density analysis for Chinese characters. A number of density metrics are adopted to describe the density of a character from both local and global perspectives, including the center-to-center distance of connected components, the gap between connected components, the ratio of perimeter to area, the area ratio of connected components, and the area ratio of holes. The experimental results demonstrate that the proposed method is effective in measuring the density of Chinese characters.

Jingwei Qu, Xiaoqing Lu, Lu Liu, Zhi Tang, Yongtao Wang

Computing Semantic Relatedness Using a Word-Text Mutual Guidance Model

The computation of relatedness between two fragments of text or two words is a challenging task in many fields. In this study, we propose a novel method for measuring semantic relatedness between word units and between text units using an iterative process, which we refer to as the word-text mutual guidance (WTMG) method. WTMG combines the surface and contextual information when computing word or text relatedness. The iterative process can start in two different ways: calculating relatedness between texts using the initial relatedness of the words, or computing the relatedness between words using the initial relatedness of the texts. This method obtains the final relatedness result after the iterative process reaches convergence. We compared WTMG with previous relatedness computation methods, which showed that obvious improvements were obtained in terms of the correlation with human judgments.

Bingquan Liu, Jian Feng, Ming Liu, Feng Liu, Xiaolong Wang, Peng Li

Short Text Feature Enrichment Using Link Analysis on Topic-Keyword Graph

In this paper, we propose a novel feature enrichment method for short text classification based on link analysis of a topic-keyword graph. After topic modeling, we re-rank the keyword distribution extracted by the biterm topic model (BTM) to make the topics more salient. Then a topic-keyword graph is constructed and link analysis is conducted. As a complement, the K-L divergence is integrated with the structural similarity to discover the most related keywords. Finally, the short text is expanded by appending these related keywords for classification. Experimental results on two open datasets validate the effectiveness of the proposed method.

Peng Wang, Heng Zhang, Bo Xu, Chenglin Liu, Hongwei Hao
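
The K-L divergence mentioned in the abstract above can be illustrated with a minimal sketch. It assumes that each keyword is represented by a discrete probability distribution over topics; the function name, the `eps` floor, and the example distributions are hypothetical, and how the authors combine the divergence with structural similarity is not stated in the abstract and is not modeled here.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """Return D_KL(p || q) for two discrete distributions of equal length.

    Terms with p_i == 0 contribute nothing; q_i is floored at eps
    (an assumption of this sketch) to avoid division by zero.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Hypothetical topic distributions for two keywords.
keyword_a = [1.0, 0.0]
keyword_b = [0.5, 0.5]
divergence = kl_divergence(keyword_a, keyword_b)  # equals log(2)
```

A lower divergence means the two keywords are spread over the topics in a more similar way. The measure is asymmetric, so `kl_divergence(a, b)` and `kl_divergence(b, a)` generally differ.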

Machine Translation and Multi-Lingual Information Access

Sentence-Length Informed Method for Active Learning Based Resource-Poor Statistical Machine Translation

This paper presents a simple but effective sentence-length informed method to select informative sentences for active learning (AL) based SMT. A length factor is introduced to penalize short sentences to balance the “exploration” and “exploitation” problem. The penalty is dynamically updated at each iteration of sentence selection by the ratio of the current candidate sentence length and the overall average sentence length of the monolingual corpus. Experimental results on NIST Chinese–English pair and WMT French-English pair show that the proposed sentence-length penalty based method performs best compared with the typical selection method and random selection strategy.

Jinhua Du, Miaomiao Wang, Meng Zhang
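
The length factor described in the abstract above can be sketched as follows. The sketch assumes that the penalty is simply the ratio of the candidate's token count to the corpus average, and that it multiplies some base informativeness score. The function names, the multiplicative combination, and the toy corpus are illustrative assumptions, not the authors' formulation.

```python
def length_penalty(candidate_len: int, avg_len: float) -> float:
    """Ratio of the candidate sentence length to the corpus average length."""
    if avg_len <= 0:
        raise ValueError("avg_len must be positive")
    return candidate_len / avg_len


def penalized_score(base_score: float, sentence: list[str], avg_len: float) -> float:
    # Assumption: the penalty scales a base informativeness score,
    # so sentences shorter than average are ranked lower.
    return base_score * length_penalty(len(sentence), avg_len)


# Toy monolingual corpus of tokenized sentences (hypothetical data).
corpus = [["a", "b"], ["a", "b", "c", "d"], ["a", "b", "c", "d", "e", "f"]]
avg_len = sum(len(s) for s in corpus) / len(corpus)
scores = [penalized_score(1.0, s, avg_len) for s in corpus]
```

With equal base scores, the two-token sentence receives half the score of an average-length sentence, which is the intended effect of penalizing short candidates.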

Detection of Loan Words in Uyghur Texts

For low-resource languages like Uyghur, data sparseness is always a serious problem in related information processing, especially in tasks based on parallel texts. To enrich bilingual resources, we detect Chinese and Russian loan words in Uyghur texts according to phonetic similarities between a loan word and its corresponding donor language word. In this paper, we propose a novel approach based on the perceptron model to discover loan words in Uyghur texts, which treats the detection of loan words in Uyghur as a classification procedure. The experimental results show that our method is capable of detecting Chinese and Russian loan words in Uyghur texts effectively.

Chenggang Mi, Yating Yang, Lei Wang, Xiao Li, Kamali Dalielihan

A Novel Rule Refinement Method for SMT through Simulated Post-Editing

Post-editing has been successfully applied to correct the output of MT systems to generate better translation, but as a downstream task its positive feedback to MT has not been well studied. In this paper, we present a novel rule refinement method which uses Simulated Post-Editing (SiPE) to capture the errors made by the MT systems and generates refined translation rules. Our method is system-independent and doesn’t entail any additional resources. Experimental results on large-scale data show a significant improvement over both phrase-based and syntax-based baselines.

Sitong Yang, Heng Yu, Qun Liu

Case Frame Constraints for Hierarchical Phrase-Based Translation: Japanese-Chinese as an Example

The hierarchical phrase-based (HPB) model has two main problems. First, without any semantic guidance, large numbers of redundant rules are extracted. Second, it cannot efficiently capture long-distance reordering. This paper proposes a novel approach that exploits case frames in the hierarchical phrase-based model in both rule extraction and decoding. The case frame is derived from case grammar theory; it captures sentence structure and assigns case information to sentence components. Our system with case frame constraints retains the properties of long-distance reordering and of phrases in the case chunk-based dependency tree. At the same time, the number of HPB rules decreases under the case frame constraints. The results of experiments carried out on Japanese-Chinese test sets show that our approach yields improvements over the HPB model (+1.48 BLEU on average).

Jiangming Liu, JinAn Xu, Jun Xie, Yujie Zhang

Machine Learning for NLP

Bridging the Language Gap: Learning Distributed Semantics for Cross-Lingual Sentiment Classification

Cross-lingual sentiment classification aims to automatically predict sentiment polarity (e.g., positive or negative) of data in a label-scarce target language by exploiting labeled data from a label-rich language. The fundamental challenge of cross-lingual learning stems from a lack of overlap between the feature space of the source language data and that of the target language data. To address this challenge, previous work in the literature mainly relies on machine translation engines or bilingual lexicons to directly adapt labeled data from the source language to the target language. However, machine translation may change the sentiment polarity of the original data. In this paper, we propose a new model which uses stacked autoencoders to learn language-independent distributed representations for the source and target languages in an unsupervised fashion. Sentiment classifiers trained on the source language can be adapted to predict sentiment polarity of the target language with the language-independent distributed representations. We conduct extensive experiments on English-Chinese sentiment classification tasks of multiple data sets. Our experimental results demonstrate the efficacy of the proposed cross-lingual approach.

Guangyou Zhou, Tingting He, Jun Zhao

A Short Texts Matching Method Using Shallow Features and Deep Features

Semantic matching is widely used in many natural language processing tasks. In this paper, we focus on semantic matching between short texts and design a model to generate deep features, which describe the semantic relevance between short “text objects”. Furthermore, we design a method to combine shallow features of short texts (i.e., LSI, VSM and some other handcrafted features) with deep features of short texts (i.e., word embedding matching of short texts). Finally, a ranking model (i.e., RankSVM) is used to make the final judgment. To evaluate our method, we apply it to the task of matching posts and responses. Experimental results show that our method achieves state-of-the-art performance by using shallow features and deep features.

Longbiao Kang, Baotian Hu, Xiangping Wu, Qingcai Chen, Yan He

A Feature Extraction Method Based on Word Embedding for Word Similarity Computing

In this paper, we introduce a new NLP task similar to word expansion task or word similarity task, which can discover words sharing the same semantic components (feature sub-space) with seed words. We also propose a Feature Extraction method based on Word Embeddings for this problem. We train word embeddings using state-of-the-art methods like word2vec and models supplied by Stanford NLP Group. Prior Statistical Knowledge and Negative Sampling are proposed and utilized to help extract the Feature Sub-Space. We evaluate our model on WordNet synonym dictionary dataset and compare it to word2vec on synonymy mining and word similarity computing task, showing that our method outperforms other models or methods and can significantly help improve language understanding.

Weitai Zhang, Weiran Xu, Guang Chen, Jun Guo

Word Vector Modeling for Sentiment Analysis of Product Reviews

In recent years, the large number of product reviews on the internet has become an important source of information for potential customers. These reviews help customers research products or services before making purchase decisions. Thus, sentiment analysis of product reviews has become a hot topic in the fields of natural language processing and text mining. Considering the good performance of unsupervised neural network language models in a wide range of natural language processing tasks, a semi-supervised deep learning model is proposed for sentiment analysis. The model introduces supervised sentiment labels into traditional neural network language models. It enhances the ability of word vectors to express sentiment information as well as semantic information. Experiments on the NLPCC2014 product review datasets demonstrate that our method outperforms the traditional methods and the methods of other teams.

Yuan Wang, Zhaohui Li, Jie Liu, Zhicheng He, Yalou Huang, Dong Li

Cross-Lingual Sentiment Classification Based on Denoising Autoencoder

Sentiment classification systems rely on high-quality emotional resources. However, these resources are unevenly distributed across languages. How to leverage the rich labeled data of one language (the source language) for sentiment classification in a resource-poor language (the target language), namely cross-lingual sentiment classification (CLSC), has become a research focus. This paper utilizes rich English resources for Chinese sentiment classification. To bridge the language gap between English and Chinese, this paper proposes a combined CLSC approach based on the denoising autoencoder. First, two classifiers based on the denoising autoencoder are learned in the English and Chinese views respectively, using an English corpus and an English-to-Chinese corpus. Second, we classify the Chinese test data and the Chinese-to-English test data with the two classifiers trained in the two views. Last, the final sentiment classification results are obtained by combining the results from the two views. Experiments are carried out on the NLP&CC 2013 CLSC dataset, which includes book, DVD and music categories. The results show that our approach achieves an accuracy of 80.02%, which outperforms the current state-of-the-art systems.

Huiwei Zhou, Long Chen, Degen Huang

NLP for Social Media

Aspect-Object Alignment Using Integer Linear Programming

Target extraction is an important task in opinion mining, in which a complete target consists of an aspect and its corresponding object. However, previous work simply considers the aspect as the target and ignores an important element, the “object.” The resulting incomplete target is of limited use for practical applications. This paper proposes a novel and important sentiment analysis task: aspect-object alignment, which aims to obtain the correct corresponding object for each aspect, to solve the “object ignoring” problem. We design a two-step framework for this task. In the first step, we provide an aspect-object alignment classifier that incorporates three sets of features. However, the objects assigned to aspects in a sentence often contradict each other. To solve this problem, in the second step we impose two kinds of constraints, intra-sentence constraints and inter-sentence constraints, which are encoded as linear formulations, and we use Integer Linear Programming (ILP) as an inference procedure to obtain a final global decision. Experiments on corpora from the camera domain show the effectiveness of the framework.

Yanyan Zhao, Bing Qin, Ting Liu

Sentiment Classification of Chinese Contrast Sentences

In this paper, we study sentiment classification of Chinese contrast sentences, which are a commonly used language construct in text. In a typical review, such sentences make up roughly 6% or more of the sentences. Due to the complex contrast phenomenon, it is hard to model such sentences with the traditional bag-of-words representation. In this paper, we propose a Two-Layer Logistic Regression (TLLR) model to leverage the contrast relationship in sentiment classification. Depending on the connective, our model can treat different clauses differently in sentiment classification. Experimental results show that the TLLR model can effectively improve the performance of sentiment classification of Chinese contrast sentences.

Junjie Li, Yu Zhou, Chunyang Liu, Lin Pang

Emotion Classification of Chinese Microblog Text via Fusion of BoW and eVector Feature Representations

Sentiment analysis has been a hot research topic in recent years. Emotion classification is a finer-grained form of sentiment analysis that considers more than the polarity of sentiment. In this paper, we present our system for emotion analysis of Sina Weibo texts at both the document and sentence level, which detects whether a text is sentimental and further decides which emotion classes it conveys. The emotions of focus are seven basic emotion classes: anger, disgust, fear, happiness, like, sadness and surprise. Our baseline system uses a supervised machine learning classifier (support vector machine, SVM) based on bag-of-words (BoW) features. In a contrast system, we propose a novel approach to construct an emotion lexicon and to generate a new feature representation of text, which we call the emotion vector (eVector). Our experimental results show that both systems classify emotion significantly better than random guessing. Fusing the two systems yields an additional gain, which indicates that they capture complementary information.

Chengxin Li, Huimin Wu, Qin Jin

Social Media as Sensor in Real World: Geolocate User with Microblog

People in the real world always exist in two dimensions: time and space. Detecting users’ locations automatically is significant for many location-based applications such as dietary recommendation and tourism planning. With the rapid development of social media such as Sina Weibo and Twitter, more and more people publish messages at any time which contain their real-time location information. This makes it possible to detect users’ locations automatically from social media. In this paper, we propose a method to detect a user’s city-level locations based only on his/her published posts in social media. Our approach considers two components: a Chinese location library and a model based on the distribution of words over locations. The former is used to check whether a location name is mentioned in the post. The latter is used to mine the location information implied by the non-location words in the post. Furthermore, for a user’s detected location sequence, we consider the transfer speed between two adjacent locations to smooth the sequence in context. Experiments on a real dataset from Sina Weibo demonstrate that our approach significantly outperforms baseline methods in terms of Precision, Recall and F1.

Xueqin Sui, Zhumin Chen, Kai Wu, Pengjie Ren, Jun Ma, Fengyu Zhou

A Novel Calibrated Label Ranking Based Method for Multiple Emotions Detection in Chinese Microblogs

Microblogging services have become increasingly popular for people to exchange their feelings and opinions. Extracting and analyzing the sentiments in microblogs has drawn extensive attention from both academic researchers and commercial companies. Previous literature usually focused on classifying microblogs into positive or negative categories. However, people’s sentiments are much more complex, and multiple fine-grained emotions may coexist in just one short microblog text. In this paper, we regard emotion analysis as a multi-label learning problem and propose a novel calibrated label ranking based framework for detecting multiple fine-grained emotions in Chinese microblogs. We combine a learning-based method and a lexicon-based method to build unified emotion classifiers, which alleviates the sparsity of the training microblog dataset. Experimental results on the NLPCC 2014 evaluation dataset show that our proposed algorithm achieves the best performance and significantly outperforms other participants’ methods.

Mingqiang Wang, Mengting Liu, Shi Feng, Daling Wang, Yifei Zhang

Enhance Social Context Understanding with Semantic Chunks

Social context understanding is a fundamental problem in social analysis. Social contexts are usually short, informal and incomplete, and these characteristics cause methods designed for formal texts to perform poorly on social contexts. However, we find that some of the relations between important words in formal texts are helpful for understanding social contexts. We propose a method that extracts semantic chunks using these relations to express social contexts. A semantic chunk is a meaningful and significant phrase that describes the gist of a given text. We exploit semantic chunks by utilizing knowledge learned from semantically parsed corpora and a knowledge base. Experimental results on Chinese and English data sets demonstrate that our approach improves the performance significantly.

Siqiang Wen, Zhixing Li, Juanzi Li

NLP for Search Technology and Ads

Estimating Credibility of User Clicks with Mouse Movement and Eye-Tracking Information

Click-through information has been regarded as one of the most important signals for implicit relevance feedback in Web search engines. Because large variation exists in users’ personal characteristics, such as search expertise, domain knowledge, and carefulness, different user clicks should not be treated as equally important. Different from most existing works that try to estimate the credibility of user clicks based on click-through or querying behavior, we propose to enrich the credibility estimation framework with mouse movement and eye-tracking information. In the proposed framework, the credibility of user clicks is evaluated with a number of metrics in which a user in the context of a certain search session is treated as a relevant document classifier. With an experimental search engine system that collects click-through, mouse movement, and eye movement data simultaneously, we find that credible user behaviors could be separated from non-credible ones with a number of interaction behavior features. Further experimental results indicate that relevance prediction performance could be improved with the proposed estimation framework.

Jiaxin Mao, Yiqun Liu, Min Zhang, Shaoping Ma

Cannabis_TREATS_cancer: Incorporating Fine-Grained Ontological Relations in Medical Document Ranking

Previous work has justified the assumption that document ranking can be improved by further considering coarse-grained relations at various linguistic levels (e.g., lexical, syntactic and semantic). To the best of our knowledge, little work has been reported on incorporating fine-grained ontological relations (e.g., <cannabis, TREATS, cancer>) in document ranking. Two contributions are worth noting in this work. First, three major combination models (i.e., summation, multiplication, and amplification) are designed to re-calculate the query-document relevance score considering both the term-level Okapi BM25 relevance score and the relation-level relevance score. Second, a vector-based scoring algorithm is proposed to calculate the relation-level relevance score. Experiments on medical document ranking with the CLEF2013 eHealth Lab medical information retrieval dataset show that document ranking algorithms can be further improved by incorporating the fine-grained ontological relations.

Yunqing Xia, Zhongda Xie, Qiuge Zhang, Huiyuan Wang, Huan Zhao
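
The abstract above names three combination models without giving their formulas. The following is a hedged sketch of what such combinations could look like. All three functional forms, the `weight` and `alpha` parameters, and the function names are assumptions made for illustration; they are not the authors' definitions.

```python
def combine_sum(bm25: float, rel: float, weight: float = 0.5) -> float:
    # Assumed summation form: term-level score plus weighted relation-level score.
    return bm25 + weight * rel


def combine_product(bm25: float, rel: float) -> float:
    # Assumed multiplication form: the relation-level score rescales the BM25 score.
    return bm25 * rel


def combine_amplify(bm25: float, rel: float, alpha: float = 1.0) -> float:
    # Assumed amplification form: BM25 is boosted in proportion to the
    # relation-level score and is left unchanged when rel == 0.
    return bm25 * (1.0 + alpha * rel)


# Hypothetical (doc_id, bm25_score, relation_score) triples for one query.
candidates = [("d1", 2.0, 0.0), ("d2", 1.5, 0.5), ("d3", 1.0, 0.25)]
reranked = sorted(
    candidates,
    key=lambda c: combine_amplify(c[1], c[2]),
    reverse=True,
)
```

In this toy data, `d2` overtakes `d1` after re-ranking because its relation-level score boosts a slightly lower BM25 score.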

A Unified Microblog User Similarity Model for Online Friend Recommendation

Nowadays, people like to extend their real-life social relations into online virtual social networks. With the blooming of Web 2.0 technology, a huge number of users gather in microblogging services, such as Twitter and Weibo, to express their opinions, record their personal lives and communicate with each other. How to recommend potential good friends to a target user has been a critical problem for both commercial companies and research communities. The key issue for online friend recommendation is to design an appropriate algorithm for user similarity measurement. In this paper, we propose a novel microblog user similarity model for online friend recommendation by linearly combining multiple similarity measurements of microblogs. Our proposed model can give a more comprehensive understanding of the user relationship in the microblogging space. Extensive experiments on a real-world dataset validate that our proposed model outperforms other baseline algorithms by a large margin.

Shi Feng, Le Zhang, Daling Wang, Yifei Zhang
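
The linear combination of similarity measurements described in the abstract above can be sketched as a weighted sum. The measure names (`content`, `social`), the weights, and the function name are hypothetical; the abstract does not say which measurements are combined or how the weights are chosen.

```python
def combined_similarity(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted linear combination of per-measure similarity scores."""
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same measures")
    return sum(weights[name] * scores[name] for name in scores)


# Hypothetical similarity scores between two users, one per measure.
scores = {"content": 0.5, "social": 0.25}
weights = {"content": 0.5, "social": 0.5}
combined = combined_similarity(scores, weights)
```

Keeping the weights in a separate mapping makes it straightforward to tune them on held-out data or to add another measurement later.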

Weakly-Supervised Occupation Detection for Micro-blogging Users

In this paper, we propose a weakly-supervised occupation detection approach which can automatically detect occupation information for micro-blogging users. The weakly-supervised approach makes use of two types of user information (tweets and personal descriptions) through rule-based user occupation detection and MCS-based (MCS: multiple classifier system) user occupation detection. First, the rule-based occupation detection uses the personal descriptions of some users to create pseudo-training data. Second, based on the pseudo-training data, the MCS-based occupation detection uses tweets to perform further occupation detection. However, the pseudo-training data is severely skewed and noisy, which poses a big challenge to the MCS-based occupation detection. Therefore, we propose a class-based random sampling method and a cascaded ensemble learning method to overcome these data problems. The experiments show that the weakly-supervised occupation detection achieves good performance. In addition, although our study is conducted on Chinese, the approach is language-independent.

Ying Chen, Bei Pei

Normalization of Chinese Informal Medical Terms Based on Multi-field Indexing

Healthcare data mining and business intelligence have attracted huge industry interest in recent years. Engineers encounter a bottleneck when applying data mining tools to textual healthcare records. Many medical terms in healthcare records differ from the standard form; these are referred to as informal medical terms in this work. A study indicates that in Chinese healthcare records, a majority of the informal terms are abbreviations or typos. In this work, a multi-field indexing approach is proposed, which accomplishes the term normalization task using an information retrieval algorithm with four levels of indices: word, character, pinyin and its initial. Experimental results show that the proposed approach has advantages over the state-of-the-art approaches.

Yunqing Xia, Huan Zhao, Kaiyu Liu, Hualing Zhu

Question Answering and User Interaction

Answer Extraction with Multiple Extraction Engines for Web-Based Question Answering

Answer extraction in Web-based Question Answering aims to extract answers from snippets retrieved by search engines. Search results contain many noisy and incomplete texts, so the task is more challenging than traditional answer extraction from an off-line corpus. In this paper we discuss the important role of employing multiple extraction engines for Web-based Question Answering. Aggregating multiple engines can ease the negative effect of noisy search results on any single method. We adopt a Pruned Rank Aggregation method which performs pruning while aggregating candidate lists provided by multiple engines. It fully leverages redundancies within and across lists to reduce noise in the candidate list without hurting answer recall. In addition, we rank the aggregated list with a Learning to Rank framework using similarity, redundancy, quality and search features. Experimental results on TREC data show that our method is effective in reducing noise in the candidate list and greatly helps to improve answer ranking results. Our method outperforms the state-of-the-art answer extraction method and is capable of dealing with the noisy search snippets in Web-based QA.

Hong Sun, Furu Wei, Ming Zhou

Answering Natural Language Questions via Phrasal Semantic Parsing

Understanding natural language questions and converting them into structured queries has been considered a crucial way to help users access large-scale structured knowledge bases. However, the task usually involves two main challenges: recognizing users’ query intention and mapping the involved semantic items against a given knowledge base (KB). In this paper, we propose an efficient pipeline framework to model a user’s query intention as a phrase-level dependency DAG, which is then instantiated with respect to a specific KB to construct the final structured query. Our model benefits from the efficiency of linear structured prediction models and the separation of KB-independent and KB-related modeling. We evaluate our model on two datasets, and the experimental results show that our method outperforms the state-of-the-art methods on the Free917 dataset, and that, with limited training data from Free917, our model can smoothly adapt to a new and challenging dataset, WebQuestion, without extra training effort while maintaining promising performance.

Kun Xu, Sheng Zhang, Yansong Feng, Dongyan Zhao

A Fast and Effective Method for Clustering Large-Scale Chinese Question Dataset

Question clustering plays an important role in QA systems. Due to data sparseness and the lexical gap in questions, there is not sufficient information to guarantee good clustering results. Besides, previous work pays little attention to the complexity of algorithms, which makes them infeasible on large-scale datasets. In this paper, we propose a novel similarity measure that employs word relatedness as additional information to help calculate the similarity between questions. Based on this similarity measure and the k-means algorithm, we propose a semantic k-means algorithm and an extended version of it. Experimental results show that the proposed methods perform comparably to state-of-the-art methods and take less time.

Xiaodong Zhang, Houfeng Wang

Web Mining and Information Extraction

A Hybrid Method for Chinese Entity Relation Extraction

Entity relation extraction, which refers to extracting the relation between two entities from input text, is an important task in information extraction. Previous research usually converted this problem into a sequence labeling problem and used statistical models such as conditional random fields to solve it. This kind of method needs a large, high-quality training dataset, so it has two main drawbacks: 1) for some target relations, it is not difficult to get training instances, but their quality is poor; 2) for some other relations, it is hard to get enough training data automatically. In this paper, we propose a hybrid method to overcome these shortcomings. To address the first drawback, we design an improved candidate sentence selection method that finds high-quality training instances, and then use them to train our extraction model. To address the second drawback, we produce heuristic rules to extract entity relations. In the experiments, the candidate sentence selection method improves the average F1 value by 78.53%, and some detailed suggestions are given. We submitted 364944 triples with a precision of 46.3% to the Sougou Chinese entity relation extraction competition and ranked 4th on the platform.

Hao Wang, Zhenyu Qi, Hongwei Hao, Bo Xu

Linking Entities in Tweets to Wikipedia Knowledge Base

Entity linking has received increasing attention. The purpose of entity linking is to link the mentions in a text to the corresponding entities in a knowledge base. Most work on entity linking targets long texts, such as BBS posts or blogs. Microblogs are a new kind of social platform, and entity linking on them faces many problems. In this paper, we divide the entity linking task into two parts. The first part is entity candidate generation and feature extraction. We use information from Wikipedia articles to generate enough entity candidates and eliminate ambiguous candidates as far as possible, to obtain higher coverage with fewer candidates. In terms of features, we adopt belief propagation, which is based on the topic distribution, to obtain a global feature. The experimental results show that our method achieves better performance than the method based on common links. When global features are combined with local features, the performance improves noticeably. The second part is entity candidate ranking. Traditional learning to rank methods have been widely used in the entity linking task. However, entity linking does not need to consider the ranking order of non-target entities. Thus, we use a non-ranking boosting algorithm to predict the target entity, which leads to 77.48% accuracy.

Xianqi Zou, Chengjie Sun, Yaming Sun, Bingquan Liu, Lei Lin

Automatic Recognition of Chinese Location Entity

Recognition of Chinese location entities is an important part of event extraction. In this paper we propose a novel method to identify Chinese location entities based on a divide-and-conquer strategy. First, we use CRF role labeling to identify the basic place name. Second, we build an indicator lexicon in a semi-automatic way. Finally, we propose an attachment connection algorithm that connects the basic place name with the indicator, which yields the location entity. In brief, our method decomposes a location entity into a basic place name and an indicator, which differs from traditional methods. Experimental results show that the proposed method is highly effective, with the F-value reaching 84.79%.

Xuewei Li, Xueqiang Lv, Kehui Liu

Detect Missing Attributes for Entities in Knowledge Bases via Hierarchical Clustering

Automatically constructed knowledge bases often suffer from quality issues such as the lack of attributes for existing entities. Manually finding and filling in missing attributes is time-consuming and expensive, since the volume of knowledge bases is growing at an unforeseen speed. We therefore propose an automatic approach that suggests missing attributes for entities via hierarchical clustering, based on the intuition that similar entities may share a similar group of attributes. We evaluate our method on a randomly sampled set of 20,000 entities from DBpedia. The experimental results show that our method achieves high precision and outperforms existing methods.

Bingfeng Luo, Huanquan Lu, Yigang Diao, Yansong Feng, Dongyan Zhao
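
The intuition that similar entities share similar attributes can be sketched as follows. This is an illustrative toy under stated assumptions, not the authors' system: it assumes Jaccard similarity over attribute sets, average-link agglomerative clustering stopped at a hypothetical `threshold`, and a hypothetical `min_share` rule for proposing attributes. The entity and attribute names are invented for the example.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two attribute sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_entities(entities, threshold=0.4):
    """Average-link agglomerative clustering over attribute sets.

    Merging stops once the most similar pair of clusters falls below
    `threshold` (an assumed stopping rule).
    """
    clusters = [[name] for name in sorted(entities)]

    def link(c1, c2):
        sims = [jaccard(entities[x], entities[y]) for x in c1 for y in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda p: link(clusters[p[0]], clusters[p[1]]))
        if link(clusters[i], clusters[j]) < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i stays valid
    return clusters


def suggest_missing(entities, threshold=0.4, min_share=0.6):
    """Suggest attributes held by at least `min_share` of an entity's cluster."""
    suggestions = {}
    for cluster in cluster_entities(entities, threshold):
        counts = {}
        for name in cluster:
            for attr in entities[name]:
                counts[attr] = counts.get(attr, 0) + 1
        common = {a for a, c in counts.items() if c / len(cluster) >= min_share}
        for name in cluster:
            missing = sorted(common - entities[name])
            if missing:
                suggestions[name] = missing
    return suggestions


# Hypothetical entities with hypothetical attribute names.
entities = {
    "Beijing": {"country", "population", "area", "mayor"},
    "Shanghai": {"country", "population", "area", "mayor"},
    "Shenzhen": {"country", "population", "area"},
    "Mo Yan": {"birthDate", "occupation", "nationality"},
    "Lu Xun": {"birthDate", "occupation"},
}
suggested = suggest_missing(entities)
```

In this toy data the three cities end up in one cluster and the two writers in another, so an attribute is only proposed when most of an entity's own cluster already has it.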

Improved Automatic Keyword Extraction Based on TextRank Using Domain Knowledge

Keyword extraction from scientific articles is beneficial for retrieving scientific articles on a certain topic and for grasping trends in academic development. For the task of keyword extraction from Chinese scientific articles, we adopt a framework that selects keyword candidates by Document Frequency Accessor Variety (DF-AV) and runs the TextRank algorithm on a phrase network. To improve the domain adaptation of keyword extraction, we introduce known keywords of a certain domain into this framework as domain knowledge. Experimental results show that domain knowledge generally improves the performance of keyword extraction.

Guangyi Li, Houfeng Wang
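
As a rough illustration of running TextRank on a phrase network, the sketch below implements the standard PageRank-style iteration on an undirected graph. The abstract does not say how domain knowledge enters the framework, so the `bias` argument, which shifts the teleport distribution toward known domain keywords, is purely an assumption, as are the function name and the toy graph.

```python
def textrank(graph, bias=None, damping=0.85, iterations=50):
    """PageRank-style scoring on an undirected phrase graph.

    graph: dict mapping each node to the set of its neighbours
           (assumed symmetric, every node with at least one neighbour).
    bias:  optional dict of non-negative weights; a larger weight makes
           the random jump land on that node more often (an assumption
           standing in for domain knowledge).
    """
    nodes = sorted(graph)
    if bias is None:
        bias = {n: 1.0 for n in nodes}
    total = sum(bias.get(n, 0.0) for n in nodes)
    teleport = {n: bias.get(n, 0.0) / total for n in nodes}
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_score = {}
        for n in nodes:
            # Each neighbour m shares its score equally among its own neighbours.
            incoming = sum(score[m] / len(graph[m]) for m in graph[n])
            new_score[n] = (1 - damping) * teleport[n] + damping * incoming
        score = new_score
    return score


# Hypothetical phrase co-occurrence graph.
graph = {
    "neural": {"network", "model"},
    "network": {"neural", "model", "training"},
    "model": {"neural", "network", "training"},
    "training": {"network", "model"},
}
plain = textrank(graph)
# Treat "training" as a known domain keyword (hypothetical weighting).
biased = textrank(graph, bias={"neural": 1, "network": 1, "model": 1, "training": 3})
```

Without the bias, "neural" and "training" occupy symmetric positions in this graph and score the same; with the bias, "training" moves ahead, which is the kind of effect one would want from injected domain keywords.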

Short Papers

A Topic-Based Reordering Model for Statistical Machine Translation

Reordering models are one of the essential components of statistical machine translation. In this paper, we propose a topic-based reordering model that predicts orders for neighboring blocks by capturing topic-sensitive reordering patterns. We automatically learn reordering examples from bilingual training data; these examples are associated with document-level and word-level topic information induced by an LDA topic model. The learned reordering examples are used as evidence to train a topic-based reordering model built on a maximum entropy (MaxEnt) classifier. We conduct large-scale experiments to validate the effectiveness of the proposed topic-based reordering model on the NIST Chinese-to-English translation task. Experimental results show that our topic-based reordering model achieves a significant performance improvement over the conventional reordering model that uses only lexical information.

Xing Wang, Deyi Xiong, Min Zhang, Yu Hong, Jianmin Yao

Online Chinese-Vietnamese Bilingual Topic Detection Based on RCRP Algorithm with Event Elements

In view of the characteristics of online Chinese-Vietnamese topic detection, we propose a Chinese-Vietnamese bilingual topic model based on the Recurrent Chinese Restaurant Process (RCRP) and integrated with event elements. First, the event elements, including the people involved, the place and the time, are extracted from new dynamic bilingual news texts. Then word pairs are tagged and aligned from the bilingual news and comments. Both the event elements and the aligned words are integrated into the RCRP algorithm to construct the proposed bilingual topic detection model. Finally, we use the model to determine whether new documents should be grouped into a new category or classified into existing categories, and thereby detect topics. In comparative experiments, the proposed model performs well on topic detection.

Wen-xu Long, Ji-xun Gao, Zheng-tao Yu, Sheng-xiang Gao, Xu-dong Hong

Random Walks for Opinion Summarization on Conversations

Opinion summarization on conversations aims to generate a sentimental summary for a dialogue and has been shown to be much more challenging than traditional topic-based summarization and general opinion summarization, due to its specific characteristics. In this study, we propose a graph-based framework for opinion summarization on conversations. In particular, a random walk model is proposed to globally rank the utterances in a conversation. The main advantage of our approach is its ability to integrate various kinds of important information, such as utterance length, opinion, and dialogue structure, into a graph to better represent the utterances in a conversation and the relationships among them. Besides, a global ranking algorithm is proposed to optimize the graph. Empirical evaluation on the Switchboard corpus demonstrates the effectiveness of our approach.

Zhongqing Wang, Liyuan Lin, Shoushan Li, Guodong Zhou

TM-ToT: An Effective Model for Topic Mining from the Tibetan Messages

Microblog platforms such as Weibo now accumulate a large volume of data, including Tibetan messages. Discovering the latent topics in such a huge volume of Tibetan data plays a significant role in tracing the dynamics of the Tibetan community, which helps uncover the public opinion of this community toward the government. Although topic models can find the latent structure in traditional document corpora, their performance on Tibetan messages is unsatisfactory because the short messages cause a severe data sparsity problem. In this paper, we propose a novel model called TM-ToT, derived from ToT (Topic over Time), that aims at mining latent topics effectively from Tibetan messages. First, we assume each topic is a mixture distribution influenced by both word co-occurrences and message timestamps; TM-ToT can therefore capture the changes of each topic over time. Subsequently, we aggregate all messages published by the same author into a lengthy pseudo-document to tackle the data sparsity problem. Finally, we present a Gibbs sampling implementation for the inference of TM-ToT. We evaluate TM-ToT on a real dataset. In our experiments, TM-ToT outperforms Twitter-LDA by a large margin in terms of perplexity. Furthermore, the quality of the latent topics generated by TM-ToT is promising.

Chengxu Ye, Wushao Wen, Ping Yang
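
The author-pooling step described above, in which all messages by the same author are merged into one longer pseudo-document before topic modeling, can be sketched in a few lines. The function name and the sample messages are hypothetical, and this shows only the aggregation, not the TM-ToT model or its Gibbs sampler.

```python
from collections import defaultdict


def pool_by_author(messages):
    """Merge short messages into one pseudo-document per author.

    messages: iterable of (author, text) pairs, in posting order.
    Returns a dict mapping each author to the concatenated text.
    """
    pooled = defaultdict(list)
    for author, text in messages:
        pooled[author].append(text)
    return {author: " ".join(texts) for author, texts in pooled.items()}


# Hypothetical short messages.
messages = [
    ("user_a", "new road opens"),
    ("user_b", "festival today"),
    ("user_a", "traffic is better"),
]
pseudo_docs = pool_by_author(messages)
```

Pooling gives each document more word co-occurrences for a topic model to work with, at the cost of assuming that one author's messages share topics.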

Chinese Microblog Entity Linking System Combining Wikipedia and Search Engine Retrieval Results

Microblogs provide a convenient and instant platform for publishing and acquiring information. The short, noisy and real-time nature of microblogs makes the Chinese microblog entity linking task a new challenge. In this paper, we investigate the linking approach and introduce the implementation of a Chinese Microblog Entity Linking (CMEL) system. In particular, we first build a synonym dictionary and process special identifiers. Then we generate a candidate set by combining Wikipedia and search engine retrieval results. Finally, we adopt an improved VSM to compute textual similarity for entity disambiguation. The accuracy of the CMEL system is 84.35%, which ranks second in the NLPCC 2014 Evaluation Entity Linking Task.

Zeyu Meng, Dong Yu, Endong Xun
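
For readers unfamiliar with vector space model (VSM) disambiguation, the sketch below picks the candidate entity whose description is most similar to the mention's context, using plain term-frequency vectors and cosine similarity. It is a baseline illustration only: the abstract's "improved VSM" is not specified, and the tokenized inputs, candidate names and function names here are assumptions.

```python
import math
from collections import Counter


def cosine(tokens_a, tokens_b):
    """Cosine similarity between two token lists using raw term frequencies."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0


def disambiguate(context_tokens, candidates):
    """Return the candidate whose description best matches the context.

    candidates: dict mapping an entity name to its description tokens.
    """
    return max(candidates, key=lambda name: cosine(context_tokens, candidates[name]))


# Hypothetical, already-segmented microblog context and candidate descriptions.
context = ["apple", "released", "new", "phone"]
candidates = {
    "Apple Inc.": ["company", "phone", "released", "computer"],
    "Apple (fruit)": ["fruit", "tree", "sweet"],
}
best = disambiguate(context, candidates)
```

For Chinese text the token lists would come from a word segmenter, and the candidate descriptions could be drawn from Wikipedia pages or search snippets, as the abstract suggests.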

Emotion Cause Detection with Linguistic Construction in Chinese Weibo Text

Identifying the cause of an emotion is a new challenge for researchers in natural language processing. Currently, there is no existing work on emotion cause detection from Chinese micro-blogging (Weibo) text. In this study, an emotion cause annotated corpus is first designed and developed by annotating the emotion cause expressions in Chinese Weibo text. To date, the corpus consists of annotations for 1,333 Chinese Weibo posts; it is the largest available resource in this research area. Based on observations on this corpus, the characteristics of emotion cause expressions are identified. Accordingly, a rule-based emotion cause detection method is developed, which uses 25 manually compiled rules. Furthermore, two machine learning based cause detection methods are developed: a classification-based method using support vector machines and a sequence labeling based method using a conditional random fields model. The experimental results show that the rule-based method achieves a 68.30% accuracy rate. Furthermore, the method based on the conditional random fields model achieves 77.57% accuracy, which is 37.45% higher than the reference baseline method. These results show the effectiveness of our proposed emotion cause detection method.

Lin Gui, Li Yuan, Ruifeng Xu, Bin Liu, Qin Lu, Yu Zhou

News Topic Evolution Tracking by Incorporating Temporal Information

Time-stamped texts or text sequences, such as news reports, are ubiquitous in real life. Tracking the topic evolution of these texts has been an issue of considerable interest. Recent work has developed methods for tracking topic shifts over long time scales. However, most of this research focuses on a large corpus. Also, it focuses only on the text itself, and no attempt has been made to explore the temporal distribution of the corpus, which could provide meaningful and comprehensive clues for topic tracking. In this paper, we formally address this problem and put forward a novel method based on the topic model. We investigate the temporal distribution of news reports on a specific event and try to integrate this information with a topic model to enhance the performance of the topic model. By focusing on a specific news event, we try to reveal more details about the event, such as how many stages the event has and what aspect each stage focuses on.

Jian Wang, Xianhui Liu, Junli Wang, Weidong Zhao

Backmatter

Further Information