2015 | Book

Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data

14th China National Conference, CCL 2015 and Third International Symposium, NLP-NABD 2015, Guangzhou, China, November 13-14, 2015, Proceedings

About this book

This book constitutes the refereed proceedings of the 14th China National Conference on Computational Linguistics, CCL 2015, and of the Third International Symposium on Natural Language Processing Based on Naturally Annotated Big Data, NLP-NABD 2015, held in Guangzhou, China, in November 2015.

The 34 papers presented were carefully reviewed and selected from 283 submissions. The papers are organized in topical sections on lexical semantics and ontologies; semantics; sentiment analysis, opinion mining and text classification; machine translation; multilinguality in NLP; machine learning methods for NLP; knowledge graph and information extraction; discourse, coreference and pragmatics; information retrieval and question answering; social computing; NLP applications.

Table of Contents

Frontmatter

Lexical Semantics and Ontologies

Frontmatter
Building a Collation Element Table for a Large Chinese Character Set in YES

YES is a simplified stroke-based method for sorting Chinese characters. It requires no stroke counting or grouping, and is thus much faster and more accurate than the traditional method. This paper presents a collation element table built in YES for a large joint Chinese character set covering (a) all 20,902 characters of Unicode CJK Unified Ideographs, (b) all 11,408 characters in the Complete List of Chinese Characters Used by the Media in 2013, and (c) all 13,000-plus characters in the latest versions of the Xinhua Dictionary (v11) and the Contemporary Chinese Dictionary (v6). Of the 20,902 Chinese characters in Unicode, 97.23% have a one-to-one relationship with their stroke order codes in YES, compared with 90.69% under the traditional method. With the secondary and tertiary sorting levels of stroke layout and Unicode value added, a one-to-one relationship between characters and collation elements is guaranteed. The collation element table has been successfully applied to sorting CC-CEDICT, a Chinese-English dictionary of over 112,000 word entries.

Xiaoheng Zhang, Xiaotong Li
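
The three sorting levels described in the abstract above (stroke order code, stroke layout, Unicode value) can be illustrated with a tuple-based sort key. This is only a minimal sketch: the table contents, the code strings and the names `TABLE`, `collation_key` and `sort_chars` are hypothetical and are not taken from the paper or from the actual YES scheme.

```python
from typing import NamedTuple


class CollationElement(NamedTuple):
    stroke_order: str  # primary level: stroke order code (made-up values below)
    layout: int        # secondary level: stroke layout code (made-up values below)
    code_point: int    # tertiary level: Unicode value, unique for every character


# Hypothetical table: these codes are invented for illustration only.
TABLE = {
    "一": ("h", 0),
    "土": ("hsh", 0),
    "士": ("hsh", 1),
    "工": ("hsh", 2),
}


def collation_key(ch: str) -> CollationElement:
    """Build a three-level key; the code point breaks any remaining ties."""
    stroke_order, layout = TABLE[ch]
    return CollationElement(stroke_order, layout, ord(ch))


def sort_chars(chars):
    """Sort characters by primary, then secondary, then tertiary level."""
    return sorted(chars, key=collation_key)


if __name__ == "__main__":
    print(sort_chars(["工", "士", "一", "土"]))
```

Because the last component is the code point, two distinct characters can never share a key, which is one simple way to obtain the one-to-one property the abstract mentions.
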
Improved Learning of Chinese Word Embeddings with Semantic Knowledge

While previous studies show that modeling the minimum meaning-bearing units (characters or morphemes) benefits learning vector representations of words, they ignore the semantic dependencies across these units when deriving word vectors. In this work, we propose to improve the learning of Chinese word embeddings by exploiting semantic knowledge. The basic idea is to take the semantic knowledge about words and their component characters into account when designing composition functions. Experiments show that our approach outperforms two strong baselines on word similarity, word analogy, and document classification tasks.

Liner Yang, Maosong Sun
Incorporating Word Clustering into Complex Noun Phrase Identification

Because professional technical literature contains many complex noun phrases, identifying those phrases has important practical value for tasks such as machine translation. Through analysis of such phrases in Chinese-English bilingual sentence pairs from aircraft technical publications, we present an annotation specification, based on the existing specification, for labeling those phrases, together with a method for complex noun phrase identification. In addition to basic features, including the word and the part of speech, we incorporate word clustering features trained by the Brown clustering model and the Word Vector Class (WVC) model on large unlabeled data into the machine learning model. Experimental results indicate that combining different word clustering features with the basic features improves system performance, raising the F-score by 1.83 % over the method that uses only the basic features.

Lihua Xue, Guiping Zhang, Qiaoli Zhou, Na Ye
A Three-Layered Collocation Extraction Tool and Its Application in China English Studies

We design a three-layered collocation extraction tool by integrating syntactic and semantic knowledge and apply it in China English studies. The tool first extracts peripheral collocations in the frequency layer from dependency triples, then extracts semi-peripheral collocations in the syntactic layer by association measures, and finally extracts core collocations in the semantic layer with a similar word thesaurus. The syntactic constraints filter out much noise from surface co-occurrences, and the semantic constraints are effective in identifying the very “core” collocations. The tool is applied to automatically extract collocations from a large corpus of China English that we compiled, to explore how China English, as a variety of English, is nativized. We then analyze similarities and differences among the typical China English collocations of a group of verbs. The tool and results can be applied in the compilation of language resources for Chinese-English translation and corpus-based China studies.

Jingxiang Cao, Dan Li, Degen Huang

Semantics

Frontmatter
The Designing and Construction of Domain-oriented Vietnamese-English-Chinese FrameNet

Frame Semantics and FrameNet are known as an example of a semantic theory model that supports large engineering projects of knowledge representation and maintains long-term vitality. At the same time, the initial goal of FrameNet was to build a large online computational dictionary, so its semantic frames lack systematicness and hierarchy as a whole and do not distinguish between the two concepts of “semantic domain” and “topic domain”. These problems make it difficult to unify the concrete goal, the domains, the frame structure, the annotation method and the overall scale of non-English FrameNet construction, and have created obstacles to applying multi-language FrameNets in NLP. As a result, we propose some ideas on Domain-oriented Multilingual Frame Semantic Representation (DOMLFSR). The construction of the Domain-oriented Vietnamese-English-Chinese FrameNet (DOV-E-CFN) is a concrete practice of DOMLFSR. On the basis of DOV-E-CFN, we give a preliminary analysis of an event extraction application based on the kernel dependency graph (KDG).

Li Lin, Huihui Chen, Yude Bi
Semantic Role Labeling Using Recursive Neural Network

Semantic role labeling (SRL) is an important NLP task for understanding the semantics of real-world sentences. SRL assigns semantic roles to different phrases in a sentence for a given word. We design a recursive neural network model for SRL. On the one hand, compared with traditional shallow models, our model does not depend on many rich hand-designed features. On the other hand, unlike early deep models, our model is able to add many shallow features. Furthermore, our model uses the global structure information of parse trees. In our experiments, we evaluate on the CoNLL-2005 data and reach competitive performance with fewer features.

Tianshi Li, Baobao Chang
A Comparative Analysis of Chinese Simile and Metaphor Based on a Large Scale Chinese Corpus

This paper puts forward the mapping inheritance hypothesis, which states that the ultimate goal of mapping is to inherit the attributes of source domains, by comparing the structures ‘as A as Y’(像Y一样A) and ‘n n/n + n’. Furthermore, we have built a knowledge base for simile and explored the distribution of source domains and their attribute hierarchy. The study shows that the number of S domain words in Chinese simile differs from that in metaphor; the two have only 155 S domain words in common. Although simile and metaphor both tend to choose the semantic category B_object as their source domain, simile expressions are more likely to choose plants and animals, whereas metaphorical expressions are more likely to choose inanimate objects.

Zhimin Wang, Yuxiang Jia, Pierangelo Lacasella
Chinese Textual Entailment Recognition Enhanced with Word Embedding

Textual entailment has been proposed as a unifying generic framework for modeling language variability and semantic inference in different Natural Language Processing (NLP) tasks. By evaluating on NTCIR-11 RITE3 Simplified Chinese subtask data set, this paper firstly demonstrates and compares the performance of Chinese textual entailment recognition models that combine different lexical, syntactic, and semantic features. Then a word embedding based lexical entailment module is added to enhance classification ability of our system further. The experimental results show that the word embedding for lexical semantic relation reasoning is effective and efficient in Chinese textual entailment.

Zhichang Zhang, Dongren Yao, Yali Pang, Xiaoyong Lu

Sentiment Analysis, Opinion Mining and Text Classification

Frontmatter
Negative Emotion Recognition in Spoken Dialogs

Increasing attention has been directed to the study of the automatic emotion recognition in human speech recently. This paper presents an approach for recognizing negative emotions in spoken dialogs at the utterance level. Our approach mainly includes two parts. First, in addition to the traditional acoustic features, linguistic features based on distributed representation are extracted from the text transcribed by an automatic speech recognition (ASR) system. Second, we propose a novel deep learning model, multi-feature stacked denoising autoencoders (MSDA), which can fuse the high-level representations of the acoustic and linguistic features along with contexts to classify emotions. Experimental results demonstrate that our proposed method yields an absolute improvement over the traditional method by 5.2 %.

Xiaodong Zhang, Houfeng Wang, Li Li, Maoxiang Zhao, Quanzhong Li
Incorporating Sample Filtering into Subject-Based Ensemble Model for Cross-Domain Sentiment Classification

Cross-domain sentiment classification has recently become popular owing to its potential applications, such as marketing. It seeks to generalize a model trained on a source domain and use it to label samples in the target domain. However, the source and target distributions differ substantially in many cases. To address this issue, we propose a comprehensive model, named joint Sample Filtering with Subject-based Ensemble Model (SF-SE), which takes sample filtering and labeling adaptation into account simultaneously. First, a sentence-level Latent Dirichlet Allocation (LDA) model that incorporates topic and sentiment together (SS-LDA) is introduced. Under this model, a high-quality training dataset is constructed in an unsupervised way. Second, inspired by the distribution variance of domain-independent and domain-specific features related to the subject of a sentence, we introduce a Subject-based Ensemble model to efficiently improve the classification performance. Experimental results show that the proposed model is effective for cross-domain sentiment classification.

Liang Yang, Shaowu Zhang, Hongfei Lin, Xianhui Wei

Machine Translation

Frontmatter
Insight into Multiple References in an MT Evaluation Metric

Current evaluation metrics in machine translation (MT) make poor use of multiple reference translations. In this paper we focus on the METEOR metric to gain in-depth insights into how best multiple references can be exploited. Results on five score selection strategies reveal that it is not always wise to choose the best (closest to MT) reference to generate the candidate score. We also propose two weighting approaches by taking into account the recurring information among references. The modified METEOR scores significantly increase the correlation with human judgments on accuracy and fluency evaluation at system level.

Ying Qin, Lucia Specia
A Hybrid Sentence Splitting Method by Comma Insertion for Machine Translation with CRF

When writing formal articles, many English writers use long sentences with few punctuation marks. Since long sentences are difficult for machine translation systems, many researchers try to split them at punctuation marks before translation. But sentences with few punctuation marks remain hard to handle. In this paper we use a log-linear model to insert commas at proper positions to split long sentences, aiming to shorten the sub-sentences and benefit machine translation. Experimental results show that our method can reasonably segment long sentences and improve the quality of machine translation.

Shuli Yang, Chong Feng, Heyan Huang
Domain Adaptation for SMT Using Sentence Weight

We describe a sentence-level domain adaptation translation system, which is trained with the sentence-weight model. Our system can take advantage of the domain information in each sentence rather than in the corpus as a whole, making it a fine-grained method for domain adaptation. By adding weights that reflect the preference of the target domain to the sentences in the training set, we can improve the domain adaptation ability of a translation system. We set up the sentence-weight model according to the similarity between sentences in the training set and the target domain text. In our method, the similarity is measured by the word frequency distribution. Our experiments on a large-scale Chinese-to-English translation task in the news domain validate the effectiveness of our sentence-weight-based adaptation approach, with gains of up to 0.75 BLEU over a non-adapted baseline system.

Xinpeng Zhou, Hailong Cao, Tiejun Zhao
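
The abstract above says that each training sentence is weighted by the similarity of its word frequency distribution to the target domain text, without naming the similarity function. The sketch below assumes cosine similarity over unigram relative frequencies purely for illustration; the function names are hypothetical and this is not the authors' implementation.

```python
import math
from collections import Counter


def freq_dist(tokens):
    """Relative unigram frequencies of a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def cosine(p, q):
    """Cosine similarity between two sparse frequency vectors (an assumed choice)."""
    dot = sum(v * q.get(w, 0.0) for w, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


def sentence_weights(train_sentences, target_tokens):
    """One weight per training sentence: similarity to the target-domain text."""
    target = freq_dist(target_tokens)
    return [cosine(freq_dist(sentence), target) for sentence in train_sentences]


if __name__ == "__main__":
    target = "stock market news stock".split()
    train = ["stock market rises".split(), "the cat sleeps".split()]
    print(sentence_weights(train, target))
```

In a real system such weights would then scale each sentence pair's contribution during model training; how that is done is outside the scope of this sketch.
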

Multilinguality in NLP

Frontmatter
Types and Constructions of Exocentric Adjectives in Tibetan

The construction [NR+V/ADJR+SUF] in Tibetan may be analyzed as a nominal phrase or as an adjective of exocentric construction, with different meanings. How to segment, tag and automatically translate the two constructions is a problem in Tibetan natural language processing. This paper describes the constructions, types, variants in given situations and ways of production of the exocentric constructions, which should offer insights for related research.

Di Jiang
Mongolian Speech Recognition Based on Deep Neural Networks

Mongolian is an influential language, and better Mongolian Large Vocabulary Continuous Speech Recognition (LVCSR) systems are required. Recently, speech recognition research has improved greatly with the introduction of Deep Neural Networks (DNNs). In this study, a DNN-based Mongolian LVCSR system is built. Experimental results show that the DNN-based models outperform the conventional models, which are based on Gaussian Mixture Models (GMMs), by a large margin for Mongolian speech recognition. Compared with the best GMM-based model, the DNN-based one obtains a relative improvement of over 50 %, making it a new state-of-the-art system in this field.

Hui Zhang, Feilong Bao, Guanglai Gao
Tibetan Word Segmentation as Sub-syllable Tagging with Syllable’s Part-of-Speech Property

When the Tibetan word segmentation task is treated as a sequence labelling problem, machine learning models such as ME and CRFs can be used to train the segmenter. The performance of the segmenter depends on many factors. In this paper, three factors are compared: the strategy for abbreviated syllables, the tag set, and the syllable’s Part-Of-Speech property. Experimental data show the following. First, if each abbreviated syllable is separated into two units for labelling rather than one, the F-measure improves by 0.06 % and 0.10 % on the 4-tag set and the 6-tag set respectively. Second, if the 6-tag set is used rather than the 4-tag set, the F-measure improves by 0.10 % and 0.14 % on the two strategies for abbreviated syllables respectively. Third, when the syllable’s Part-Of-Speech property is taken into account, the F-measure improves by 0.47 % and 0.41 % respectively over the other two methods that do not use it on the 4-tag set, and by 0.45 % and 0.35 % on the 6-tag set, which is much higher than the former improvements. It is therefore a better choice to exploit the syllable’s Part-Of-Speech property while using the sub-syllable as the tagging unit.

Huidan Liu, Congjun Long, Minghua Nuo, Jian Wu
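
Treating segmentation as sequence labelling, as in the abstract above, means converting each segmented word into per-unit tags and back. The abstract does not list its tag inventories, so the sketch below assumes the common four-tag scheme B/M/E/S (begin, middle, end, single); the function names and the placeholder units are hypothetical.

```python
def words_to_tags(words):
    """Convert segmented words (lists of units) into (unit, tag) pairs.

    Assumes a B/M/E/S tag set; the paper's own 4-tag set may differ.
    """
    tagged = []
    for units in words:
        if len(units) == 1:
            tagged.append((units[0], "S"))  # single-unit word
            continue
        for i, unit in enumerate(units):
            if i == 0:
                tag = "B"  # first unit of a multi-unit word
            elif i == len(units) - 1:
                tag = "E"  # last unit
            else:
                tag = "M"  # any unit in between
            tagged.append((unit, tag))
    return tagged


def tags_to_words(tagged):
    """Rebuild words from (unit, tag) pairs by closing a word at E or S."""
    words, current = [], []
    for unit, tag in tagged:
        current.append(unit)
        if tag in ("E", "S"):
            words.append(current)
            current = []
    if current:  # tolerate a truncated final word
        words.append(current)
    return words


if __name__ == "__main__":
    segmented = [["u1", "u2", "u3"], ["u4"], ["u5", "u6"]]
    print(words_to_tags(segmented))
```

A trained tagger such as a CRF would predict the tag column, and `tags_to_words` would then recover the segmentation.
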
Learning Distributed Representations of Uyghur Words and Morphemes

While distributed representations have proven to be very successful in a variety of NLP tasks, learning distributed representations for agglutinative languages such as Uyghur still faces a major challenge: most words are composed of many morphemes and occur only once in the training data. To address the data sparsity problem, we propose an approach to learn distributed representations of Uyghur words and morphemes from unlabeled data. The central idea is to treat morphemes rather than words as the basic unit of representation learning. We annotate a Uyghur word similarity dataset and show that our approach achieves significant improvements over CBOW, a state-of-the-art model for computing vector representations of words.

Halidanmu Abudukelimu, Yang Liu, Xinxiong Chen, Maosong Sun, Abudoukelimu Abulizi

Machine Learning Methods for NLP

Frontmatter
EHLLDA: A Supervised Hierarchical Topic Model

In this paper, we consider the problem of modeling hierarchical labeled data, such as Web pages and their placement in hierarchical directories. The state-of-the-art model, hierarchical Labeled LDA (hLLDA), assumes that each child of a non-leaf label has equal importance, and that a document in the corpus cannot be located at a non-leaf node. However, in most cases these assumptions do not match the actual situation. Thus, in this paper, we introduce a supervised hierarchical topic model, Extended Hierarchical Labeled Latent Dirichlet Allocation (EHLLDA), which aims to relax the assumptions of hLLDA by incorporating prior information about labels into hLLDA. The experimental results show that the perplexity of EHLLDA is always better than that of LLDA and hLLDA on all four datasets, and that our proposed model is also superior to hLLDA in terms of p@n.

Xian-Ling Mao, Yixuan Xiao, Qiang Zhou, Jun Wang, Heyan Huang
Graph-Based Dependency Parsing with Recursive Neural Network

Graph-based dependency parsing models have achieved state-of-the-art performance, yet their defect in feature representation is obvious: these models enforce strong independence assumptions upon tree components, thus restricting themselves to local, shallow features with limited context information. Besides, they rely heavily on hand-crafted feature templates. In this paper, we extend recursive neural network into dependency parsing. This allows us to efficiently represent the whole sub-tree context and rich structural information for each node. We propose a heuristic search procedure for decoding. Our model can also be used in the reranking framework. With words and pos-tags as the only input features, it gains significant improvement over the baseline models, and shows advantages in capturing long distance dependencies.

Pingping Huang, Baobao Chang
A Neural Network Based Translation Constrained Reranking Model for Chinese Dependency Parsing

Bilingual dependency parsing aims to improve parsing performance with the help of bilingual information. While previous work has shown improvements on either or both sides, most of it focuses on designing complicated features and relies on golden translations during training and testing. In this paper, we propose a simple yet effective translation constrained reranking model to improve Chinese dependency parsing. The reranking model is trained using a max-margin neural network without any manually designed features. Instead of using golden translations for training and testing, we relax the restrictions and use sentences generated by a machine translation system, which dramatically extends the scope of our model. Experiments on the translated portion of the Chinese Treebank show that our method outperforms the state-of-the-art monolingual Graph/Transition-based parsers by a large margin (UAS).

Miaohong Chen, Baobao Chang, Yang Liu

Knowledge Graph and Information Extraction

Frontmatter
Distantly Supervised Neural Network Model for Relation Extraction

For the task of relation extraction, distant supervision is an efficient approach to generate labeled data by aligning knowledge base (KB) with free texts. Albeit easy to scale to thousands of different relations, this procedure suffers from introducing wrong labels because the relations in knowledge base may not be expressed by aligned sentences (mentions). In this paper, we propose a novel approach to alleviate the problem of distant supervision with representation learning in the framework of deep neural network. Our model - Distantly Supervised Neural Network (DSNN) - constructs the more powerful mention level representation by tensor-based transformation and further learns the entity pair level representation which aggregates and denoises the features of associated mentions. With this denoised representation, all of the relation labels can be jointly learned. Experimental results show that with minimal feature engineering, our model generally outperforms state-of-the-art methods for distantly supervised relation extraction.

Zhen Wang, Baobao Chang, Zhifang Sui
Learning Entity Representation for Named Entity Disambiguation

In this paper we present a novel disambiguation model, based on neural networks. Most existing studies focus on designing effective man-made features and complicated similarity measures to obtain better disambiguation performance. Instead, our method learns distributed representation of entity to measure similarity without man-made features. Entity representation consists of context document representation and category representation. Document representation of an entity is learned based on deep neural network (DNN), and is directly optimized for a given similarity measure. Convolutional neural network (CNN) is employed to obtain category representation, and shares deep layers with DNN. Both models are trained jointly using massive documents collected from Baike http://baike.baidu.com/. Experiment results show that our method achieves a good performance on two datasets without any manually designed features.

Rui Cai, Houfeng Wang, Junhao Zhang
Exploring Recurrent Neural Networks to Detect Named Entities from Biomedical Text

Biomedical named entity recognition (bio-NER) is a crucial and basic step in many biomedical information extraction tasks. However, traditional NER systems are mainly based on complex hand-designed features, which are derived from various linguistic analyses and may be suited only to a specific domain. In this paper, we construct a Recurrent Neural Network to identify entity names, with word embeddings as input rather than hand-designed features. Our contributions mainly include three aspects: (1) we adapt a deep learning architecture, the Recurrent Neural Network (RNN), to entity name recognition; (2) based on the original RNNs, such as the Elman-type and Jordan-type models, an improved RNN model is proposed; (3) considering that both past and future dependencies are important information, we combine bidirectional recurrent neural networks based on information entropy at the top layer. The experiments conducted on the BioCreative II GM data set demonstrate that RNN models outperform CRF and deep neural networks (DNN). Furthermore, the improved RNN model performs better than the two original RNN models, and the combined method is effective.

Lishuang Li, Liuke Jin, Degen Huang

Discourse, Coreference and Pragmatics

Frontmatter
Predicting Implicit Discourse Relations with Purely Distributed Representations

Discourse relations between two consecutive segments play an important role in many natural language processing (NLP) tasks. However, a large portion of the discourse relations are implicit and difficult to detect due to the absence of connectives. Traditional detection approaches utilize discrete features, such as words, clusters and syntactic production rules, which not only depend strongly on the linguistic resources, but also lead to severe data sparseness. In this paper, we instead propose a novel method to predict the implicit discourse relations based on the purely distributed representations of words, sentences and syntactic features. Furthermore, we learn distributed representations for different kinds of features. The experiments show that our proposed method can achieve the best performance in most cases on the standard data sets.

Haoran Li, Jiajun Zhang, Chengqing Zong

Information Retrieval and Question Answering

Frontmatter
Answer Quality Assessment in CQA Based on Similar Support Sets

Community question answering (CQA) portals have become one of the most important sources for people seeking information on the Internet. With a great number of online users ready to help, askers are willing to post questions in CQA and are likely to obtain desirable answers. However, the answer quality in CQA varies widely, from helpful answers to abusive spam. Answer quality assessment is therefore of great significance. Most of the existing approaches evaluate answer quality based on the relevance between questions and answers. Due to the lexical gap between questions and answers, these approaches are not quite satisfactory. In this paper, a novel approach is proposed to rank the candidate answers, which utilizes support sets to reduce the impact of the lexical gap between questions and answers. First, similar questions are retrieved and support sets are produced from their high-quality answers. Based on the assumption that high-quality answers to similar questions also have intrinsic similarity, the quality of candidate answers is then evaluated through their distance from the support sets in terms of both content and structure. Unlike most of the existing approaches, prior knowledge from similar question-answer pairs is used to bridge the direct lexical and semantic gaps between questions and answers. Experiments are carried out on approximately 2.15 million real-world question-answer pairs from Yahoo! Answers to verify the effectiveness of our approach. The results on the MAP@K and MRR metrics show that the proposed approach can rank the candidate answers precisely.

Zongsheng Xie, Yuanping Nie, Songchang Jin, Shudong Li, Aiping Li
Learning to Rank Answers for Definitional Question Answering

In definitional question answering (QA), it is essential to rank the candidate answers. In this paper, we propose an online learning algorithm, which dynamically constructs the supervisor to reduce the adverse effects of the large number of bad answers and noisy data. We compare our method with two state-of-the-art definitional QA systems and two ranking algorithms, and the experimental results show that our method outperforms the others.

Shiyu Wu, Xipeng Qiu, Xuanjing Huang, Junkuo Cao
A WordNet Expansion-Based Approach for Question Targets Identification and Classification

Question target identification and classification is a fundamental and essential task for finding a suitable target answer type in a question answering system, aiming to improve question answering performance by filtering out irrelevant candidate answers. This paper presents a new automated approach for question target classification based on WordNet expansion. Our approach identifies question target words using dependency relations and answer type rules derived from the investigation of sample questions. Leveraging semantic relations, e.g., hyponymy, we expand the question target words as features and apply a widely used classifier, LibSVM, to achieve question target classification. Our experiment datasets are the standard UIUC 5500 annotated questions and the TREC 10 question dataset. The results show that our approach achieves an accuracy of 87.9 % with fine-grained classification on the UIUC dataset and 86.8 % on the TREC 10 dataset, demonstrating its effectiveness.

Tianyong Hao, Wenxiu Xie, Feifei Xu

Social Computing

Frontmatter
Clustering Chinese Product Features with Multilevel Similarity

This paper presents an unsupervised hierarchical clustering approach for grouping co-referred features in Chinese product reviews. To handle different levels of connections between co-referred product features, we consider three similarity measures, namely the literal similarity, the word embedding-based semantic similarity and the explanatory evaluation based contextual similarity. We apply our approach to two corpora of product reviews in the car and mobile phone domains. We demonstrate that combining multilevel similarity is of great value to feature normalization.

Yu He, Jiaying Song, Yuzhuang Nan, Guohong Fu
Improving Link Prediction in Social Networks by User Comments and Sentiment Lexicon

In some online Social Network Services, users are allowed to label their relationships with others, which can be represented as links with signed values. The link prediction problem is to estimate the values of unknown links from the information in the social network. Many similarity-based metrics and machine learning-based methods have been proposed. Most of these methods are based on the network topology and node states. In this paper, by considering information from user comments and a sentiment lexicon, our methods improve the performance of link prediction for both similarity-based metrics and machine learning-based methods.

Feng Liu, Bingquan Liu, Chengjie Sun, Ming Liu, Xiaolong Wang

NLP Applications

Frontmatter
Finite-to-Infinite N-Best POMDP for Spoken Dialogue Management

Partially Observable Markov Decision Process (POMDP) has been widely used as dialogue management in slot-filling Spoken Dialogue System (SDS). But there are still lots of open problems. The contribution of this paper lies in two aspects. Firstly, the observation probability of POMDP is estimated from the N-Best list of Automatic Speech Recognition (ASR) rather than the top one. This modification gives SDS a chance to address the uncertainty of ASR. Secondly, a dynamic binding technique is proposed for slots with infinite values so as to deal with uncertainty of talking object. The proposed methods have been implemented on a teach-and-learn spoken dialogue system. Experimental results show that performance of system improves significantly by introducing the proposed methods.

Guohua Wu, Caixia Yuan, Bing Leng, Xiaojie Wang
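
The first contribution in the abstract above, estimating observation probabilities from the ASR N-best list rather than only the top hypothesis, can be sketched as normalizing confidence scores over the list and using them in a belief update. This is a deliberately simplified illustration: it assumes non-negative confidence scores and a slot value that does not change between turns, and all names here are hypothetical rather than taken from the paper.

```python
def nbest_observation_probs(nbest):
    """Turn an N-best list of (hypothesis, score) pairs into a distribution.

    Assumes non-negative confidence scores; duplicates are merged.
    """
    totals = {}
    for hyp, score in nbest:
        totals[hyp] = totals.get(hyp, 0.0) + score
    z = sum(totals.values())
    if z == 0:
        return {}
    return {hyp: s / z for hyp, s in totals.items()}


def update_belief(belief, obs_probs, floor=1e-6):
    """Bayes-style update of a belief over slot values.

    Values missing from the N-best list get a small floor probability
    (an assumption) so that they are never ruled out completely.
    """
    unnorm = {v: b * obs_probs.get(v, floor) for v, b in belief.items()}
    z = sum(unnorm.values())
    return {v: p / z for v, p in unnorm.items()}


if __name__ == "__main__":
    nbest = [("beijing", 0.6), ("nanjing", 0.3), ("beijing", 0.1)]
    prior = {"beijing": 1 / 3, "nanjing": 1 / 3, "shanghai": 1 / 3}
    print(update_belief(prior, nbest_observation_probs(nbest)))
```

Using only the top hypothesis would put all observation mass on one value; keeping the whole list lets a lower-ranked but correct hypothesis accumulate belief over several turns.
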
Academic Paper Recommendation Based on Heterogeneous Graph

Digital libraries suffer from the overload problem, which forces researchers to spend much time finding relevant papers. Fortunately, recommender systems can automatically find relevant papers for researchers according to the papers they have browsed. Previous paper recommendation methods are either citation-based or content-based. In this paper, we propose a novel recommendation method with a heterogeneous graph in which both citation and content knowledge are included. In detail, a heterogeneous graph is constructed to represent both citation and content information within papers. Then, we apply a graph-based similarity learning algorithm to perform our paper recommendation task. Finally, we evaluate our proposed approach on the ACL Anthology Network data set and conduct an extensive comparison with other recommender approaches. The experimental results demonstrate that our approach outperforms traditional methods.

Linlin Pan, Xinyu Dai, Shujian Huang, Jiajun Chen
Learning Document Representation for Deceptive Opinion Spam Detection

Deceptive opinion spam in reviews of products or services is very harmful to customers' decision making. Existing approaches to detecting deceptive spam are concerned with feature design. Hand-crafted features can capture some linguistic phenomena, but they are time-consuming to build and cannot reveal the connotative semantic meaning of the review. We present a neural network to learn document-level representations. In our model, we learn to represent not only each sentence but also the whole document of the review. We apply a traditional convolutional neural network to represent the semantic meaning of sentences. We present two variant convolutional neural network models to learn the document representation. The model that takes sentence importance into consideration shows the better performance in deceptive spam detection, improving the F1 value by 5 %.

Luyang Li, Wenjing Ren, Bing Qin, Ting Liu
A Practical Keyword Recommendation Method Based on Probability in Digital Publication Domain

The growth of information and knowledge has brought great challenges to knowledge management, which includes knowledge storage, information retrieval and knowledge sharing. In the digital publication domain, books are segmented into items that each focus on a target topic for dynamic digital publication. Managing these items creates a strong need to annotate them automatically rather than having editors annotate them manually. This paper proposes a probability-based method and a hybrid method to recommend meaningful keywords for items. Experiments show that the proposed methods achieve more than 90 % precision, recall and F1 on the digital publication dataset, outperforming the traditional extraction-based and tf-idf similarity-based methods in keyword recommendation.

Yuejun Li, Xiao Feng, Shuwu Zhang
Automatic Knowledge Extraction and Data Mining from Echo Reports of Pediatric Heart Disease: Application on Clinical Decision Support

Echocardiography (Echo) reports of patients with pediatric heart disease contain much disease-related information, which provides great support to physicians in clinical decisions such as customizing treatment based on the risk level of the specific patient. With the help of natural language processing (NLP), information can be automatically extracted from free-text reports. The resulting structured data are much easier to analyze with existing data mining approaches. In this study, we extract the entity/anatomic site-feature-value (EFV) triples in the Echo reports and predict the risk level on this basis. The prediction accuracy of machine learning and rule-based methods is compared on manually prepared ideal data, to explore the application of automatic knowledge extraction to clinical decision support.

Yahui Shi, Zuofeng Li, Zheng Jia, Binyang Hu, Meizhi Ju, Xiaoyan Zhang, Haomin Li
Backmatter
Metadata
Title
Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data
Editors
Maosong Sun
Zhiyuan Liu
Min Zhang
Yang Liu
Copyright Year
2015
Electronic ISBN
978-3-319-25816-4
Print ISBN
978-3-319-25815-7
DOI
https://doi.org/10.1007/978-3-319-25816-4
