
2018 | Book

Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data

17th China National Conference, CCL 2018, and 6th International Symposium, NLP-NABD 2018, Changsha, China, October 19–21, 2018, Proceedings

About this book

This book constitutes the proceedings of the 17th China National Conference on Computational Linguistics, CCL 2018, and the 6th International Symposium on Natural Language Processing Based on Naturally Annotated Big Data, NLP-NABD 2018, held in Changsha, China, in October 2018.

The 33 full papers presented in this volume were carefully reviewed and selected from 84 submissions. They are organized in topical sections named: Semantics; machine translation; knowledge graph and information extraction; linguistic resource annotation and evaluation; information retrieval and question answering; text classification and summarization; social computing and sentiment analysis; and NLP applications.

Table of Contents

Frontmatter

Semantics

Frontmatter
Radical Enhanced Chinese Word Embedding

Conventional Chinese word embedding models, like their English counterparts, simply use the word or character as the minimum processing unit of text, ignoring the semantic information carried by Chinese characters and the radicals within them. To this end, we propose a radical-enhanced Chinese word embedding in this paper. The model uses conversion and radical-escaping mechanisms to extract the intrinsic information in a Chinese corpus. Through an improved parallel dual-channel network built on a CBOW-like model, the word context is used together with the character-radical context to predict the target word. The word vectors generated by the model can therefore fully reflect the semantic information contained in the radicals. Word analogy and similarity experiments comparing our model with similar models show that it effectively improves the accuracy of word vector representation and the direct relevance of similar words.
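
To illustrate the dual-channel idea, here is a minimal PyTorch sketch (our own simplification, not the authors' exact architecture): one embedding table over the word context and a parallel one over the radical context, averaged and summed before predicting the target word, CBOW-style. All layer names and sizes are assumptions.

    import torch
    import torch.nn as nn

    class DualChannelCBOW(nn.Module):
        """Sketch of a CBOW-like model with a parallel radical channel."""
        def __init__(self, n_words, n_radicals, dim):
            super().__init__()
            self.word_emb = nn.Embedding(n_words, dim)    # word-context channel
            self.rad_emb = nn.Embedding(n_radicals, dim)  # radical-context channel
            self.out = nn.Linear(dim, n_words)            # predicts the target word

        def forward(self, word_ctx, rad_ctx):
            # word_ctx: (batch, n_ctx) word ids; rad_ctx: (batch, n_rad) radical ids
            h = self.word_emb(word_ctx).mean(1) + self.rad_emb(rad_ctx).mean(1)
            return self.out(h)  # logits over the word vocabulary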

Zheng Chen, Keqi Hu
Syntax Enhanced Research Method of Stylistic Features

Nowadays, research on stylistic features (SF) mainly focuses on two aspects: lexical elements and syntactic structures. The lexical elements act as the content of a sentence and the syntactic structures constitute the framework of a sentence. How to combine both aspects and exploit their common advantages is a challenging issue. In this paper, we propose a Principal Stylistic Features Analysis method (PSFA) to combine these two parts, and then mine the relations between features. From a statistical analysis point of view, many interesting linguistic phenomena can be found. Through the PSFA method, we finally extract some representative features which cover different aspects of styles. To verify the performance of these selected features, classification experiments are conducted. The results show that the elements selected by the PSFA method provide a significantly higher classification accuracy than other advanced methods.

Haiyan Wu, Ying Liu
Addressing Domain Adaptation for Chinese Word Segmentation with Instances-Based Transfer Learning

Recent studies have shown the effectiveness of neural networks for Chinese Word Segmentation (CWS). However, these models, constrained by the domain and size of the training corpus, do not work well in domain adaptation. In this paper, we propose a novel instance-transferring method, which uses valuable annotated instances from the target domain to improve CWS across domains. Specifically, we introduce semantic similarity computation based on character n-gram embeddings to select instances. Furthermore, training sentences similar to these instances are used to help annotate them. Experimental results show that our method can effectively boost cross-domain segmentation performance. We achieve state-of-the-art results on Internet literature datasets, and results competitive with the best reported on micro-blog datasets.
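
As a sketch of how character n-gram embeddings can drive instance selection (a plain NumPy illustration under our own assumptions; ngram_emb is a hypothetical dict from n-gram strings to vectors):

    import numpy as np

    def sent_vector(sent, ngram_emb, n=2):
        """Average the character n-gram embeddings of a sentence."""
        grams = [sent[i:i + n] for i in range(len(sent) - n + 1)]
        vecs = [ngram_emb[g] for g in grams if g in ngram_emb]
        return np.mean(vecs, axis=0) if vecs else None

    def select_instances(target_sents, source_sents, ngram_emb, top_k=100):
        """Rank source-domain sentences by cosine similarity to the target domain."""
        tgt_vecs = [v for s in target_sents
                    if (v := sent_vector(s, ngram_emb)) is not None]
        tgt = np.mean(tgt_vecs, axis=0)
        def cos(a, b):
            return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
        scored = [(s, cos(tgt, v)) for s in source_sents
                  if (v := sent_vector(s, ngram_emb)) is not None]
        return sorted(scored, key=lambda x: -x[1])[:top_k]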

Yanna Zhang, Jinan Xu, Guoyi Miao, Yufeng Chen, Yujie Zhang

Machine Translation

Frontmatter
Collaborative Matching for Sentence Alignment

Existing sentence alignment methods are founded fundamentally on sentence length and lexical correspondences. Methods based on the former generally follow the length proportionality assumption that the lengths of sentences in one language tend to be proportional to those of their translations, and are known to adapt poorly to new languages and corpora. In this paper, we attempt to interpret this assumption from a new perspective via the notion of collaborative matching, based on the observation that sentences can work collaboratively during alignment rather than separately as in previous studies. Our approach is intended to be independent of any specific language or corpus, so that it can be adaptively applied to a variety of texts without being bound to any prior knowledge about them. We use one-to-one sentence alignment to illustrate this approach and implement two specific alignment methods, which are evaluated on six bilingual corpora of different languages and domains. Experimental results confirm the effectiveness of this collaborative matching approach.
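
For reference, the length-proportionality baseline the paper departs from can be sketched as a dynamic program that pairs sentences one-to-one under a length-ratio cost (a simplified heuristic of our own, not the paper's collaborative method; skip_cost is a hypothetical penalty):

    def align_by_length(src, tgt, skip_cost=1.0):
        """One-to-one sentence alignment by length proportionality (DP sketch)."""
        ratio = sum(map(len, tgt)) / max(1, sum(map(len, src)))
        m, n, INF = len(src), len(tgt), float("inf")
        cost = [[INF] * (n + 1) for _ in range(m + 1)]
        cost[0][0] = 0.0
        for i in range(m + 1):
            for j in range(n + 1):
                if cost[i][j] == INF:
                    continue
                if i < m and j < n:  # pair src[i] with tgt[j]
                    c = abs(len(tgt[j]) - ratio * len(src[i])) / (len(src[i]) + 1)
                    cost[i + 1][j + 1] = min(cost[i + 1][j + 1], cost[i][j] + c)
                if i < m:            # leave src[i] unaligned
                    cost[i + 1][j] = min(cost[i + 1][j], cost[i][j] + skip_cost)
                if j < n:            # leave tgt[j] unaligned
                    cost[i][j + 1] = min(cost[i][j + 1], cost[i][j] + skip_cost)
        return cost[m][n]  # backtracking would recover the alignment itself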

Xiaojun Quan, Chunyu Kit, Wuya Chen
Finding Better Subword Segmentation for Neural Machine Translation

For different language pairs, word-level neural machine translation (NMT) models with a fixed-size vocabulary suffer from the same problem of representing out-of-vocabulary (OOV) words. The common practice replaces all these rare or unknown words with a ⟨UNK⟩ token, which limits translation performance to some extent. Most recent work has handled this problem by splitting words into characters or other specially extracted subword units to enable open-vocabulary translation. Byte pair encoding (BPE) is one of the successful attempts, shown to be extremely competitive by providing effective subword segmentation for NMT systems. In this paper, we extend BPE-style segmentation to a general unsupervised framework with three statistical measures: frequency (FRQ), accessor variety (AV) and description length gain (DLG). We test our approach on two translation tasks: German to English and Chinese to English. The experimental results show that the AV- and DLG-enhanced systems outperform the FRQ baseline in the frequency-weighted schemes at different significance levels.
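
A minimal sketch of BPE merge learning with the frequency (FRQ) criterion, in the style of Sennrich et al.; in the paper's framework, AV or DLG would replace the pair-scoring line marked below (this is our illustration, not the authors' code):

    import re
    from collections import Counter

    def learn_bpe(words, num_merges):
        """Learn BPE merges from a list of words (symbols are space-separated)."""
        vocab = Counter(" ".join(w) for w in words)
        merges = []
        for _ in range(num_merges):
            pairs = Counter()
            for word, freq in vocab.items():
                syms = word.split()
                for a, b in zip(syms, syms[1:]):
                    pairs[(a, b)] += freq
            if not pairs:
                break
            best = max(pairs, key=pairs.get)  # FRQ criterion: most frequent pair
            merges.append(best)
            pat = re.compile(r"(?<!\S)" + re.escape(" ".join(best)) + r"(?!\S)")
            vocab = Counter({pat.sub("".join(best), w): f for w, f in vocab.items()})
        return merges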

Yingting Wu, Hai Zhao
Improving Low-Resource Neural Machine Translation with Weight Sharing

Neural machine translation (NMT) has achieved great success in the past few years given large bilingual corpora, but it is much less effective for low-resource languages. To alleviate this problem, we present two approaches that improve the performance of low-resource NMT systems. The first employs weight sharing in the decoder to enhance the target-side language model of the low-resource NMT system. The second applies cross-lingual embedding and shared source sentence representation spaces to strengthen the encoder. Our experiments demonstrate that the proposed methods obtain significant improvements over the baseline system. On the IWSLT2015 Vietnamese-English translation task, our model improves translation quality by an average of 1.43 BLEU points, and we gain a further 0.96 BLEU points when translating from Mongolian to Chinese.
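
One common form of decoder weight sharing is tying the decoder's input embedding with its output projection, sketched below in PyTorch (an assumption for illustration; the paper's exact sharing scheme may differ):

    import torch.nn as nn

    class TiedDecoder(nn.Module):
        """Decoder whose embedding matrix doubles as the output projection."""
        def __init__(self, vocab_size, dim):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, dim)
            self.rnn = nn.GRU(dim, dim, batch_first=True)
            self.proj = nn.Linear(dim, vocab_size, bias=False)
            self.proj.weight = self.emb.weight  # one matrix, two roles

        def forward(self, prev_tokens, state=None):
            out, state = self.rnn(self.emb(prev_tokens), state)
            return self.proj(out), state  # (batch, seq, vocab) logits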

Tao Feng, Miao Li, Xiaojun Liu, Yichao Cao
Identifying Word Translations in Scientific Literature Based on Labeled Bilingual Topic Model and Co-occurrence Features

To exploit the increasingly rich multilingual resources and multi-label data in scientific literature, and to mine the relevance and correlation across languages, this paper proposes a labeled bilingual topic model and a co-occurrence-feature-based similarity metric for the word translation identification task. First, assuming that the keywords of a scientific article are relevant to its abstract, we extract the keywords and regard them as labels; topics are assigned to the labels, instantiating the "latent" topics. Second, the abstracts are trained with the labeled bilingual topic model to obtain word representations over the topic distribution. Finally, the most similar words between the two languages are matched with the similarity metric proposed in this paper. The experimental results show that the labeled bilingual topic model reaches better precision than a bilingual model based on "latent" topics, and that co-occurrence features strengthen the attraction between bilingual word pairs and improve identification.

Mingjie Tian, Yahui Zhao, Rongyi Cui
Term Translation Extraction from Historical Classics Using Modern Chinese Explanation

Extracting term translation pairs is of great help for the translation of Chinese historical classics, since term translation is its most time-consuming and challenging part. However, it is tough to recognize terms directly in ancient Chinese due to its flexible syntax, and word segmentation errors in ancient Chinese lead to further errors in term translation extraction. Considering that most terms in ancient Chinese are preserved in modern Chinese, where they are easier to identify, we propose a multi-feature, character-based term translation extraction method that extracts historical term translation pairs from modern Chinese-English corpora instead of ancient Chinese-English corpora. Specifically, we first employ a character-based BiLSTM-CRF model to identify historical terms in modern Chinese without word segmentation, which prevents segmentation errors from propagating to term alignment. Then we extract English terms according to initial-capitalization rules. Finally, we align the English and Chinese terms based on co-occurrence frequency and a transliteration feature. An experiment on Shiji demonstrates that the proposed method far outperforms the traditional method, confirming the effectiveness of using modern Chinese as a substitute.
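
The character-based tagging step can be pictured with the PyTorch skeleton below; the CRF layer of the paper's BiLSTM-CRF is omitted for brevity and a per-character softmax stands in for it (our simplification):

    import torch.nn as nn

    class CharBiLSTMTagger(nn.Module):
        """Character-level BiLSTM tagger emitting BIO-style tag logits."""
        def __init__(self, n_chars, n_tags, dim=128):
            super().__init__()
            self.emb = nn.Embedding(n_chars, dim)
            self.lstm = nn.LSTM(dim, dim // 2, bidirectional=True,
                                batch_first=True)
            self.out = nn.Linear(dim, n_tags)

        def forward(self, chars):  # chars: (batch, seq_len) character ids
            h, _ = self.lstm(self.emb(chars))
            return self.out(h)     # (batch, seq_len, n_tags)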

Xiaoting Wu, Hanyu Zhao, Chao Che
Research on Chinese-Tibetan Neural Machine Translation

At present, research on Tibetan machine translation is mainly focused on the Tibetan-Chinese direction, while research on Chinese-Tibetan machine translation is almost untouched. In this paper, a neural machine translation model is applied to the Chinese-Tibetan task for the first time, the syntax tree is also introduced into the Chinese-Tibetan neural machine translation model for the first time, and good translation quality is achieved. Besides, we preprocess the corpora with syllable segmentation on the Tibetan side and character segmentation on the Chinese side, which performs better than word segmentation on both sides. The experimental results show that the translation model based entirely on the self-attention mechanism performs best on the Chinese-Tibetan task, increasing the BLEU score by one percentage point.
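
The two preprocessing steps are simple to state: Tibetan syllables are delimited by the tsheg mark (U+0F0B), and Chinese character segmentation is one token per character. A bare-bones sketch (punctuation handling omitted, as an assumption):

    def segment_tibetan_syllables(text):
        """Split Tibetan text on the tsheg mark, which separates syllables."""
        return [s for s in text.split("\u0f0b") if s]

    def segment_chinese_characters(text):
        """Character-level segmentation for Chinese."""
        return list(text)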

Wen Lai, Xiaobing Zhao, Xiaqing Li

Knowledge Graph and Information Extraction

Frontmatter
Metadata Extraction for Scientific Papers

Metadata extraction for scientific literature is to automatically annotate each paper with metadata that represents its most valuable information, including problem, method and dataset. Most existing work extracts keywords or key phrases as concepts for further analysis without their fine-grained types. In this paper, we present a three-stage supervised method to address the problem. The first stage extracts key phrases as metadata candidates, and the second introduces various features, i.e., statistical features, linguistic features, position features and a novel fine-grained distribution feature highly relevant to the metadata categories, to type the candidates into the three foregoing categories. In the evaluation, we conduct extensive experiments on a manually labeled dataset from the ACL Anthology, and the results show that our proposed method achieves a +3.2% improvement in accuracy over strong baseline methods.

Binjie Meng, Lei Hou, Erhong Yang, Juanzi Li
Knowledge Graph Embedding with Logical Consistency

Existing methods for knowledge graph embedding do not ensure that the high-rank triples they predict are as consistent as possible with the logical background, which is made up of a knowledge graph and a logical theory. Users must take great effort to filter consistent triples before adding new triples to the knowledge graph. To alleviate this burden, we propose an approach to enhancing existing embedding-based methods to encode logical consistency into the learnt distributed representation of the knowledge graph, enforcing high-rank new triples to be as consistent as possible. To evaluate this approach, four knowledge graphs with logical theories are constructed from the four great classical masterpieces of Chinese literature. Experimental results on these datasets show that our approach is able to keep high-rank triples as consistent as possible while preserving performance comparable to baseline methods in link prediction and triple classification.
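
To make the idea concrete, here is one way a consistency term could be attached to a translational embedding loss such as TransE (a hypothetical formulation of ours, not the paper's actual objective; lam is an assumed trade-off weight):

    import torch

    def transe_score(h, r, t):
        """TransE plausibility: higher (less negative) means more plausible."""
        return -torch.norm(h + r - t, p=1, dim=-1)

    def loss_with_consistency(pos, neg, inconsistent, margin=1.0, lam=0.5):
        """Margin ranking loss plus a penalty that pushes down the scores of
        triples flagged inconsistent by the logical theory (sketch)."""
        rank = torch.clamp(margin - pos + neg, min=0).mean()
        penalty = torch.clamp(margin + inconsistent, min=0).mean()
        return rank + lam * penalty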

Jianfeng Du, Kunxun Qi, Yuming Shen
An End-to-End Entity and Relation Extraction Network with Multi-head Attention

Relation extraction is an important semantic processing task in natural language processing. State-of-the-art systems usually rely on elaborately designed features, which are time-consuming to craft and may lead to poor generalization. Besides, most existing systems adopt pipeline methods, which treat the task as two separate tasks, i.e., named entity recognition and relation extraction. However, pipeline methods suffer from two problems: (1) they over-simplify the task into two independent parts, and (2) errors accumulate from named entity recognition to relation extraction. Therefore, we present a novel joint model for entity and relation extraction based on multi-head attention, which avoids the problems of pipeline methods and reduces the dependence on feature engineering. Experimental results show that our model achieves good performance without extra features, reaching an F-score of 85.7% on the SemEval-2010 Task 8 relation extraction dataset, competitive with previous joint models despite using no extra features. Upon publication, the code will be made publicly available.
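
The multi-head attention building block itself is standard; a minimal PyTorch usage sketch (sizes are assumptions) that contextualizes token encodings before any entity or relation classifiers:

    import torch
    import torch.nn as nn

    mha = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
    x = torch.randn(4, 30, 256)   # (batch, seq_len, hidden) token encodings
    ctx, weights = mha(x, x, x)   # self-attention: query = key = value = x
    print(ctx.shape)              # torch.Size([4, 30, 256])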

Lishuang Li, Yuankai Guo, Shuang Qian, Anqiao Zhou
Attention-Based Convolutional Neural Networks for Chinese Relation Extraction

Relation extraction is an important part of many information extraction systems that mine structured facts from texts. Recently, deep learning has achieved good results in relation extraction, and attention mechanisms are gradually being applied to networks, improving the performance of the task. However, current attention mechanisms are mainly applied to basic lexical-level features rather than higher-level overall features. To obtain more information from high-level features for relation prediction, we propose attention-based piecewise convolutional neural networks (PCNN_ATT), which add an attention layer after the piecewise max-pooling layer to capture significant information in the global sentence features. Furthermore, we put forward a data extension method utilizing the external dictionary HIT IR-Lab Tongyici Cilin (Extended). Experimental results on the ACE-2005 and COAE-2016 Chinese datasets both demonstrate that our approach outperforms most existing methods.

Wenya Wu, Yufeng Chen, Jinan Xu, Yujie Zhang
A Study on Improving End-to-End Neural Coreference Resolution

This paper studies methods to improve end-to-end neural coreference resolution. First, we introduce a coreference cluster modification algorithm, which helps rule out dissimilar mentions in a cluster and reduces errors caused by the global inconsistency of coreference clusters. Additionally, we tune the model in two respects to obtain more accurate coreference resolution results. On one hand, the simple scoring function is replaced with a feed-forward neural network when computing the head word scores for the later attention mechanism, which helps pick out the most important word. On the other hand, the maximum width of a mention is tuned. Our experimental results show that the above methods effectively improve coreference resolution performance.

Jia-Chen Gu, Zhen-Hua Ling, Nitin Indurkhya
Type Hierarchy Enhanced Heterogeneous Network Embedding for Fine-Grained Entity Typing in Knowledge Bases

Type information is very important in knowledge bases, but some large knowledge bases lack type information due to their incompleteness. In this paper, we propose to use a well-defined taxonomy to help complete the type information in such knowledge bases. In particular, we present a novel embedding-based hierarchical entity typing framework which uses a learning-to-rank algorithm to enhance the performance of word-entity-type network embedding. In this way, we can take full advantage of both labeled and unlabeled data. Extensive experiments on two real-world datasets from DBpedia show that our proposed method significantly outperforms 4 state-of-the-art methods, with 2.8% and 4.2% improvements in Mi-F1 and Ma-F1 respectively.

Hailong Jin, Lei Hou, Juanzi Li
Scientific Keyphrase Extraction: Extracting Candidates with Semi-supervised Data Augmentation

Keyphrase extraction can provide effective ways of organizing scientific documents. For this task, neural-based methods usually suffer from performance instability due to data scarcity. In this paper, we adopt a two-step pipeline method consisting of candidate extraction and keyphrase ranking, where candidate extraction is key to overall performance. In the candidate extraction step, to overcome the low recall of traditional rule-based methods, we propose a novel semi-supervised data augmentation method, where a neural-based tagging model and a discriminative classifier boost each other and yield more confident phrases as candidates. With more reasonable candidates, keyphrases are identified with improved recall. Experiments on SemEval 2017 Task 10 show that our model achieves competitive results.

Qianying Liu, Daisuke Kawahara, Sujian Li

Linguistic Resource Annotation and Evaluation

Frontmatter
Using a Chinese Lexicon to Learn Sense Embeddings and Measure Semantic Similarity

Word embeddings have recently been widely used to model words in Natural Language Processing (NLP) tasks, including semantic similarity measurement. However, word embeddings are not able to capture polysemy, because a polysemous word is represented by a single vector. To address this problem, learning multiple embedding vectors for different senses of a word is necessary and intuitive. We present a novel approach based on a Chinese lexicon to learn sense embeddings. Every sense is represented by a vector that consists of the semantic contributions made by the senses explaining it. To make full use of the lexicon's advantages and address its drawbacks, we perform representation expansion to make sparse embedding vectors dense, and disambiguate polysemous words in glosses by semantic contribution allocation. Thanks to an intuitive noise-filtering step, we achieve noticeable improvement both in dimensionality reduction and semantic similarity measurement. We perform experiments on a translated version of the Miller-Charles dataset and report state-of-the-art performance on semantic similarity measurement. We also apply our approach to SemEval-2012 Task 4: Evaluating Chinese Word Similarity, which uses a translated version of wordsim353 as the standard dataset, and our approach also noticeably outperforms conventional approaches.

Zhuo Zhen, Yuquan Chen
Revisiting Correlations between Intrinsic and Extrinsic Evaluations of Word Embeddings

The evaluation of word embeddings has received a considerable amount of attention in recent years, but there has been some debate about whether intrinsic measures can predict the performance of downstream tasks. To investigate this question, this paper presents the first study of the correlation between intrinsic and extrinsic evaluation results for Chinese word embeddings. We use word similarity and word analogy as the intrinsic tasks, and Named Entity Recognition and Sentiment Classification as the extrinsic tasks. A variety of Chinese word embeddings trained with different corpora and context features are used in the experiments. From the data analysis, we reach some interesting conclusions: there are strong correlations between intrinsic and extrinsic evaluations, and the performance of different tasks can be affected by training corpora and context features to varying degrees.
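
The standard intrinsic word-similarity protocol used in such studies can be sketched in a few lines (our illustration; emb is a hypothetical word-to-vector dict):

    import numpy as np
    from scipy.stats import spearmanr

    def word_similarity_eval(pairs_with_gold, emb):
        """Spearman correlation between human ratings and cosine similarities."""
        gold, pred = [], []
        for w1, w2, score in pairs_with_gold:
            if w1 in emb and w2 in emb:
                v1, v2 = emb[w1], emb[w2]
                gold.append(score)
                pred.append(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
        return spearmanr(gold, pred).correlation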

Yuanyuan Qiu, Hongzheng Li, Shen Li, Yingdi Jiang, Renfen Hu, Lijiao Yang

Information Retrieval and Question Answering

Frontmatter
Question-Answering Aspect Classification with Hierarchical Attention Network

In e-commerce websites, user-generated question-answering text pairs generally contain rich aspect information about products. In this paper, we address a new task, namely Question-Answering (QA) aspect classification, which aims to automatically classify the aspect category of a given QA text pair. In particular, we build a high-quality annotated corpus with specifically designed annotation guidelines for QA aspect classification. On this basis, we propose a hierarchical attention network to address the specific challenges of this new task in three stages. Specifically, we first segment both the question text and the answer text into sentences, and construct (sentence, sentence) units for each QA text pair. Second, we leverage a QA matching attention layer to encode these (sentence, sentence) units in order to capture the aspect matching information between sentences of the question text and sentences of the answer text. Finally, we leverage a self-matching attention layer to capture the different degrees of importance of the (sentence, sentence) units in each QA text pair. Experimental results demonstrate that our proposed hierarchical attention network outperforms some strong baselines for QA aspect classification.

Hanqian Wu, Mumu Liu, Jingjing Wang, Jue Xie, Chenlin Shen
End-to-End Task-Oriented Dialogue System with Distantly Supervised Knowledge Base Retriever

Task-oriented dialogue systems usually face the challenge of querying a knowledge base (KB), which typically cannot be modeled explicitly due to the lack of annotation. In this paper, we introduce an explicit KB retrieval component (KB retriever) into a seq2seq dialogue system. We first use the KB retriever to get the most relevant entry according to the dialogue history and the KB, and then apply a copying mechanism to retrieve entities from the retrieved entry at decoding time. Moreover, the KB retriever is trained with distant supervision, which requires no annotation effort. Experiments on the Stanford Multi-turn Task-oriented Dialogue Dataset show that our framework significantly outperforms other sequence-to-sequence baseline models on both automatic and human evaluation.

Libo Qin, Yijia Liu, Wanxiang Che, Haoyang Wen, Ting Liu
Attention-Based CNN-BLSTM Networks for Joint Intent Detection and Slot Filling

Dialogue intent detection and semantic slot filling are two critical tasks in natural language understanding (NLU) for task-oriented dialogue systems. In this paper, we present an attention-based encoder-decoder neural network model for joint intent detection and slot filling, which encodes the sentence representation with hybrid Convolutional Neural Networks and Bidirectional Long Short-Term Memory Networks (CNN-BLSTM), and decodes it with an attention-based recurrent neural network with aligned inputs. In the encoding process, our model first extracts higher-level phrase representations and local features from each utterance using a convolutional neural network, and then propagates historical contextual semantic information with a bidirectional long short-term memory layer; the sentence representation is obtained by merging the two architectures. In the decoding process, we introduce an attention mechanism into the long short-term memory networks to provide additional semantic information. We conduct experiments on intent detection and slot filling with the standard Airline Travel Information System (ATIS) dataset. Experimental results show that our proposed model achieves better overall performance.

Yufan Wang, Li Tang, Tingting He
Multi-Perspective Fusion Network for Commonsense Reading Comprehension

Commonsense Reading Comprehension (CRC) is a significantly challenging task that aims to choose the right answer to a question about a narrative passage, which may require commonsense knowledge inference. Most existing approaches only fuse the interaction information of choice, passage, and question in a simple combination manner from a union perspective, lacking comparison information at a deeper level. Instead, we propose a Multi-Perspective Fusion Network (MPFN), extending the single fusion method with multiple perspectives by introducing difference and similarity fusion. More comprehensive and accurate information can be captured through the three types of fusion. We design several groups of experiments on the MCScript dataset [11] to evaluate the effectiveness of each fusion type. From the experimental results, we conclude that difference fusion is comparable with union fusion, while similarity fusion needs to be activated by union fusion. The results also show that our MPFN model achieves the state of the art with an accuracy of 83.52% on the official test set.
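
The three fusion perspectives have simple canonical forms, sketched below (our reading of union/difference/similarity as concatenation, subtraction and element-wise product; the paper's exact combination may differ):

    import torch

    def multi_perspective_fusion(a, b):
        """Fuse two encodings from union, difference and similarity perspectives."""
        union = torch.cat([a, b], dim=-1)  # union: keep both views
        diff = a - b                       # difference: what separates them
        sim = a * b                        # similarity: what they share
        return torch.cat([union, diff, sim], dim=-1)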

Chunhua Liu, Yan Zhao, Qingyi Si, Haiou Zhang, Bohan Li, Dong Yu

Text Classification and Summarization

Frontmatter
A Hierarchical Hybrid Neural Network Architecture for Chinese Text Summarization

Using sequence-to-sequence models for abstractive text summarization is generally plagued by three problems: inability to deal with out-of-vocabulary words, repetition in summaries, and time-consuming training. This paper proposes a hierarchical hybrid neural network architecture for Chinese text summarization. Three mechanisms, hierarchical attention, a pointer mechanism and a coverage mechanism, are integrated into the architecture to improve summarization performance. The proposed model is applied to Chinese news headline generation. The experimental results suggest that the model outperforms the baseline in ROUGE scores and that the three mechanisms improve the quality of summaries.

Yunheng Zhang, Leihan Zhang, Ke Xu, Le Zhang
TSABCNN: Two-Stage Attention-Based Convolutional Neural Network for Frame Identification

As an essential sub-task of frame-semantic parsing, Frame Identification (FI) is a fundamentally important research topic in shallow semantic parsing. However, most existing work is based on sophisticated, hand-crafted features which might not be compatible with the FI procedure, and usually relies heavily on available natural language processing (NLP) toolkits and various lexical resources; such methods may therefore not achieve satisfactory performance. In this paper, we propose a two-stage attention-based convolutional neural network (TSABCNN) to alleviate this problem and capture the most important context features for the FI task. To dynamically adjust the weight of each feature, we build two levels of attention over instances, at the input layer and the pooling layer respectively. Furthermore, the proposed model is an end-to-end learning framework which does not need any complicated NLP toolkits or feature engineering, and can be applied to any language. Experimental results on FrameNet and Chinese FrameNet (CFN) show the effectiveness of the proposed approach for the FI task.

Hongyan Zhao, Ru Li, Fei Duan, Zepeng Wu, Shaoru Guo
Linked Document Classification by Network Representation Learning

Network Representation Learning (NRL) learns a latent-space representation of each vertex in a network that reflects its linked information. Recently, NRL algorithms have been applied to obtain document embeddings in linked document networks, such as citation websites. However, most existing document representation methods based on NRL are unsupervised and cannot combine NRL with a concrete task-specific NLP task. In this paper, we propose a unified end-to-end hybrid Linked Document Classification (LDC) model which captures both the semantic features and the topological structure of documents to improve document classification. In addition, we investigate a more flexible strategy for capturing structural similarity, improving on the traditional rigid extraction of linked document topology. The experimental results suggest that our proposed model outperforms other document classification methods, especially when training data is limited.

Yue Zhang, Liying Zhang, Yao Liu
A Word Embedding Transfer Model for Robust Text Categorization

It is common to fine-tune pre-trained word embeddings for text categorization. However, we find that fine-tuning does not guarantee improvement across text categorization datasets, while it can add a considerable number of parameters to the model. In this paper, we study new transfer methods to solve these problems, and propose "Robustness of OOVs" as a perspective for further reducing memory consumption. The experimental results show that the proposed method is a good alternative to fine-tuning on large datasets.
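
The no-fine-tuning alternative is easy to picture: load the pre-trained vectors frozen, so the embedding table contributes no trainable parameters (PyTorch sketch; the random tensor stands in for real pre-trained vectors):

    import torch
    import torch.nn as nn

    pretrained = torch.randn(50000, 300)  # stand-in for real pre-trained vectors
    emb = nn.Embedding.from_pretrained(pretrained, freeze=True)
    trainable = sum(p.numel() for p in emb.parameters() if p.requires_grad)
    print(trainable)  # 0: the frozen embedding adds no parameters to the model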

Yiming Zhang, Jing Wang, Weijian Deng, Yaojie Lu
Review Headline Generation with User Embedding

In this paper, we study a review headline generation task that produces a short headline from a review posted by a user. We argue that this task is more challenging than document summarization, because the headlines written by users vary from person to person: it not only needs to effectively capture the preferences of the users who post the reviews, but also requires mining what each user emphasizes about the review when writing the headline. To this end, we propose to incorporate user information as prior knowledge into the encoder and decoder of a general sequence-to-sequence model. Specifically, we introduce an embedding for each user, and use these embeddings to initialize the encoder and decoder, or as biases for decoder initialization. We construct a review headline generation dataset, and experiments on this dataset demonstrate that our models significantly outperform baseline models that do not consider user information.

Tianshang Liu, Haoran Li, Junnan Zhu, Jiajun Zhang, Chengqing Zong

Social Computing and Sentiment Analysis

Frontmatter
A Joint Model for Sentiment Classification and Opinion Words Extraction

In recent years, mining opinions from customer reviews has been widely explored. Aspect-level sentiment analysis is a fine-grained subtask which aims to detect the sentiment polarity towards a particular target in a sentence. While most previous work focuses on sentiment polarity classification, the opinion words towards the target are also very important, because they provide details about the target and contribute to judging polarity. To this end, we propose a hierarchical network for jointly modeling aspect-level sentiment classification and word-level opinion word extraction. Our joint model acquires superior performance in opinion word extraction and achieves comparable results in sentiment polarity classification on two datasets from SemEval 2014.

Dawei Cong, Jianhua Yuan, Yanyan Zhao, Bing Qin
Network Representation Learning Based on Community and Text Features

Network representation learning (NRL) aims at building a low-dimensional vector for each vertex in a network, which is also increasingly recognized as an important aspect for network analysis. Some current NRL methods only focus on learning representations using the network structure. However, vertices in lots of networks may contain community information or text contents, which could be good for relevant evaluation tasks, such as vertex classification, link prediction and so on. Since it has been proved that DeepWalk is actually equivalent to matrix factorization, we propose community and text-enhanced DeepWalk (CTDW) based on the inductive matrix completion algorithm, which incorporates community features and text features of vertices into NRL under the framework of matrix factorization. In experiments, we evaluate the proposed CTDW compared with other state-of-the-art methods on vertex classification. The experimental results demonstrate that CTDW outperforms other baseline methods on three real-world datasets.

Yu Zhu, Zhonglin Ye, Haixing Zhao, Ke Zhang

NLP Applications

Frontmatter
Learning to Detect Verbose Expressions in Spoken Texts

The analysis and understanding of spoken texts is an important task in artificial intelligence and natural language processing. However, spoken texts contain many verbose expressions (such as mantras, nonsense, modal particles, etc.), which pose great challenges to downstream tasks. This paper is devoted to detecting verbose expressions in spoken texts. Considering the correlation of verbose words/characters in spoken texts, we adapt sequence models to detect them in an end-to-end manner. Moreover, we propose a model with long short-term memory (LSTM) and a modified restricted attention (MRA) mechanism that exploits the mutual influence between long-distance and local words in sentences. In addition, we propose a compare mechanism to model repetitive verbose expressions. The experimental results show that, compared with rule-based and direct classification methods, our proposed model increases the F1 measure by 54.08% and 18.91% respectively.

Qingbin Liu, Shizhu He, Kang Liu, Shengping Liu, Jun Zhao
Medical Knowledge Attention Enhanced Neural Model for Named Entity Recognition in Chinese EMR

Named entity recognition (NER) in Chinese electronic medical records (EMRs) has become an important task in clinical natural language processing (NLP). However, limited studies have addressed clinical NER in Chinese EMRs. Furthermore, while end-to-end neural network models have improved clinical NER performance, medical knowledge dictionaries such as disease association dictionaries, which provide rich information about medical entities and the relations among them, are rarely utilized in NER models. In this study, we investigate NER in Chinese EMRs and propose a clinical neural network NER model enhanced with medical knowledge attention, combining entity mention information from external medical knowledge bases with EMR context. Experimental results on a manually labeled dataset demonstrate that the proposed method achieves better performance than previous methods in most cases.

Zhichang Zhang, Yu Zhang, Tong Zhou
Coherence-Based Automated Essay Scoring Using Self-attention

Automated essay scoring aims to score an essay automatically without any human assistance. Traditional methods heavily rely on manual feature engineering, making it expensive to extract the features. Some recent studies used neural-network-based scoring models to avoid feature engineering. Most of them used CNN or RNN to learn the representation of the essay. Although these models can cope with relationships between words within a short distance, they are limited in capturing long-distance relationships across sentences. In particular, it is difficult to assess the coherence of the essay, which is an essential criterion in essay scoring. In this paper, we use self-attention to capture useful long-distance relationships between words so as to estimate a coherence score. We tested our model on two datasets (ASAP and a new non-native speaker dataset). In both cases, our model outperforms the existing state-of-the-art models.
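
The mechanism at the core of such models is standard scaled dot-product self-attention, which connects any two positions in the essay directly; a minimal sketch with the usual projections omitted for brevity (our simplification):

    import math
    import torch

    def self_attention(x):
        """x: (seq_len, dim) token vectors; returns coherence-aware vectors."""
        scores = x @ x.T / math.sqrt(x.size(-1))  # pairwise relevance, any distance
        weights = torch.softmax(scores, dim=-1)   # each row attends over all tokens
        return weights @ x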

Xia Li, Minping Chen, Jianyun Nie, Zhenxing Liu, Ziheng Feng, Yingdan Cai
Trigger Words Detection by Integrating Attention Mechanism into Bi-LSTM Neural Network—A Case Study in PubMED-Wide Trigger Words Detection for Pancreatic Cancer

This research studies a Bi-LSTM based encoder/decoder mechanism for named entity recognition. In the proposed mechanism, a Bi-LSTM is used for encoding, an attention method is used in the intermediate layers, and a unidirectional LSTM is used as the decoder layer. By using an element-wise product to modify the conventional decoder layers, the proposed model achieves a better F-score than three baseline LSTM-based models. For the purpose of application, a case study of causal gene discovery in terms of disease pathway enrichment was designed, and the causal gene discovery rate of the proposed method was compared with baseline methods. The results show that trigger word detection effectively increases the performance of a text mining system for causal gene discovery.

Kaiyin Zhou, Xinzhi Yao, Shuguang Wang, Jin-Dong Kim, Kevin Bretonnel Cohen, Ruiying Chen, Yuxing Wang, Jingbo Xia
Backmatter
Metadata
Title
Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data
Edited by
Maosong Sun
Ting Liu
Xiaojie Wang
Zhiyuan Liu
Yang Liu
Copyright Year
2018
Electronic ISBN
978-3-030-01716-3
Print ISBN
978-3-030-01715-6
DOI
https://doi.org/10.1007/978-3-030-01716-3