
2017 | Book

Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data

16th China National Conference, CCL 2017, and 5th International Symposium, NLP-NABD 2017, Nanjing, China, October 13-15, 2017, Proceedings


About this book

This book constitutes the proceedings of the 16th China National Conference on Computational Linguistics, CCL 2017, and the 5th International Symposium on Natural Language Processing Based on Naturally Annotated Big Data, NLP-NABD 2017, held in Nanjing, China, in October 2017.

The 39 full papers presented in this volume were carefully reviewed and selected from 272 submissions. They were organized in topical sections named: Fundamental theory and methods of computational linguistics; Machine translation and multilingual information processing; Knowledge graph and information extraction; Language resource and evaluation; Information retrieval and question answering; Text classification and summarization; Social computing and sentiment analysis; NLP applications; Minority language information processing.

Table of Contents

Frontmatter

Fundamental Theory and Methods of Computational Linguistics

Frontmatter
Arabic Collocation Extraction Based on Hybrid Methods

Collocation extraction plays an important role in machine translation, information retrieval, second language learning, and related areas, and has achieved significant results for languages such as English and Chinese. Previous studies on Arabic collocation extraction have relied on POS annotation alone. We use a hybrid method that combines POS patterns and syntactic dependency relations as linguistic information with statistical measures to extract collocations from an Arabic corpus. Experimental results show that the hybrid method yields a higher precision rate, which rises further, to 85.11%, once dependency relations are added as linguistic filtering rules. The method also achieves higher precision than relying on syntactic dependency analysis alone.

Alaa Mamdouh Akef, Yingying Wang, Erhong Yang
Employing Auto-annotated Data for Person Name Recognition in Judgment Documents

Over the last decades, named entity recognition has been studied extensively with supervised learning approaches that depend on massive amounts of labeled data. In this paper, we focus on person name recognition in judgment documents. Owing to the lack of human-annotated data, we propose a joint learning approach, namely Aux-LSTM, which lets a large amount of auto-annotated data assist a small amount of human-annotated data for person name recognition. Specifically, our approach first develops an auxiliary Long Short-Term Memory (LSTM) representation by training on the auto-annotated data, and then leverages this auxiliary LSTM representation to boost the performance of the classifier trained on the human-annotated data. Empirical studies demonstrate the effectiveness of our proposed approach to person name recognition in judgment documents with both human-annotated and auto-annotated data.

Limin Wang, Qian Yan, Shoushan Li, Guodong Zhou
Closed-Set Chinese Word Segmentation Based on Convolutional Neural Network Model

This paper proposes a neural model for closed-set Chinese word segmentation. The model follows the character-based approach, which assigns a class label to each character indicating its relative position within the word it belongs to. To do so, it first constructs shallow representations of characters by fusing unigram and bigram information in a limited context window via an element-wise maximum operator, and then builds up deep representations from wider contextual information with a deep convolutional network. Experimental results show that our method achieves better closed-set performance than several state-of-the-art systems.

Zhipeng Xie
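The fusion step in this abstract, combining unigram and bigram information through an element-wise maximum, is easy to picture in code. Below is a minimal PyTorch sketch; the vocabulary sizes, the 100-dimensional embeddings and the class name ShallowCharRepr are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ShallowCharRepr(nn.Module):
    """Fuse unigram and bigram embeddings with an element-wise maximum."""
    def __init__(self, n_unigrams, n_bigrams, dim=100):
        super().__init__()
        self.uni_emb = nn.Embedding(n_unigrams, dim)
        self.bi_emb = nn.Embedding(n_bigrams, dim)

    def forward(self, uni_ids, bi_ids):
        # uni_ids, bi_ids: (batch, seq_len) index tensors
        u = self.uni_emb(uni_ids)   # (batch, seq_len, dim)
        b = self.bi_emb(bi_ids)    # (batch, seq_len, dim)
        return torch.max(u, b)     # element-wise maximum fusion

repr_layer = ShallowCharRepr(n_unigrams=5000, n_bigrams=200000)
uni = torch.randint(0, 5000, (2, 10))
bi = torch.randint(0, 200000, (2, 10))
print(repr_layer(uni, bi).shape)   # torch.Size([2, 10, 100])
```

The deep convolutional layers of the paper would then consume this fused representation.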
Improving Word Embeddings for Low Frequency Words by Pseudo Contexts

This paper investigates the relation between word semantic density and word frequency. We define a word's average similarity, computed over distributed representations, as the measure of its semantic density. We find that the average similarity of low-frequency words is consistently higher than that of high-frequency words, and that once frequency approaches roughly 400, the average similarity stabilizes. The finding holds under changes to the size of the training corpus, the dimension of the distributed representations, and the number of negative samples in the skip-gram model, and it holds across 17 different languages. Based on this finding, we propose a pseudo-context skip-gram model, which makes use of the context words of the semantic nearest neighbors of target words. Experimental results show that our model achieves significant performance improvements on both word similarity and analogy tasks.

Fang Li, Xiaojie Wang
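The semantic density measure described above, a word's average similarity to its nearest neighbors in embedding space, can be sketched directly. The following NumPy snippet is a plausible reading of that measure; the choice of k = 10 neighbors and cosine similarity are assumptions on our part.

```python
import numpy as np

def avg_neighbor_similarity(vectors, word_idx, k=10):
    """Semantic density of a word: mean cosine similarity to its
    k nearest neighbors in the embedding space."""
    # L2-normalize so dot products are cosine similarities
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-8, None)
    sims = unit @ unit[word_idx]
    sims[word_idx] = -np.inf          # exclude the word itself
    top_k = np.sort(sims)[-k:]
    return top_k.mean()

# toy demo with random embeddings
rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 50))
print(avg_neighbor_similarity(emb, word_idx=42, k=10))
```

Plotting this quantity against word frequency is what yields the stabilization-around-400 observation the abstract reports.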
A Pipelined Pre-training Algorithm for DBNs

Deep networks have been widely used in many domains in recent years. However, pre-training deep networks with the greedy layer-wise algorithm is time consuming, and the algorithm's scalability is greatly restricted by its inherently sequential nature: only one hidden layer can be trained at a time. To speed up training, this paper focuses on the pre-training phase and proposes a pipelined pre-training algorithm that exploits a distributed cluster, significantly reducing pre-training time with no loss of recognition accuracy and using the cluster more efficiently than greedy layer-wise pre-training. We conduct comparative experiments between the greedy layer-wise and pipelined algorithms on the TIMIT corpus. The results show that the pipelined pre-training algorithm utilizes a distributed GPU cluster efficiently, achieving speed-ups of 2.84 with 4 slaves and 5.9 with 8 slaves at no loss of recognition accuracy, with a parallelization efficiency close to 0.73.

Zhiqiang Ma, Tuya Li, Shuangtao Yang, Li Zhang
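The pipelining idea, letting layer i train on activations streaming up from layer i-1 instead of waiting for the lower layer to finish, can be sketched with ordinary queues and threads. Everything below is illustrative: the RBMStub class stands in for a real contrastive divergence update, and the layer sizes and batch counts are arbitrary, not the paper's implementation.

```python
import queue
import threading
import numpy as np

class RBMStub:
    """Placeholder for one RBM layer: 'train' on a batch, then project it up."""
    def __init__(self, n_in, n_out, seed=0):
        self.w = np.random.default_rng(seed).normal(scale=0.01, size=(n_in, n_out))
    def train_step(self, batch):
        pass  # a contrastive divergence update would go here
    def up(self, batch):
        return 1.0 / (1.0 + np.exp(-batch @ self.w))  # hidden activations

def stage(rbm, q_in, q_out):
    """One pipeline stage: consume batches, train this layer, feed the next."""
    while True:
        batch = q_in.get()
        if batch is None:            # poison pill: propagate shutdown
            if q_out is not None:
                q_out.put(None)
            break
        rbm.train_step(batch)
        if q_out is not None:
            q_out.put(rbm.up(batch))

layers = [RBMStub(784, 512), RBMStub(512, 256)]
q0, q1 = queue.Queue(maxsize=4), queue.Queue(maxsize=4)
threads = [threading.Thread(target=stage, args=(layers[0], q0, q1)),
           threading.Thread(target=stage, args=(layers[1], q1, None))]
for t in threads:
    t.start()
for _ in range(10):                  # feed minibatches into the pipeline
    q0.put(np.random.rand(32, 784))
q0.put(None)
for t in threads:
    t.join()
```

In the paper's setting each stage would run on its own GPU worker; here threads and bounded queues merely illustrate how both layers stay busy at once.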
Enhancing LSTM-based Word Segmentation Using Unlabeled Data

Word segmentation is widely treated as a sequence labeling problem. The traditional approach to this kind of problem is a machine learning method such as conditional random fields with hand-crafted features. Recently, deep learning approaches have achieved state-of-the-art performance on the word segmentation task, with LSTM networks a popular choice among them. This paper presents a method for introducing statistical features counted on unlabeled data into LSTM networks and analyzes how this enhances the performance of a word segmentation model. We add pre-trained character-bigram embeddings, pointwise mutual information, accessor variety and punctuation variety to our model and compare their performance on different datasets, including three datasets from the CoNLL-2017 shared task and three datasets of simplified Chinese. We achieve state-of-the-art performance on two of them and comparable results on the rest.

Bo Zheng, Wanxiang Che, Jiang Guo, Ting Liu
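Of the unlabeled-data statistics listed above, pointwise mutual information is the most self-contained, so here is a small sketch of how it can be counted from raw text. How the scores are discretized and fed into the LSTM is not reproduced here; the function below only computes the raw statistic.

```python
from collections import Counter
from math import log

def char_bigram_pmi(text):
    """Pointwise mutual information of adjacent character pairs,
    estimated from raw unlabeled text."""
    chars = Counter(text)
    bigrams = Counter(zip(text, text[1:]))
    n_uni, n_bi = sum(chars.values()), sum(bigrams.values())
    pmi = {}
    for (a, b), c in bigrams.items():
        p_ab = c / n_bi
        p_a, p_b = chars[a] / n_uni, chars[b] / n_uni
        pmi[a + b] = log(p_ab / (p_a * p_b))
    return pmi

scores = char_bigram_pmi("我爱自然语言处理我爱语言")
print(sorted(scores.items(), key=lambda kv: -kv[1])[:3])
```

High-PMI character pairs tend to lie inside words, which is why the statistic is a useful segmentation feature.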

Machine Translation and Multilingual Information Processing

Frontmatter
Context Sensitive Word Deletion Model for Statistical Machine Translation

Word deletion (WD) errors can lead to poor comprehension of the meaning of the source sentence in phrase-based statistical machine translation (SMT), and have a critical impact on the adequacy of the translation results generated by SMT systems. In this paper, we first classify word deletions into two categories, wanted and unwanted. For these two kinds of word deletion, we propose a maximum entropy based word deletion model to improve translation quality in phrase-based SMT. Our proposed model is based on features automatically learned from a real-world bitext. In our experiments on Chinese-to-English news and web translation tasks, the results show that our approach generates more adequate translations than the baseline system, with the proposed word deletion model yielding a +0.99 BLEU improvement and a -2.20 TER reduction on the NIST machine translation evaluation corpora.

Qiang Li, Yaqian Han, Tong Xiao, Jingbo Zhu
Cost-Aware Learning Rate for Neural Machine Translation

Neural Machine Translation (NMT) has drawn much attention in recent years due to its promising translation performance. The conventional optimization algorithm for NMT sets a unified learning rate for each gold target word during training. However, words under different probability distributions should be handled differently. Thus, we propose a cost-aware learning rate method, which produces different learning rates for words with different costs. Specifically, for a gold word that ranks very low or has a large probability gap with the best candidate, the method produces a larger learning rate, and vice versa. Extensive experiments demonstrate the effectiveness of our proposed method.

Yang Zhao, Yining Wang, Jiajun Zhang, Chengqing Zong
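One plausible way to realize a cost-aware learning rate is to weight each token's loss term by the probability gap between the best candidate and the gold word, which effectively scales that token's learning rate. The sketch below assumes this loss-weighting formulation and an arbitrary scaling function 1 + alpha * gap; the paper's exact scaling function may differ.

```python
import torch
import torch.nn.functional as F

def cost_aware_scale(logits, gold, alpha=1.0):
    """Per-token learning-rate scale: larger when the gold word is
    ranked low or far behind the best candidate, smaller otherwise."""
    with torch.no_grad():                       # scale acts as a constant weight
        probs = F.softmax(logits, dim=-1)       # (batch, vocab)
        p_gold = probs.gather(1, gold.unsqueeze(1)).squeeze(1)
        p_best = probs.max(dim=-1).values
        gap = (p_best - p_gold).clamp(min=0.0)  # 0 when gold is already argmax
    return 1.0 + alpha * gap                    # scale in [1, 1 + alpha]

logits = torch.randn(4, 1000)                   # toy per-token vocabulary scores
gold = torch.randint(0, 1000, (4,))
scale = cost_aware_scale(logits, gold)
# apply as a weight on the per-token cross-entropy before backprop
loss = (scale * F.cross_entropy(logits, gold, reduction='none')).mean()
```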

Knowledge Graph and Information Extraction

Frontmatter
Integrating Word Sequences and Dependency Structures for Chemical-Disease Relation Extraction

Understanding chemical-disease relations (CDR) from biomedical literature is important for biomedical research and chemical discovery. This paper uses a k-max pooling convolutional neural network (CNN) to exploit word sequences and dependency structures for CDR extraction. Furthermore, an effective weighted context method is proposed to capture semantic information of word sequences. Our system extracts both intra- and inter-sentence level chemical-disease relations, which are merged as the final CDR. Experiments on the BioCreative V CDR dataset show that both word sequences and dependency structures are effective for CDR extraction, and their integration could further improve the extraction performance.

Huiwei Zhou, Yunlong Yang, Zhuang Liu, Zhe Liu, Yahui Men
Named Entity Recognition with Gated Convolutional Neural Networks

Most state-of-the-art models for named entity recognition (NER) rely on recurrent neural networks (RNNs), in particular long short-term memory (LSTM). Those models learn local and global features automatically with RNNs, so that hand-crafted features can be discarded, totally or partially. Recently, convolutional neural networks (CNNs) have achieved great success in computer vision, but for NER problems they are not well studied. In this work, we propose a novel architecture for NER based on GCNN, a CNN with a gating mechanism. Compared with RNN-based NER models, our proposed model has a remarkable advantage in training efficiency. We evaluate the proposed model on three data sets in two significantly different languages: the SIGHAN Bakeoff 2006 MSRA portion for simplified Chinese NER, the CityU portion for traditional Chinese NER, and the CoNLL 2003 shared task English portion for English NER. Our model obtains state-of-the-art performance on all three data sets.

Chunqi Wang, Wei Chen, Bo Xu
Improving Event Detection via Information Sharing Among Related Event Types

Event detection suffers from data sparseness and label imbalance due to the expensive cost of manually annotating events. To address this problem, we propose a novel approach that allows for information sharing among related event types. Specifically, we employ a fully connected three-layer artificial neural network as our basic model and propose a type-group regularization term to achieve the goal of information sharing. We conduct experiments with different configurations of type groups, and the experimental results show that information sharing among related event types remarkably improves detection performance. Compared with state-of-the-art methods, our proposed approach achieves a better F1 score on the widely used ACE 2005 event evaluation dataset.

Shulin Liu, Yubo Chen, Kang Liu, Jun Zhao, Zhunchen Luo, Wei Luo
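A type-group regularization term of the kind described above can be written as a penalty that pulls the classifier parameters of related event types toward their group mean, so related types share statistical strength. The sketch below assumes an L2 formulation over output-layer rows; the grouping and the exact penalty in the paper may differ.

```python
import torch

def type_group_regularizer(weight, groups, lam=1e-3):
    """Pull the classifier rows of related event types toward their
    group mean.
    weight: (n_types, dim) output-layer matrix; groups: list of index lists."""
    reg = weight.new_zeros(())
    for g in groups:
        rows = weight[g]                       # (len(g), dim)
        mean = rows.mean(dim=0, keepdim=True)
        reg = reg + ((rows - mean) ** 2).sum()
    return lam * reg

W = torch.randn(33, 128, requires_grad=True)   # e.g. 33 ACE event subtypes
groups = [[0, 1, 2], [3, 4], [5, 6, 7, 8]]     # hypothetical type groups
loss = type_group_regularizer(W, groups)
loss.backward()                                # added to the task loss in training
```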
Joint Extraction of Multiple Relations and Entities by Using a Hybrid Neural Network

This paper proposes a novel end-to-end neural model to jointly extract entities and relations in a sentence. Unlike most existing approaches, the proposed model uses a hybrid neural network to automatically learn sentence features and does not rely on any Natural Language Processing (NLP) tools, such as a dependency parser. Our model is further capable of modeling multiple relations and their corresponding entity pairs simultaneously. Experiments on the CoNLL04 dataset demonstrate that our model, using only word embeddings as input features, achieves state-of-the-art performance.

Peng Zhou, Suncong Zheng, Jiaming Xu, Zhenyu Qi, Hongyun Bao, Bo Xu
A Fast and Effective Framework for Lifelong Topic Model with Self-learning Knowledge

To discover semantically coherent topics, knowledge-based topic models have been proposed that incorporate prior knowledge into topic models. Moreover, some researchers have proposed lifelong topic models (LTM) to mine prior knowledge from topics generated from multi-domain corpora without human intervention. LTM incorporates the knowledge learned from multi-domain corpora into topic models by introducing the Generalized Polya Urn (GPU) model into Gibbs sampling. However, the GPU model is nonexchangeable, so topic inference for LTM is computationally expensive. Meanwhile, variational inference is an alternative to Gibbs sampling and tends to be faster; it is also flexible enough to infer topic models with knowledge, i.e., regularized topic models. In this paper, we propose a fast and effective framework for lifelong topic modeling, the Regularized Lifelong Topic Model with Self-learning Knowledge (RLTM-SK), with lexical knowledge automatically learned from previous topic extractions, and design a variational inference method to estimate the posterior distributions of its hidden variables. We compare our method with five state-of-the-art baselines on a dataset of product reviews from 50 domains. Results show that the performance of our method is comparable to LTM and other knowledge-based topic models, while our model is consistently faster than the best baseline, LTM.

Kang Xu, Feng Liu, Tianxing Wu, Sheng Bi, Guilin Qi
Collective Entity Linking on Relational Graph Model with Mentions

Given a source document with extracted mentions, entity linking maps each mention to an entity in a reference knowledge base. Previous entity linking approaches mainly focus on generic statistical features and link mentions independently. However, the interdependence among mentions in the same document, obtained from relational analysis, can improve accuracy. This paper proposes a collective entity linking model that effectively leverages the global interdependence among mentions in the same source document. The model unifies semantic relations and co-reference relations into relational inference for semantic information extraction, and a graph-based linking algorithm ensures that each mention is assigned exactly one candidate entity. Experiments show the proposed model significantly outperforms state-of-the-art relatedness approaches in terms of accuracy.

Jing Gong, Chong Feng, Yong Liu, Ge Shi, Heyan Huang
XLink: An Unsupervised Bilingual Entity Linking System

Entity linking is the task of linking mentions in text to the corresponding entities in a knowledge base. Recently, entity linking has received considerable attention and several online entity linking systems have been published. In this paper, we build an online bilingual entity linking system, XLink, based on Wikipedia and Baidu Baike. XLink performs two steps to link the mentions in an input document to entities in the knowledge base: mention parsing and entity disambiguation. To eliminate language dependency, we perform mention parsing without any named entity recognition tools. To ensure the correctness of linking results, we propose an unsupervised generative probabilistic method and utilize joint text and knowledge representations for entity disambiguation. Experiments show that our system achieves state-of-the-art performance with high time efficiency.

Jing Zhang, Yixin Cao, Lei Hou, Juanzi Li, Hai-Tao Zheng
Using Cost-Sensitive Ranking Loss to Improve Distant Supervised Relation Extraction

Recently, many researchers have concentrated on using neural networks to learn features for Distant Supervised Relation Extraction (DSRE). However, these approaches generally employ a softmax classifier with cross-entropy loss, which brings the noise of the artificial class NA into the classification process. Moreover, class imbalance is serious in the automatically labeled data and results in poor classification rates on minority classes for traditional approaches. In this work, we exploit a cost-sensitive ranking loss to improve DSRE. Our method first uses a Piecewise Convolutional Neural Network (PCNN) to embed the semantics of sentences; the features are then fed into a classifier whose loss accounts for both ranking and cost-sensitivity. Experiments show that our method is effective and performs better than state-of-the-art methods.

Daojian Zeng, Junxin Zeng, Yuan Dai
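A ranking loss with per-class costs, in the spirit of this abstract, can be sketched as the pairwise ranking objective of dos Santos et al. (2015) with an added class-weight vector. The margins, the gamma sharpness and the cost vector below are all illustrative assumptions, not the paper's reported settings.

```python
import torch
import torch.nn.functional as F

def ranking_loss(scores, gold, m_pos=2.5, m_neg=0.5, gamma=2.0, class_cost=None):
    """Pairwise ranking loss: push the gold class score above m_pos and
    the best wrong score below -m_neg; class_cost lets minority
    relations weigh more in the objective."""
    batch = torch.arange(scores.size(0))
    s_pos = scores[batch, gold]
    wrong = scores.clone()
    wrong[batch, gold] = float('-inf')          # mask out the gold class
    s_neg = wrong.max(dim=1).values             # hardest negative class
    loss = (F.softplus(gamma * (m_pos - s_pos))
            + F.softplus(gamma * (m_neg + s_neg)))
    if class_cost is not None:                  # cost-sensitivity per gold class
        loss = loss * class_cost[gold]
    return loss.mean()

scores = torch.randn(8, 27)                     # e.g. 27 relation classes
gold = torch.randint(0, 27, (8,))
cost = torch.ones(27)
cost[5] = 3.0                                   # hypothetical minority-class weight
print(ranking_loss(scores, gold, class_cost=cost))
```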
Multichannel LSTM-CRF for Named Entity Recognition in Chinese Social Media

Named Entity Recognition (NER) is a tough task in Chinese social media due to the large proportion of informal writing. Existing research uses only limited in-domain annotated data and achieves low performance. In this paper, we utilize both limited in-domain data and plentiful out-of-domain data through a domain adaptation method. We propose a multichannel LSTM-CRF model that employs different channels to capture general patterns, in-domain patterns and out-of-domain patterns in Chinese social media. Extensive experiments show that our model yields a 9.8% improvement over previous state-of-the-art methods. We further find that a shared embedding layer is important and that randomly initialized embeddings outperform pretrained ones.

Chuanhai Dong, Huijia Wu, Jiajun Zhang, Chengqing Zong

Language Resource and Evaluation

Frontmatter
Generating Chinese Classical Poems with RNN Encoder-Decoder

We take the generation of Chinese classical poetry as a sequence-to-sequence learning problem, and investigate the suitability of recurrent neural networks (RNNs) for the poetry generation task through various qualitative analyses. We then build a novel system based on the RNN Encoder-Decoder structure to generate quatrains (Jueju in Chinese) from a keyword as input. Our system jointly learns semantic meaning within a single sentence, semantic relevance among sentences in a poem, and the use of structural, rhythmical and tonal patterns, without utilizing any constraint templates. Experimental results show that our system outperforms other competitive systems.

Xiaoyuan Yi, Ruoyu Li, Maosong Sun
Collaborative Recognition and Recovery of the Chinese Intercept Abbreviation

An important task in information content security is monitoring the theme words of a text. The variety of Chinese expression, especially abbreviation, makes supervising theme words harder. The goal of this paper is to quickly and accurately discover intercept abbreviations from text crawled over a short time period. The paper first segments the target texts and uses a Support Vector Machine (SVM) to recognize abbreviation candidates among the wrongly segmented fragments. It then applies two collaborative methods: an improved Conditional Random Field (CRF) predicts the corresponding word for each character of an abbreviation, and, to resolve 1:n relationships, the ranked predictions are collaboratively merged with matches from a thesaurus of abbreviations. Experiments show that our recognition stage reaches 76.5% accuracy and 77.8% recall, and the recovery step reaches 62.1% accuracy, 20.8% higher than a method based on a Hidden Markov Model (HMM).

Jinshuo Liu, Yusen Chen, Juan Deng, Donghong Ji, Jeff Pan
Semantic Dependency Labeling of Chinese Noun Phrases Based on Semantic Lexicon

We present a simple algorithm for noun phrase interpretation based on a hand-crafted knowledge base containing detailed semantic information. The main idea is to define a set of relations that can hold between words, and to use a semantic lexicon with semantic classifications and collocation features to automatically assign relations to noun phrases. We divide NPs into two types, NPs with one verb or non-consecutive verbs and NPs with consecutive verbs, and design two different labeling methods according to their syntactic and semantic features. For the first type we report high precision, recall and F-score on a dataset with nine semantic relations; for the second type the results are also promising on a dataset with four relations. We create a valuable manually annotated resource for noun phrase interpretation, which we make publicly available in the hope of inspiring further research.

Yimeng Li, Yanqiu Shao, Hongkai Yang

Information Retrieval and Question Answering

Frontmatter
Bi-directional Gated Memory Networks for Answer Selection

Answer selection is a crucial subtask of open domain question answering. In this paper, we introduce the Bi-directional Gated Memory Network (BGMN) to model the interactions between question and answer. We match question P and answer Q in two directions. In each direction (for example, P→Q), the sentence representation of P triggers an iterative attention process that aggregates informative evidence from Q. In each iteration, the sentence representation of P and the evidence of Q aggregated so far are passed through a gate that determines their respective importance when attending to each step of Q. Finally, based on the aggregated evidence, the decision is made through a fully connected network. Experimental results on the SemEval-2015 Task 3 dataset demonstrate that our proposed method substantially outperforms several strong baselines. Further experiments show that our model is general and can be applied to other sentence-pair modeling tasks.

Wei Wu, Houfeng Wang, Sujian Li
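One gated attention hop of the kind this abstract describes, where a gate arbitrates between the sentence representation and the evidence aggregated so far before attending over the other sequence, might look as follows. The gating formula and dimensions are assumptions about the general shape of the mechanism, not the published BGMN equations.

```python
import torch
import torch.nn as nn

class GatedAttentionHop(nn.Module):
    """One hop of iterative attention: a gate decides how much the
    sentence vector p vs. the evidence-so-far e drives attention over
    the other sequence's hidden states."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)
        self.att = nn.Linear(dim, dim, bias=False)

    def forward(self, p, e, q_states):
        # p, e: (batch, dim); q_states: (batch, len_q, dim)
        g = torch.sigmoid(self.gate(torch.cat([p, e], dim=-1)))
        query = g * p + (1 - g) * e                    # gated query vector
        scores = torch.einsum('bld,bd->bl', q_states, self.att(query))
        alpha = torch.softmax(scores, dim=-1)
        evidence = torch.einsum('bl,bld->bd', alpha, q_states)
        return e + evidence                            # accumulate evidence

hop = GatedAttentionHop(dim=64)
p = torch.randn(2, 64)          # sentence representation of P
e = torch.zeros(2, 64)          # evidence starts empty
q = torch.randn(2, 12, 64)      # hidden states of Q
for _ in range(3):              # three attention hops
    e = hop(p, e, q)
```

The bi-directionality of BGMN would run the same process with the roles of P and Q swapped.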
Generating Textual Entailment Using Residual LSTMs

Generating textual entailment (GTE) is a recently proposed task that studies how to infer a sentence from a given premise. Current sequence-to-sequence GTE models are prone to producing invalid sentences when faced with sufficiently complex premises, and the lack of appropriate evaluation criteria hinders research on GTE. In this paper, we conjecture that an underpowered encoder is the major bottleneck in generating more meaningful sequences, and address this by employing a residual LSTM network. With the extended model, we obtain state-of-the-art results. Furthermore, we propose a novel metric for GTE, EBR (Evaluated By Recognizing textual entailment), which evaluates different GTE approaches objectively and fairly without human effort while also considering the diversity of inferences. Finally, we point out the limitations of adapting a general sequence-to-sequence framework to the GTE setting and offer proposals for future research, hoping to generate more public discussion.

Maosheng Guo, Yu Zhang, Dezhi Zhao, Ting Liu
Unsupervised Joint Entity Linking over Question Answering Pair with Global Knowledge

We consider the task of entity linking over question answering pairs (QA-pairs). Conventional entity linking treats all entities the same, whether or not they occur in one sentence. In entity linking over QA-pairs, however, the question entity and answer entity are no longer fully equivalent and stand in an explicit semantic relation. We propose an unsupervised method that utilizes global knowledge of QA-pairs in a knowledge base (KB). First, we collect large-scale Chinese QA-pairs and their corresponding triples in the knowledge base. We then mine global knowledge such as relation probabilities and the linking similarity between question entity and answer entity. Finally, we integrate this global knowledge with other basic features and constraints via integer linear programming (ILP) in an unsupervised manner. The experimental results show that each kind of global knowledge improves performance. Our best F-measure on QA-pairs is 53.7%, a significant increase of 6.5% over a competitive baseline.

Cao Liu, Shizhu He, Hang Yang, Kang Liu, Jun Zhao
Hierarchical Gated Recurrent Neural Tensor Network for Answer Triggering

In this paper, we focus on the answer triggering problem introduced by Yang et al. (2015), which is a critical component of a real-world question answering system. We employ a hierarchical gated recurrent neural tensor (HGRNT) model to capture both context information and the deep interactions between the candidate answers and the question. Our F value reaches 42.6%, surpassing the baseline by over 10%.

Wei Li, Yunfang Wu
Question Answering with Character-Level LSTM Encoders and Model-Based Data Augmentation

This paper presents a character-level encoder-decoder modeling method for question answering (QA) over large-scale knowledge bases (KB). The method improves the existing approach [9] in three respects. First, long short-term memory (LSTM) structures replace the convolutional neural networks (CNN) for encoding candidate entities and predicates. Second, a new strategy for generating negative training samples is adopted. Third, a data augmentation strategy increases the size of the training set by generating factoid questions with another trained encoder-decoder model. Experimental results on the SimpleQuestions dataset and the Freebase5M KB demonstrate the effectiveness of the proposed method, which improves the state-of-the-art accuracy from 70.3% to 78.8% when the training set is augmented with 70,000 generated triple-question pairs.

Run-Ze Wang, Chen-Di Zhan, Zhen-Hua Ling
Exploiting Explicit Matching Knowledge with Long Short-Term Memory

Recently, neural network models have been widely applied to text-matching tasks such as community-based question answering (cQA). The strong generalization power of neural networks lets these methods find texts with similar topics, but they miss detailed matching information. However, as traditional methods have demonstrated, explicit lexical matching knowledge is important for effective answer retrieval. In this paper, we propose ExMaLSTM, a model that incorporates explicit matching knowledge into a long short-term memory (LSTM) network. We extract explicit lexical matching features with prior knowledge and add them to the local representations of questions, then summarize the overall matching status with a bi-directional LSTM. The final relevance score is calculated by a gate network that dynamically assigns appropriate weights to the explicit matching score and the implicit relevance score. We conduct extensive experiments on answer retrieval with a cQA dataset. The results show that our proposed ExMaLSTM model significantly outperforms both traditional methods and various state-of-the-art neural network models.

Xinqi Bao, Yunfang Wu
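The gate network that blends the explicit matching score with the implicit relevance score can be sketched in a few lines. The feature dimension and the sigmoid gate below are assumptions about the general shape of the mechanism, not the paper's exact parameterization.

```python
import torch
import torch.nn as nn

class ScoreGate(nn.Module):
    """Final scoring layer: a gate dynamically weights the explicit
    lexical-matching score against the implicit LSTM-based score."""
    def __init__(self, feat_dim):
        super().__init__()
        self.gate = nn.Linear(feat_dim, 1)

    def forward(self, explicit_score, implicit_score, features):
        # features: whatever summary the model exposes, e.g. pooled LSTM states
        g = torch.sigmoid(self.gate(features)).squeeze(-1)
        return g * explicit_score + (1 - g) * implicit_score

gate = ScoreGate(feat_dim=32)
feats = torch.randn(4, 32)                # hypothetical summary features
print(gate(torch.rand(4), torch.rand(4), feats))
```

The convex combination keeps the final score interpretable: g near 1 means the lexical evidence dominates for that question-answer pair.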

Text Classification and Summarization

Frontmatter
Topic-Specific Image Caption Generation

Recently, image captioning, which aims to generate a textual description for an image automatically, has attracted researchers from various fields, and encouraging performance has been achieved by applying deep neural networks. Most of this work aims at generating a single caption, which may be incomprehensive, especially for complex images. This paper proposes a topic-specific multi-caption generator, which first infers topics from an image and then generates a variety of topic-specific captions, each of which depicts the image from a particular topic. We perform experiments on Flickr8k, Flickr30k and MSCOCO. The results show that the proposed model performs better than a single-caption generator when generating topic-specific captions, and that it effectively generates a diversity of captions under reasonable topics, which differ from each other at the topic level.

Chang Zhou, Yuzhao Mao, Xiaojie Wang
Deep Learning Based Document Theme Analysis for Composition Generation

This paper puts forward the theme analysis problem in order to automatically solve composition writing questions from the Chinese college entrance examination. Theme analysis distills embedded semantic information from given materials or documents. We propose a hierarchical neural network framework to address this problem and present two deep learning based models under it. In addition, we try two transfer learning strategies based on the proposed models to deal with the lack of large training data for composition theme analysis. Experimental results on two tag recommendation data sets show the effectiveness of the proposed deep learning based theme analysis models, and we show the effect of the proposed model with transfer learning on a composition writing questions data set we built ourselves.

Jiahao Liu, Chengjie Sun, Bing Qin
UIDS: A Multilingual Document Summarization Framework Based on Summary Diversity and Hierarchical Topics

In this paper, we put forward UIDS, a new high-performing, extensible framework for extractive multilingual document summarization. Our approach treats each document in a multilingual corpus as a set of item sequences, in which each sentence is an item sequence and each item is a minimal semantic unit. We then formalize extractive summarization as a summary diversity sampling problem that considers topic diversity and redundancy at the same time: topic diversity is captured using hierarchical topic models, redundancy using similarity, and summary diversity is enhanced using Determinantal Point Processes. We illustrate how this method forms a framework amenable to computing summaries for multilingual single- and multi-documents. Experiments on the MultiLing summarization task datasets demonstrate the effectiveness of our approach.

Lei Li, Yazhao Zhang, Junqi Chi, Zuying Huang
Conceptual Multi-layer Neural Network Model for Headline Generation

Neural attention-based models have recently been widely used for headline generation by mapping a source document to a target headline. However, traditional neural headline generation models use only the first sentence of the document as training input, ignoring the impact of document-level concept information on headline generation. In this work, we propose a new neural attention-based model, the concept-sensitive neural headline model, which connects the concept information of the document to the input text for headline generation and achieves satisfactory results. In addition, we use a multi-layer Bi-LSTM encoder instead of a single layer. Experiments show that our model outperforms state-of-the-art systems on the DUC-2004 and Gigaword test sets.

Yidi Guo, Heyan Huang, Yang Gao, Chi Lu

Social Computing and Sentiment Analysis

Frontmatter
Local Community Detection Using Social Relations and Topic Features in Social Networks

Local community detection is an important research focus in social network analysis. Most existing methods share the intrinsic limitation of utilizing undirected and unweighted networks. In this paper, we propose a novel local community detection algorithm that fuses social relations and topic features in social networks. By defining a new social similarity, the proposed algorithm can effectively reveal the dynamic characteristics in social networks. In addition, the topic similarity is measured by Jensen–Shannon divergence, in which the topics are extracted from the user-generated content by topic models. Extensive experiments conducted on a real social network dataset demonstrate that our proposed algorithm outperforms methods based on social relations or topic features alone.

Chengcheng Xu, Huaping Zhang, Bingbing Lu, Songze Wu
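The topic-similarity component of the abstract above is concrete enough to sketch: Jensen-Shannon divergence between two users' topic distributions, converted into a similarity. The conversion 1 - JSD/ln 2 is our assumption; the paper may normalize differently.

```python
import numpy as np

def jensen_shannon(p, q, eps=1e-12):
    """JS divergence between two topic distributions (e.g. from LDA)."""
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))   # KL divergence
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def topic_similarity(p, q):
    """Turn the divergence into a similarity in [0, 1]."""
    return 1.0 - jensen_shannon(p, q) / np.log(2)  # JSD is bounded by ln 2

u1 = [0.7, 0.2, 0.1]    # hypothetical per-user topic proportions
u2 = [0.6, 0.3, 0.1]
print(topic_similarity(u1, u2))
```

Combining this with a social-relation similarity is what gives the fused score the algorithm expands communities with.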

NLP Applications

Frontmatter
DIM Reader: Dual Interaction Model for Machine Comprehension

Enabling a computer to understand a document so that it can answer comprehension questions is a central yet unsolved goal of natural language processing, making reading comprehension of text an important problem in NLP research. In this paper, we propose a novel dual interaction model, DIM Reader (our code is available at https://github.com/dlt/mrc-dim), which constructs a dual iterative alternating attention mechanism over multiple hops. DIM Reader continually refines its view of the query and document while aggregating the information required to answer a query, computing attention not only over the document but also over the query, so that both benefit from their mutual information. It uses multiple turns to effectively exploit and perform deeper inference over queries and documents. We conduct extensive experiments on the CNN/Daily Mail news datasets, and our model achieves the best results on both machine comprehension datasets among almost all published results.

Zhuang Liu, Degen Huang, Kaiyu Huang, Jing Zhang
Multi-view LSTM Language Model with Word-Synchronized Auxiliary Feature for LVCSR

Recently, the long short-term memory language model (LSTM LM) has received tremendous interest from both the language and speech communities, due to its superiority at modelling long-term dependencies. Moreover, integrating auxiliary information, such as context features, into the LSTM LM has shown improvements in perplexity (PPL). However, improperly fed auxiliary information does not give consistent gains in word error rate (WER) on large vocabulary continuous speech recognition (LVCSR) tasks. To solve this problem, a multi-view LSTM LM architecture combined with a tagging model is proposed in this paper. First, an on-line unidirectional LSTM-RNN is built as a tagging model that generates word-synchronized auxiliary features. The auxiliary features from the tagging model are then combined with the word sequence to train a multi-view unidirectional LSTM LM. Different training modes for the tagging model and language model are explored and compared. The new architecture is evaluated on the PTB, Fisher English and SMS Chinese data sets, and the results show not only improved LM perplexity but also that the improvements transfer to WER reductions in an ASR rescoring task.

Yue Wu, Tianxing He, Zhehuai Chen, Yanmin Qian, Kai Yu
Memory Augmented Attention Model for Chinese Implicit Discourse Relation Recognition

Recently, Chinese implicit discourse relation recognition has attracted increasing attention, since it is crucial for understanding Chinese discourse text. In this paper, we propose a novel memory-augmented attention model that represents the arguments using an attention-based neural network and preserves crucial information in an external memory network, which captures the clustering structure of each discourse relation to support relation inference. Extensive experiments demonstrate that our proposed model achieves new state-of-the-art results on the Chinese Discourse Treebank. We further use network visualization to show why our attention and memory components are effective.

Yang Liu, Jiajun Zhang, Chengqing Zong
Natural Logic Inference for Emotion Detection

Current research on emotion detection focuses on recognizing explicit emotion expressions in text. In this paper, we propose an approach based on textual inference to detect implicit emotion expressions, that is, we cast emotion detection as a logical inference problem. The approach builds a natural logic system in which emotion detection is decomposed into a series of logical inference steps. The system also employs inference knowledge from textual inference resources to reason over complex expressions in emotional texts. Experimental results show the approach is effective at detecting implicit emotional expressions.

Han Ren, Yafeng Ren, Xia Li, Wenhe Feng, Maofu Liu

Minority Language Information Processing

Frontmatter
Tibetan Syllable-Based Functional Chunk Boundary Identification

Tibetan syntactic functional chunk parsing aims to identify the syntactic constituents of Tibetan sentences. In this paper, based on the Tibetan syntactic functional chunk description system, we propose a method that groups syllables instead of performing word segmentation and tagging, and uses Conditional Random Fields (CRFs) to identify the functional chunk boundaries of a sentence. According to the characteristics of the Tibetan language, we first identify and extract syntactic markers, which occur in both sticky and non-sticky written forms, as identification features for chunk boundaries in a text preprocessing stage; we then identify the syntactic functional chunk boundaries with a CRF. Experiments on a Tibetan corpus containing 46,783 syllables achieve a precision, recall and F value of 75.70%, 82.54% and 79.12% respectively. The results show that the proposed method is effective on a small-scale unlabeled corpus and can provide foundational support for natural language processing applications such as machine translation.

Shumin Shi, Yujian Liu, Tianhang Wang, Congjun Long, Heyan Huang
Harvest Uyghur-Chinese Aligned-Sentences Bitexts from Multilingual Sites Based on Word Embedding

Obtaining bilingual parallel data from multilingual websites is a long-standing research problem that is very beneficial for resource-scarce languages. In this paper, we present an approach for obtaining parallel data based on word embeddings; our model relies only on a small-scale bilingual lexicon. The approach benefits from recent advances in continuous word representations, which reveal more context information than traditional methods. Our experiments show that high-precision, sizable parallel Uyghur-Chinese data can be obtained even when a large bilingual lexicon is lacking.

ShaoLin Zhu, Xiao Li, YaTing Yang, Lei Wang, ChengGang Mi
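A minimal version of embedding-based parallel sentence scoring along the lines of this abstract: map source words through the small bilingual lexicon, average word vectors on each side, and score the pair by cosine similarity; pairs above a threshold would be kept as candidates. The toy embeddings and the hypothetical lexicon entries below are ours, not the paper's data.

```python
import numpy as np

def sentence_vec(tokens, emb, lexicon=None):
    """Average word vectors; optionally map source words through a
    small bilingual lexicon into the target embedding space first."""
    mapped = [lexicon.get(t, t) for t in tokens] if lexicon else tokens
    vecs = [emb[t] for t in mapped if t in emb]
    if not vecs:
        return np.zeros(next(iter(emb.values())).shape)
    return np.mean(vecs, axis=0)

def parallel_score(src_tokens, tgt_tokens, emb, lexicon):
    a = sentence_vec(src_tokens, emb, lexicon)
    b = sentence_vec(tgt_tokens, emb)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# toy target-side embeddings and hypothetical lexicon entries
emb = {w: np.random.default_rng(hash(w) % 2**32).normal(size=50)
       for w in ["我", "爱", "中国"]}
lexicon = {"men": "我", "yahshi": "爱", "Junggo": "中国"}
print(parallel_score(["men", "Junggo"], ["我", "中国"], emb, lexicon))
```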
Language Model for Mongolian Polyphone Proofreading

Mongolian text proofreading is a particularly difficult task because of the language's unique polyphonic alphabet, morphological ambiguity and agglutinative nature, and coding errors are currently pervasive in electronic-edition Mongolian corpora, which makes Mongolian statistical and retrieval research difficult to carry out. Conventional approaches to this problem are limited in that they do not consider proofreading of polyphones. In this paper, we address the problem by constructing a large-scale resource and applying an n-gram language model based approach. For ease of understanding, the entire proofreading system architecture is also introduced, since polyphone proofreading is an important component of it. Experimental results show that our method performs well: polyphone correction accuracy is relatively improved by 62% and overall system accuracy by 16.1%.

Min Lu, Feilong Bao, Guanglai Gao
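An n-gram language model for polyphone proofreading can be reduced to scoring the candidate readings of a sentence and keeping the most probable one. The add-one-smoothed bigram model and the Latin-transcribed toy Mongolian below are assumptions for illustration only.

```python
from collections import Counter
from math import log

class BigramLM:
    """Tiny add-one-smoothed bigram LM for scoring candidate readings."""
    def __init__(self, corpus_sentences):
        self.uni, self.bi = Counter(), Counter()
        for sent in corpus_sentences:
            toks = ["<s>"] + sent
            self.uni.update(toks)
            self.bi.update(zip(toks, toks[1:]))
        self.v = len(self.uni)   # vocabulary size for smoothing

    def logprob(self, sent):
        toks = ["<s>"] + sent
        return sum(log((self.bi[(a, b)] + 1) / (self.uni[a] + self.v))
                   for a, b in zip(toks, toks[1:]))

def correct(candidates, lm):
    """Pick the candidate word sequence the LM finds most probable."""
    return max(candidates, key=lm.logprob)

lm = BigramLM([["sain", "baina"], ["ta", "sain", "baina", "uu"]])
# hypothetical transcribed candidates for an ambiguous polyphone
print(correct([["sain", "baina"], ["sain", "bayina"]], lm))
```

In a full system the candidate set comes from enumerating the possible readings of each polyphonic letter sequence; the LM then disambiguates in context.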
End-to-End Neural Text Classification for Tibetan

As a minority language, Tibetan has received relatively little attention in the field of natural language processing (NLP), especially from current neural network models. In this paper, we investigate three end-to-end neural models for Tibetan text classification. The experimental results show that the end-to-end models outperform traditional Tibetan text classification methods. The dataset and code are available at https://github.com/FudanNLP/Tibetan-Classification.

Nuo Qun, Xing Li, Xipeng Qiu, Xuanjing Huang
Backmatter
Metadata
Title
Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data
Edited by
Maosong Sun
Xiaojie Wang
Baobao Chang
Deyi Xiong
Copyright Year
2017
Electronic ISBN
978-3-319-69005-6
Print ISBN
978-3-319-69004-9
DOI
https://doi.org/10.1007/978-3-319-69005-6