About this book

This book constitutes the refereed proceedings of the 6th CCF International Conference on Natural Language Processing, NLPCC 2017, held in Dalian, China, in November 2017.

The 47 full papers and 39 short papers presented were carefully reviewed and selected from 252 submissions. The papers are organized around the following topics: IR/search/bot; knowledge graph/IE/QA; machine learning; machine translation; NLP applications; NLP fundamentals; social networks; and text mining.

IR/Search/Bot





Jointly Modeling Intent Identification and Slot Filling with Contextual and Hierarchical Information

Intent classification and slot filling are two critical subtasks of natural language understanding (NLU) in task-oriented dialogue systems. Previous work has made use of either hierarchical or contextual information when jointly modeling intent classification and slot filling, showing that each is helpful for joint models. This paper proposes a cluster of joint models that encode both types of information at the same time. Experimental results on different datasets show that the proposed models outperform joint models that lack either hierarchical or contextual information. In addition, balancing the loss functions of the two subtasks is important for achieving the best overall performance.

Liyun Wen, Xiaojie Wang, Zhenjiang Dong, Hong Chen

Augmenting Neural Sentence Summarization Through Extractive Summarization

Neural sequence-to-sequence models have achieved great success in abstractive summarization. However, because input length is limited, most previous work can only use the lead sentences as input when generating an abstractive summary, which ignores crucial information in the document. To alleviate this problem, we propose a novel approach that improves neural sentence summarization by using extractive summarization, with the aim of exploiting as much of the document information as possible. Furthermore, we present both a streamline strategy and a system combination strategy to fuse the contents of different views, which can be easily adapted to other domains. Experimental results on the CNN/Daily Mail dataset demonstrate that both proposed strategies significantly improve the performance of neural sentence summarization.

Junnan Zhu, Long Zhou, Haoran Li, Jiajun Zhang, Yu Zhou, Chengqing Zong

Cascaded LSTMs Based Deep Reinforcement Learning for Goal-Driven Dialogue

This paper proposes a deep neural network model for jointly modeling Natural Language Understanding and Dialogue Management in goal-driven dialogue systems. The model has three parts. A Long Short-Term Memory (LSTM) network at the bottom encodes the utterances in each dialogue turn into a turn embedding. An LSTM in the middle learns dialogue embeddings, which are updated as the turn embeddings are fed in. The top part is a feed-forward Deep Neural Network that converts dialogue embeddings into the Q-values of different dialogue actions. The cascaded-LSTM reinforcement learning network is jointly optimized using the rewards received at each dialogue turn as the only supervision signal. The network contains no explicit NLU component or dialogue states. Experimental results show that our model outperforms both a traditional Markov Decision Process (MDP) model and a single LSTM with a Deep Q-Network on meeting room booking tasks. Visualization of the dialogue embeddings illustrates that the model can learn a representation of dialogue states.

Yue Ma, Xiaojie Wang, Zhenjiang Dong, Hong Chen

Dialogue Intent Classification with Long Short-Term Memory Networks

Dialogue intent analysis plays an important role in dialogue systems. In this paper, we present a deep hierarchical LSTM model to classify the intent of a dialogue utterance. The model is able to recognize and classify a user’s dialogue intent efficiently. Moreover, we add a memory module to the hierarchical LSTM model so that it can use more context information for classification. We evaluate the two proposed models on a real-world conversational dataset from a well-known Chinese e-commerce service. The experimental results show that our proposed models outperform the baselines.

Lian Meng, Minlie Huang

An Ensemble Approach to Conversation Generation

As an important step in human-computer interaction, conversation generation has attracted growing attention in recent years. This paper gives a detailed description of an ensemble system for short text conversation generation. The proposed system consists of four subsystems: a quick-response candidate selection module, an information retrieval system, a generation-based system, and an ensemble module. An advantage of this system is that multiple versions of generated responses are taken into account, resulting in a more reliable output. In the NLPCC 2017 shared task “Emotional Conversation Generation Challenge”, the ensemble system generates appropriate responses for Chinese SNS posts and ranks at the top of the participant list.

Yimeng Zhuang, Xianliang Wang, Han Zhang, Jinghui Xie, Xuan Zhu

First Place Solution for NLPCC 2017 Shared Task Social Media User Modeling

With the popularity of the mobile Internet, many social networking applications let users share their personal information. Leveraging users’ personal information, such as tweets, preferences, and locations, for user profiling is of high commercial value. The user profiling task has two subtasks. Subtask one is to predict the Point-of-Interest (POI) a user will check in at. We combined the results of multiple approaches, including user-based collaborative filtering (CF) and social-based CF, to predict the locations. Subtask two is to predict the users’ gender. We divided the users into two groups, depending on whether the user has posted or not, and treated this subtask as a classification task. Our results achieved first place in both subtasks.

Lingfei Qian, Anran Wang, Yan Wang, Yuhang Huang, Jian Wang, Hongfei Lin
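
The user-based collaborative filtering mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' system: the check-in data, the function names `cosine` and `recommend_pois`, the cosine similarity measure, and the choice to rank only POIs the user has not visited yet are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse check-in count vectors (dicts)."""
    common = set(a) & set(b)
    num = sum(a[p] * b[p] for p in common)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return num / den if den else 0.0


def recommend_pois(checkins, user, top_n=3):
    """Rank unvisited POIs by similarity-weighted check-ins of other users."""
    target = checkins[user]
    scores = defaultdict(float)
    for other, visits in checkins.items():
        if other == user:
            continue
        sim = cosine(target, visits)
        if sim <= 0:
            continue  # users with no overlap contribute nothing
        for poi, count in visits.items():
            if poi not in target:  # assumption: only suggest new POIs
                scores[poi] += sim * count
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda p: (-scores[p], p))[:top_n]


# Hypothetical toy data: user -> {POI: number of check-ins}.
checkins = {
    "alice": {"cafe": 3, "gym": 1},
    "bob": {"cafe": 2, "gym": 1, "library": 4},
    "carol": {"park": 5, "museum": 1},
}
recommended = recommend_pois(checkins, "alice")
```

Here only `bob` shares POIs with `alice`, so his unvisited POI is the sole candidate; a social-based variant would presumably replace the similarity with one derived from the friendship graph.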

Knowledge Graph/IE/QA


Large-Scale Simple Question Generation by Template-Based Seq2seq Learning

Numerous machine learning tasks have achieved substantial advances over the past decade with the help of large-scale supervised learning corpora. However, no large-scale question-answer corpus is available for Chinese question answering over knowledge bases. In this paper, we present a 28M Chinese Q&A corpus based on the Chinese knowledge base provided by the NLPCC 2017 KBQA challenge. We propose a novel neural network architecture that combines a template-based method with seq2seq learning to generate highly fluent and diverse questions. Both automatic and human evaluation results show that our model achieves outstanding performance (76.8 BLEU and 43.1 ROUGE). We also propose a new statistical metric called DIVERSE to measure the linguistic diversity of generated questions, and show that our model generates much more diverse questions than the baselines.

Tianyu Liu, Bingzhen Wei, Baobao Chang, Zhifang Sui

A Dual Attentive Neural Network Framework with Community Metadata for Answer Selection

Community-based question answering (cQA) sites have become popular Web services and have accumulated millions of questions and their associated answers over time. The answer selection component, which ranks the relevant answers to a given question, therefore plays an important role in a cQA system. As this area has developed, the problems of noise prevalence and data sparsity have become more severe. In this paper, we consider the task of answer selection from two aspects: deep semantic matching and user community metadata representation. We propose a novel dual attentive neural network framework (DANN) that embeds question topics and user network structures for answer selection. The representations of questions and answers are first learned by convolutional neural networks (CNNs). The DANN then learns the interactions of questions and answers, guided by user network structures and by semantic matching of question topics through double attention. We evaluate the performance of our method on the well-known question answering site Stack Exchange. The experiments show that our framework outperforms other state-of-the-art solutions to the problem.

Zhiqiang Liu, Mengzhang Li, Tianyu Bai, Rui Yan, Yan Zhang

Geography Gaokao-Oriented Knowledge Acquisition for Comparative Sentences Based on Logic Programming

Multiple-choice questions that compare one entity with another are very common in university entrance examinations such as the Gaokao in China, but answering them requires a high level of knowledge. As a preliminary attempt to address this problem, we build a geography Gaokao-oriented knowledge acquisition system for comparative sentences based on logic programming to help solve real geography examinations. Our work consists of two consecutive tasks: identifying comparative sentences in geographical texts and extracting comparative elements from the identified comparative sentences. Specifically, for the former task, logic programming is employed to filter out non-comparative sentences, and for the latter task, dependency grammar and heuristic position information are used to represent the relations among comparative elements. The experimental results show that our system achieves outstanding performance for practical use.

Xuelian Li, Qian Liu, Man Zhu, Feifei Xu, Yunxiu Yu, Shang Zhang, Zhaoxi Ni, Zhiqiang Gao

Chinese Question Classification Based on Semantic Joint Features

Question classification is an important research topic in automatic question-answering systems. Chinese question sentences differ from both long texts and short texts such as product comments. They generally contain interrogative words such as who, which, where, or how to specify the information required, and they include complete grammatical components. Based on these characteristics, we propose a more effective feature extraction method for Chinese question classification. We first extract the head verb of the sentence and its dependency words, combined with the interrogative words of the sentence, as our base features. We then use latent semantic analysis to help remove semantic noise from the base features. Finally, we expand those features into semantic representation features using our weighted word-embedding method. Several experimental results show that our semantic joint feature extraction method outperforms classical syntactic-based and content-vector-based methods and is superior to a convolutional neural network based sentence classification method.

Xia Li, HanFeng Liu, ShengYi Jiang

A Chinese Question Answering System for Single-Relation Factoid Questions

Aiming at the task of open domain question answering based on knowledge base in NLPCC 2017, we build a question answering system which can automatically find the promised entities and predicates for single-relation questions. After a features based entity linking component and a word vector based candidate predicates generation component, deep convolutional neural networks are used to rerank the entity-predicate pairs, and all intermediary scores are used to choose the final predicted answers. Our approach achieved the F1-score of 47.23% on test data which obtained the first place in the contest of NLPCC 2017 Shared Task 5 (KBQA sub-task). Furthermore, there are also a series of experiments which can help other developers understand the contribution of every part of our system.

Yuxuan Lai, Yanyan Jia, Yang Lin, Yansong Feng, Dongyan Zhao

Enhancing Document-Based Question Answering via Interaction Between Question Words and POS Tags

Document-based question answering is the task of selecting the answer from a set of candidate sentences for a given question. Most existing work focuses on sentence-pair modeling but ignores the peculiarities of question-answer pairs. This paper proposes to model the interaction between question words and POS tags as a special kind of information that is peculiar to question-answer pairs. Such information is integrated into a neural model for answer selection. Experimental results on the DBQA task show that our model achieves better results than several state-of-the-art systems. In addition, it achieves the best result on the NLPCC 2017 Shared Task on DBQA.

Zhipeng Xie

Machine Learning


A Deep Learning Way for Disease Name Representation and Normalization

Disease name normalization aims at mapping various disease names to standardized disease vocabulary entries. Disease names vary so widely that dictionary lookup methods cannot achieve high accuracy on this task. DNorm is the first machine learning approach for this task, but it is not robust enough because it depends strongly on the training dataset. In this article, we propose a deep learning approach to disease name representation and normalization. Representations of the component words can be learned from a large unlabelled literature corpus, and rich semantic and syntactic properties of disease names are encoded in the representations during this process. With the new representations for disease names, a higher accuracy is achieved in the normalization task.

Hongwei Liu, Yun Xu

Externally Controllable RNN for Implicit Discourse Relation Classification

Without discourse connectives, recognizing implicit discourse relations is a great challenge and a bottleneck for discourse parsing. The key lies in properly representing the two discourse arguments and in modeling their interactions. This paper proposes two novel neural networks, i.e., externally controllable LSTM (ECLSTM) and attention-augmented GRU (AAGRU), which can be stacked to incorporate the arguments’ interactions into their representation process. The two networks are variants of the Recurrent Neural Network (RNN) but are equipped with externally controllable cells whose working processes can be dynamically regulated. ECLSTM is relatively conservative and easily comprehensible, while AAGRU works better for small datasets. A multilevel RNN with a smaller hidden state allows critical information to be gradually exploited, and thus enables our model to fit deeper structures with only slightly increased complexity. Experiments on the Penn Discourse Treebank (PDTB) benchmark show that our method achieves a significant performance gain over vanilla LSTM/CNN models and is competitive with previous state-of-the-art models.

Xihan Yue, Luoyi Fu, Xinbing Wang

Random Projections with Bayesian Priors

Random projection is a dimension reduction technique in which high dimensional vectors in $$\mathbb R^D$$ are projected down to a smaller subspace in $$\mathbb R^k$$. Certain forms of distances or distance kernels, such as Euclidean distances, inner products [10], and $$l_p$$ distances [12] between high dimensional vectors, are approximately preserved in this lower dimensional subspace. Word vectors represented in a bag-of-words model can thus be projected down to a smaller subspace via random projections, and their relative similarity computed via distance metrics. We propose using marginal information and Bayesian probability to improve the estimates of the inner product between pairs of vectors, and demonstrate our results on actual datasets.

Keegan Kang
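
As a rough illustration of the baseline that the abstract above builds on, the following sketch projects two vectors with a Gaussian random matrix and compares the projected inner product with the exact one. It shows only plain random projection, not the paper's Bayesian estimator; the dimensions, seeds, and function names are arbitrary assumptions.

```python
import math
import random


def random_projection_matrix(D, k, seed=0):
    """k x D matrix with i.i.d. standard normal entries (one common choice)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(k)]


def project(R, x):
    """Map x in R^D to R^k; the 1/sqrt(k) scale makes inner products unbiased."""
    scale = 1.0 / math.sqrt(len(R))
    return [scale * sum(r * xi for r, xi in zip(row, x)) for row in R]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Hypothetical data: two non-negative vectors, as bag-of-words counts would be.
data_rng = random.Random(1)
D, k = 500, 400
x = [data_rng.random() for _ in range(D)]
y = [data_rng.random() for _ in range(D)]

R = random_projection_matrix(D, k)
true_ip = dot(x, y)
est_ip = dot(project(R, x), project(R, y))
rel_error = abs(est_ip - true_ip) / true_ip
```

The estimate is unbiased, and its variance shrinks as `k` grows; the abstract's proposal is to tighten such estimates further using marginal information about the vectors.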

A Convolutional Attention Model for Text Classification

Neural network models with attention mechanisms have proven effective on various tasks. However, there is little research on attention mechanisms for text classification, and existing attention models for text classification lack cognitive intuition and mathematical explanation. In this paper, we propose a new neural network architecture based on the attention model for text classification. In particular, we show mathematically that the convolutional neural network (CNN) is a reasonable model for extracting attention from text sequences. We then propose a novel attention model based on CNN and introduce a new network architecture that combines a recurrent neural network with our CNN-based attention model. Experimental results on five datasets show that our proposed models can accurately capture the salient parts of sentences and thereby improve the performance of text classification.

Jiachen Du, Lin Gui, Ruifeng Xu, Yulan He

Shortcut Sequence Tagging

Deep stacked RNNs are usually hard to train. Recent studies have shown that shortcut connections across different RNN layers bring substantially faster convergence. However, shortcuts increase the computational complexity of the recurrent computations. To reduce the complexity, we propose the shortcut block, which is a refinement of the shortcut LSTM blocks. Our approach is to replace the self-connected parts ($$c_t^l$$) with shortcuts ($$h_t^{l-2}$$) in the internal states. We present extensive empirical experiments showing that this design performs better than the original shortcuts. We evaluate our method on the CCG supertagging task, obtaining an 8% relative improvement over current state-of-the-art results.

Huijia Wu, Jiajun Zhang, Chengqing Zong

Machine Translation


Look-Ahead Attention for Generation in Neural Machine Translation

The attention model has become a standard component in neural machine translation (NMT); it guides the translation process by selectively focusing on parts of the source sentence when predicting each target word. However, we find that the generation of a target word depends not only on the source sentence but also heavily on the previously generated target words, especially distant words, which are difficult to model with recurrent neural networks. To solve this problem, we propose a novel look-ahead attention mechanism for generation in NMT, which aims at directly capturing the dependency relationships between target words. We further design three patterns to integrate our look-ahead attention into the conventional attention model. Experiments on NIST Chinese-to-English and WMT English-to-German translation tasks show that our proposed look-ahead attention mechanism achieves substantial improvements over state-of-the-art baselines.

Long Zhou, Jiajun Zhang, Chengqing Zong

Modeling Indicative Context for Statistical Machine Translation

Contextual information is very important for selecting the appropriate phrases in statistical machine translation (SMT). The selection of different target phrases is sensitive to different parts of the source context. Previous approaches based on either local or global contexts neglect the differing impact of different contexts and are not always effective at disambiguating translation candidates. In fact, the indicative contexts are expected to play a more important role in disambiguation. In this paper, we propose to leverage the indicative contexts for translation disambiguation. Our model assigns phrase pairs confidence scores based on different source contexts, which are then integrated into the SMT log-linear model to help select translation candidates. Experimental results show that our proposed method significantly improves translation performance on the NIST Chinese-to-English translation tasks compared with the state-of-the-art SMT baseline.

Shuangzhi Wu, Dongdong Zhang, Shujie Liu, Ming Zhou

A Semantic Concept Based Unknown Words Processing Method in Neural Machine Translation

The problem of unknown words in neural machine translation (NMT) not only affects the semantic integrity of the source sentences but also adversely affects the generation of the target sentences. Traditional methods usually replace unknown words according to the similarity of word vectors; these approaches have difficulty dealing with rare words and polysemous words. Therefore, this paper proposes a new method of unknown word processing in NMT based on the semantic concepts of the source language. First, we use the semantic concepts in a source language semantic dictionary to find candidate in-vocabulary words. Second, we propose a method that calculates semantic similarity by integrating the source language model and the semantic concept network, in order to obtain the best replacement word. Experiments on an English-to-Chinese translation task demonstrate that our proposed method achieves an improvement of more than 2.6 BLEU points over the conventional NMT method. Compared with the traditional method based on word vector similarity, our method also obtains an improvement of nearly 0.8 BLEU points.

Shaotong Li, Jinan Xu, Guoyi Miao, Yujie Zhang, Yufeng Chen

Research on Mongolian Speech Recognition Based on FSMN

Deep Neural Network (DNN) models have achieved significant results on the Mongolian speech recognition task; however, compared with Chinese, English, and other languages, there is still room for improvement. This paper presents the first application of the Feed-forward Sequential Memory Network (FSMN) to Mongolian speech recognition, modeling long-term dependencies in time series without recurrent feedback. Furthermore, by modeling the speaker in the feature space, we extract i-vector features and combine them with the Fbank features as the input to validate their effectiveness in Mongolian ASR tasks. Finally, discriminative training is conducted on the FSMN for the first time, using maximum mutual information (MMI) and state-level minimum Bayes risk (sMBR), respectively. The experimental results show that FSMN performs better than DNN in Mongolian ASR, and that by using i-vector features combined with Fbank features as FSMN input together with discriminative training, the word error rate (WER) is reduced by 17.9% relative to the DNN baseline.

Yonghe Wang, Feilong Bao, Hongwei Zhang, Guanglai Gao
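
For readers unfamiliar with the convention, a relative WER reduction such as the one reported above is normally computed against the baseline error rate. The abstract gives only the relative figure, so the absolute error rates are left symbolic here; $$\mathrm{WER}_{\mathrm{new}}$$ is a label introduced for illustration and denotes the FSMN system with i-vector features and discriminative training.

$$\text{relative WER reduction} = \frac{\mathrm{WER}_{\mathrm{DNN}} - \mathrm{WER}_{\mathrm{new}}}{\mathrm{WER}_{\mathrm{DNN}}} \approx 0.179$$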

Using Bilingual Segments to Improve Interactive Machine Translation

Recent research on machine translation has achieved substantial progress. However, machine translation results are still not error-free and need to be post-edited by a human translator (user) to produce correct translations. Interactive machine translation enhances human-computer collaboration by having a human validate the longest correct prefix in the suggested translation. In this paper, we refine the interactivity protocol to provide more natural collaboration. Users are allowed to validate bilingual segments, which give more direct guidance to the decoder and more hints to the users. In addition, validating bilingual segments is easier than identifying correct segments within incorrect translations. Experimental results with real users show that the new protocol improves translation efficiency and translation quality on three Chinese-English translation tasks.

Na Ye, Ping Xu, Chuang Wu, Guiping Zhang

Vietnamese Part of Speech Tagging Based on Multi-category Words Disambiguation Model

POS tagging is a fundamental task in Natural Language Processing that determines the quality of subsequent processing, and the ambiguity of multi-category words directly affects the accuracy of Vietnamese POS tagging. At present, POS tagging for English and Chinese achieves good results, but the accuracy of Vietnamese POS tagging still needs improvement. To address this problem, this paper proposes a novel method of Vietnamese POS tagging based on a multi-category words disambiguation model and a part-of-speech dictionary. A multi-category words dictionary and a non-multi-category words dictionary are generated from the Vietnamese dictionary and used to build a POS tagging corpus. From the corpus, 396,946 multi-category words are extracted; using statistical methods, a maximum entropy disambiguation model of Vietnamese part of speech is constructed, and the multi-category and non-multi-category words are tagged based on it. Experimental results show that the proposed method achieves higher accuracy than the existing model, which indicates that the method is feasible and effective.

Zhao Chen, Liu Yanchao, Guo Jianyi, Chen Wei, Yan Xin, Yu Zhengtao, Chen Xiuqin

NLP Applications


Unsupervised Automatic Text Style Transfer Using LSTM

In this paper, we focus on the problem of text style transfer, which is considered a subtask of paraphrasing. Most previous paraphrasing studies have focused on the replacement of words and phrases, which depends exclusively on the availability of parallel or pseudo-parallel corpora. However, existing methods cannot transfer the style of a text completely or work independently of pair-wise corpora. This paper presents a novel sequence-to-sequence (Seq2Seq) based deep neural network model that uses two switches with tensor product to control the style transfer in the encoding and decoding processes. Since massive parallel corpora are usually unavailable, the switches enable the model to conduct unsupervised learning, which, to the best of our knowledge, is an initial investigation into the task of text style transfer. The results are analyzed quantitatively and qualitatively, showing that the model can deal with paraphrasing at different text style transfer levels.

Mengqiao Han, Ou Wu, Zhendong Niu

Optimizing Topic Distributions of Descriptions for Image Description Translation

Image Description Translation (IDT) is the task of automatically translating image captions (i.e., image descriptions) into a target language. Current statistical machine translation (SMT) cannot perform as well as usual in this task because too little topic information is available for generating the translation model. In this paper, we focus on acquiring the possible contexts of the captions so as to generate topic models with rich and reliable information. An image matching technique is used to acquire the Wikipedia texts relevant to the captions, including the captions of similar Wikipedia images, the full articles that involve the images, and the paragraphs that semantically correspond to the images. On this basis, we go further and perform topic modelling using the obtained contexts. Our experimental results show that the obtained topic information enhances the SMT of image captions, yielding a performance gain of no less than 1% in BLEU score.

Jian Tang, Yu Hong, Mengyi Liu, Jiashuo Zhang, Jianmin Yao

Automatic Document Metadata Extraction Based on Deep Networks

Metadata extraction from academic papers is of great value to many applications such as scholar search and digital libraries. This task has attracted much attention from researchers in the past decades, and many template-based or statistical machine learning based (e.g., SVM, CRF) extraction methods have been proposed, yet the task remains challenging because of the variety and complexity of page layouts. To address this challenge, we introduce deep learning networks to this task, since deep learning has shown great power in many areas such as computer vision (CV) and natural language processing (NLP). First, we employ deep learning networks to model the image information and the text information of paper headers respectively, which allows our approach to perform metadata extraction with little information loss. Then we formulate the problem of metadata extraction from a paper header as two typical tasks from different areas: object detection in CV and sequence labeling in NLP. Finally, the two deep networks generated from these two tasks are combined to give the extraction results. Preliminary experiments show that our approach achieves state-of-the-art performance on several open datasets. At the same time, this approach can process both image data and text data, and does not require designing any classification features.

Runtao Liu, Liangcai Gao, Dong An, Zhuoren Jiang, Zhi Tang

A Semantic Representation Enhancement Method for Chinese News Headline Classification

Recently there has been increasing research interest in short texts such as news headlines. Due to the inherent sparsity of short text, current text classification methods perform badly when applied to the classification of news headlines. To overcome this problem, this paper proposes a novel method that enhances the semantic representation of headlines. First, we add keywords extracted from the most similar news to expand the word features. Second, we use a news domain corpus to pre-train the word embeddings so as to enhance the word representation. Moreover, the fastText classifier, which uses a linear method to classify text with high speed and accuracy, is adopted for news headline classification. On the Chinese news headline categorization task in NLPCC 2017, the proposed method achieved an F-measure of 83.1%, ranking first among 33 teams.

Zhongbo Yin, Jintao Tang, Chengsen Ru, Wei Luo, Zhunchen Luo, Xiaolei Ma

Abstractive Document Summarization via Neural Model with Joint Attention

Due to the difficulty of abstractive summarization, the great majority of past work on document summarization has been extractive. The recent success of the sequence-to-sequence framework has made abstractive summarization viable, and a set of recurrent neural network models based on the attention encoder-decoder have achieved promising performance on short-text summarization tasks. Unfortunately, these attention encoder-decoder models often suffer from two shortcomings: they generate repeated words or phrases, and they cannot deal with out-of-vocabulary words appropriately. To address these issues, in this work we propose to add an attention mechanism on the output sequence to avoid repetitive content and to use the subword method to deal with rare and unknown words. We applied our model to the public dataset provided by NLPCC 2017 Shared Task 3. The evaluation results show that our system achieved the best ROUGE performance among all the participating teams and is also competitive with some state-of-the-art methods.

Liwei Hou, Po Hu, Chao Bei

An Effective Approach for Chinese News Headline Classification Based on Multi-representation Mixed Model with Attention and Ensemble Learning

For NLPCC 2017 Shared Task 2, we propose an efficient approach to Chinese news headline classification based on a multi-representation mixed model with attention and ensemble learning. First, we model the headline semantics at both the character and the word level via Bi-directional Long Short-Term Memory (BiLSTM), using the concatenation of the output states from the hidden layer as the semantic representation. Meanwhile, we adopt an attention mechanism to highlight the key characters or words related to the classification decision, and obtain a preliminary test result. Then, for samples with a lower confidence level in the preliminary test result, we use ensemble learning, with voting among sub-models, to determine the final category. On the NLPCC 2017 official test set, the overall F1 score of our model reached 0.8176, ranking third.

Zhonglei Lu, Wenfen Liu, Yanfang Zhou, Xuexian Hu, Binyu Wang
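
The two-stage decision described in the abstract above (keep confident preliminary predictions, send low-confidence samples to a vote among sub-models) might look roughly like the sketch below. The threshold value, the tie-breaking rule, and the function name `ensemble_predict` are assumptions for illustration; the abstract does not specify them.

```python
from collections import Counter


def ensemble_predict(primary_label, primary_conf, sub_model_labels, threshold=0.9):
    """Return the primary label if confident, otherwise a majority vote.

    threshold and the tie-breaking rule are illustrative assumptions.
    """
    if primary_conf >= threshold or not sub_model_labels:
        return primary_label
    votes = Counter(sub_model_labels)
    top_count = max(votes.values())
    winners = [label for label, count in votes.items() if count == top_count]
    # On a tie, fall back to the primary model's prediction.
    return winners[0] if len(winners) == 1 else primary_label


# Hypothetical headline categories predicted by three sub-models.
confident = ensemble_predict("sports", 0.95, ["finance", "finance", "sports"])
uncertain = ensemble_predict("sports", 0.40, ["finance", "finance", "sports"])
tied = ensemble_predict("sports", 0.40, ["finance", "tech"])
```

In this toy run the confident sample keeps its preliminary label, the uncertain one takes the majority label from the sub-models, and the tied vote falls back to the preliminary label.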

NLP Fundamentals


Domain-Specific Chinese Word Segmentation with Document-Level Optimization

Previous studies normally formulate Chinese word segmentation as a character sequence labeling task and optimize the solution at the sentence level. In this paper, we address Chinese word segmentation as a document-level optimization problem. First, we apply a state-of-the-art approach, i.e., long short-term memory (LSTM), to perform character classification; then, we propose a global objective function on the basis of character classification and achieve global optimization via Integer Linear Programming (ILP). Specifically, we propose several kinds of global constraints in ILP to capture various kinds of segmentation knowledge, such as segmentation consistency and domain-specific regulations, to achieve document-level optimization, in addition to label transition knowledge to achieve sentence-level optimization. Empirical studies demonstrate the effectiveness of the proposed approach to domain-specific Chinese word segmentation.

Qian Yan, Chenlin Shen, Shoushan Li, Fen Xia, Zekai Du

Will Repeated Reading Benefit Natural Language Understanding?

Repeated Reading (re-read), which means reading a sentence twice to get a better understanding, has been applied to machine reading tasks. But there have not been rigorous evaluations showing its exact contribution to natural language processing. In this paper, we design four tasks, each representing a different class of NLP tasks: (1) part-of-speech tagging, (2) sentiment analysis, (3) semantic relation classification, and (4) event extraction. We take a bidirectional LSTM-RNN architecture as the standard model for these tasks. Based on the standard model, we add a repeated reading mechanism to make the model better “understand” the current sentence by reading it twice. We compare three different repeated reading architectures: (1) multi-level attention, (2) deep BiLSTM, and (3) multi-pass BiLSTM, enforcing an apples-to-apples comparison as much as possible. Our goal is to better understand in which situations the repeated reading mechanism can help NLP tasks, and which of the three architectures is more appropriate for repeated reading. We find that the repeated reading mechanism does improve performance on some tasks (sentiment analysis, semantic relation classification, event extraction) but not on others (POS tagging). We discuss how these differences may arise in each of the tasks. Then we give some suggestions for researchers to follow when choosing whether to use a repeated reading model, and which one to use, when faced with a new task. Our results thus shed light on the usage of repeated reading in NLP tasks.

Lei Sha, Feng Qian, Zhifang Sui

A Deep Convolutional Neural Model for Character-Based Chinese Word Segmentation

This paper proposes a deep convolutional neural model for character-based Chinese word segmentation. It first constructs position embeddings to encode unigram and bigram features that are directly related to single positions in the input sentence, and then adaptively builds up hierarchical position representations with a deep convolutional net. In addition, a multi-task learning strategy is used to further enhance this deep neural model by treating multiple supervised CWS datasets as different tasks. Experimental results show that our neural model outperforms the existing neural ones, and the model equipped with multi-task learning achieves state-of-the-art F-scores on standard benchmarks: 0.964 on the PKU dataset and 0.978 on the MSR dataset.

Zhipeng Xie, Junfeng Hu

Chinese Zero Pronoun Resolution: A Chain to Chain Approach

Chinese zero pronoun (ZP) resolution plays a critical role in discourse analysis. Different from traditional mention to mention approaches, this paper proposes a chain to chain approach to improve the performance of ZP resolution from three aspects. Firstly, consecutive ZPs are clustered into coreferential chains, each working as one independent anaphor as a whole. In this way, those ZPs far away from their overt antecedents can be bridged via other consecutive ZPs in the same coreferential chains and thus better resolved. Secondly, common noun phrases (NPs) are automatically grouped into coreferential chains using traditional approaches, each working as one independent antecedent candidate as a whole. Then, ZP resolution is made between ZP coreferential chains and common NP coreferential chains. In this way, the performance can be much improved due to the effective reduction of search space by pruning singletons and negative instances. Finally, additional features from ZP and common NP coreferential chains are employed to better represent anaphors and their antecedent candidates, respectively. Comprehensive experiments on the OntoNotes corpus show that our chain to chain approach significantly outperforms the state-of-the-art mention to mention approaches. To our knowledge, this is the first work to resolve zero pronouns in a chain to chain way.

Kong Fang, Zhou Guodong

Towards Better Chinese Zero Pronoun Resolution from Discourse Perspective

Chinese zero pronoun (ZP) resolution plays an important role in natural language understanding. This paper focuses on improving Chinese ZP resolution from a discourse perspective. In particular, various kinds of discourse information are employed in both stages of ZP resolution. During the ZP detection stage, we first propose an elementary discourse unit (EDU) based method to generate ZP candidates from a discourse perspective and then exploit relevant discourse context to help better identify ZPs. During the ZP resolution stage, we employ a tree-style discourse rhetorical structure to improve the resolution. Evaluation on OntoNotes shows the significant importance of discourse information to the performance of ZP resolution. To the best of our knowledge, this is the first work to improve Chinese ZP resolution from a discourse perspective.

Sheng Cheng, Kong Fang, Zhou Guodong

Neural Domain Adaptation with Contextualized Character Embedding for Chinese Word Segmentation

There is a large amount of annotated newswire data for Chinese word segmentation. However, some research shows that the performance of a segmenter decreases significantly when a model trained on newswire is applied to other domains, such as patents and literature. The same character appearing in different words may be in a different position and have a different meaning. In this paper, we introduce contextualized character embedding to neural domain adaptation for Chinese word segmentation. The contextualized character embedding aims to capture the dimensions of the embedding that are useful for the target domain. The experimental results show that the proposed method achieves competitive performance with previous Chinese word segmentation domain adaptation methods.

Zuyi Bao, Si Li, Sheng Gao, Weiran Xu

BiLSTM-Based Models for Metaphor Detection

Metaphor is a pervasive phenomenon in our daily use of natural language. Metaphor detection plays an important role in a variety of NLP tasks. Most existing approaches to this task rely heavily on human-crafted features built from linguistic knowledge resources, which greatly limits their applicability. This paper presents four BiLSTM-based models for metaphor detection. The first three models use a sub-sequence as the input to the BiLSTM network, each with a special kind of sub-sequence extracted from the input sentence. The last model is an ensemble model which aggregates the outputs from the first three models to get the final output. Experimental results show the effectiveness of our models.

Shichao Sun, Zhipeng Xie

Hyper-Gated Recurrent Neural Networks for Chinese Word Segmentation

Recently, recurrent neural networks (RNNs) have been increasingly used for Chinese word segmentation to model the contextual information without the limit of context window. In practice, two kinds of gated RNNs, long short-term memory (LSTM) and gated recurrent unit (GRU), are often used to alleviate the long dependency problem. In this paper, we propose the hyper-gated recurrent neural networks for Chinese word segmentation, which enhance the gates to incorporate the historical information of gates. Experiments on the benchmark datasets show that our model outperforms the baseline models as well as the state-of-the-art methods.

Zhan Shi, Xinchi Chen, Xipeng Qiu, Xuanjing Huang

Effective Semantic Relationship Classification of Context-Free Chinese Words with Simple Surface and Embedding Features

This paper describes the system we submitted to Task 1, i.e., Chinese Word Semantic Relation Classification, in NLPCC 2017. Given a pair of context-free Chinese words, the task is to predict their semantic relationship among four categories: Synonym, Antonym, Hyponym and Meronym. We design and investigate several surface features and embedding features, containing word-level and character-level embeddings, together with supervised machine learning methods to address this task. Officially released results show that our system ranks above average.

Yunxiao Zhou, Man Lan, Yuanbin Wu

Classification of Chinese Word Semantic Relations

Classification of word semantic relations is a challenging task in the field of natural language processing (NLP). In many practical applications, we need to distinguish words with different semantic relations. Much work relies on semantic resources such as Tongyici Cilin and HowNet, which are limited in quality and size. Recently, methods based on word embedding have received increasing attention for their flexibility and effectiveness in many NLP tasks. Furthermore, the word vector offset implies the semantic relation between words to some extent. This paper proposes a novel framework for identifying Chinese word semantic relations. We combine a semantic dictionary, word vectors and linguistic knowledge into a classification system. We conduct experiments on the Chinese Word Semantic Relation Classification shared task of NLPCC 2017. We rank No. 1 with an F1 value of 0.859. The results demonstrate that our method is effective.

Changliang Li, Teng Ma

Social Network


Identification of Influential Users Based on Topic-Behavior Influence Tree in Social Networks

Identifying influential users in social networks is of significant interest, as it can help improve the propagation of ideas or innovations. Various factors can affect the relationships and the formation of influence between users. Although many studies have addressed this domain, the effect of the correlation between messages and behaviors on measuring users’ influence in social networks has not received adequate attention. As a result, influential users cannot be accurately evaluated. Thus, we propose a topic-behavior influence tree algorithm that identifies influential users using six types of relationships in the following factors: message content, hashtag titles, retweets, replies, and mentions. By maximizing the number of affected users and minimizing the propagation path, we can improve the accuracy of identifying influential users. Experimental comparisons with state-of-the-art algorithms on various datasets, and a visualization on the TUAW dataset, validate the effectiveness of the proposed algorithm.

Jianjun Wu, Ying Sha, Rui Li, Qi Liang, Bo Jiang, Jianlong Tan, Bin Wang

Hierarchical Dirichlet Processes with Social Influence

The hierarchical Dirichlet process model has been successfully used for extracting the topical or semantic content of documents and other kinds of sparse count data. Along with the growth of social media, there have been simultaneous increases in the amounts of textual information and social structural information. To incorporate the information contained in these structures, in this paper, we propose a novel non-parametric model, social hierarchical Dirichlet process (sHDP), to solve the problem. We assume that the topic distributions of documents are similar to each other if their authors have relations in social networks. The proposed method is extended from the hierarchical Dirichlet process model. We evaluate the utility of our method by applying it to three data sets: papers from NIPS proceedings, a subset of articles from Cora, and microblogs with social network. Experimental results demonstrate that the proposed method can achieve better performance than state-of-the-art methods in all three data sets.

Jin Qian, Yeyun Gong, Qi Zhang, Xuanjing Huang

A Personality-Aware Followee Recommendation Model Based on Text Semantics and Sentiment Analysis

With the growing popularity of microblogging sites, followee recommendation plays an important role in information sharing over microblogging platforms. But as the popularity of microblogging sites increases, the difficulty of deciding whom to follow also increases. The interests and emotions of users often vary in their real lives. In contrast, some other features of microblogs never change and cannot describe users’ characteristics very well. To solve this problem, we propose a personality-aware followee recommendation model (PSER) based on text semantics and sentiment analysis, a novel personality-based followee recommendation scheme over microblogging systems based on user attributes and the big-five personality model. It quantitatively analyses the effects of user personality in followee selection by combining personality traits with the text semantics of microblogs and sentiment analysis of users. We conduct comprehensive experiments on a large-scale dataset collected from Sina Weibo, the most popular microblogging system in China. The results show that our scheme greatly outperforms existing schemes in terms of precision, and that an accurate appreciation of this model, tied to a quantitative analysis of personality, is crucial for selecting potential followees and thus enhances recommendation.

Pan Xiao, YongQuan Fan, YaJun Du

A Novel Community Detection Method Based on Cluster Density Peaks

Community structure is the basic structure of a social network. Nodes of a social network can naturally form communities. More specifically, nodes are densely connected with each other within the same community and sparsely connected between different communities. Community detection is an important task in understanding the features of networks and in graph analysis. At present there exist many community detection methods which aim to reveal the latent community structure of a social network, such as graph-based methods and heuristic-information-based methods. However, the approaches based on graph theory are complex and computationally expensive. In this paper, we extend the density concept and propose a density peaks based community detection method. This method first computes two metrics, the local density $$\rho$$ and the minimum climb distance $$\delta$$, for each node in a network, and then identifies the nodes with both higher $$\rho$$ and higher $$\delta$$ in local fields as community centers. Finally, the remaining nodes are assigned the corresponding community labels. The complete process of this method is simple but efficient. We test our approach on four classic baseline datasets. Experimental results demonstrate that the proposed method based on density peaks is more accurate and has low computational complexity.

Donglei Liu, Yipeng Su, Xudong Li, Zhendong Niu
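The three steps in the abstract above (compute density and climb distance, pick centers, label the remaining nodes) can be sketched on a small undirected graph. The abstract does not define the two metrics, so this sketch assumes that the local density is the node degree, that the minimum climb distance is the hop distance to the nearest node of higher density, and that centers are the nodes with the largest product of the two. It also assumes a connected graph; `density_peak_communities` is a hypothetical name.

```python
from collections import deque


def bfs_distances(graph, source):
    """Hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def density_peak_communities(graph, num_communities):
    """Label nodes of a connected, undirected adjacency-list graph."""
    if num_communities < 1:
        raise ValueError("num_communities must be at least 1")
    # Assumed local density: the degree of the node.
    rho = {node: len(neighbours) for node, neighbours in graph.items()}
    # Denser nodes first; ties are broken by node id for determinism.
    order = sorted(graph, key=lambda n: (-rho[n], n))
    rank = {node: i for i, node in enumerate(order)}
    delta, parent = {}, {}
    for node in graph:
        dist = bfs_distances(graph, node)
        higher = [m for m in dist if rank[m] < rank[node]]
        if higher:
            # Assumed climb distance: hops to the nearest denser node.
            parent[node] = min(higher, key=lambda m: (dist[m], rank[m]))
            delta[node] = dist[parent[node]]
        else:
            # The global density peak gets the largest distance it sees.
            parent[node] = None
            delta[node] = max(dist.values())
    peak = order[0]
    others = sorted((n for n in graph if n != peak),
                    key=lambda n: (-rho[n] * delta[n], rank[n]))
    centers = [peak] + others[:num_communities - 1]
    labels = {center: i for i, center in enumerate(centers)}
    # Each remaining node inherits the label of its nearest denser node.
    for node in order:
        if node not in labels:
            labels[node] = labels[parent[node]]
    return labels


# Two hubs (0 and 5) with leaves, joined through the path 0-4-9-5.
graph = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0, 9],
         5: [6, 7, 8, 9], 6: [5], 7: [5], 8: [5], 9: [4, 5]}
labels = density_peak_communities(graph, 2)
```

On this toy graph the two hubs become the centers and every other node follows its nearest denser neighbour, so no iterative optimization is needed.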

Text Mining


Review Rating with Joint Classification and Regression Model

Review rating is a sentiment analysis task which aims to predict a recommendation score for a review. Classification and regression models are the two major approaches to review rating, and each has its own characteristics and strengths. For instance, the classification model can flexibly utilize distinguished models in machine learning, while the regression model can capture the connections between different rating scores. In this study, we propose a novel approach to review rating, namely joint LSTM, which exploits the advantages of both review classification and regression models. Specifically, our approach employs an auxiliary Long Short-Term Memory (LSTM) layer to learn the auxiliary representation from the classification setting, and simultaneously joins the auxiliary representation into the main LSTM layer for the review regression setting. In the learning process, the auxiliary classification LSTM model and the main regression LSTM model are jointly learned. Empirical studies demonstrate that our joint learning approach performs significantly better than using either the classification or the regression model alone on review rating.

Jian Xu, Hao Yin, Lu Zhang, Shoushan Li, Guodong Zhou

Boosting Collective Entity Linking via Type-Guided Semantic Embedding

Entity Linking (EL) is the task of mapping mentions in natural-language text to their corresponding entities in a knowledge base (KB). Type modeling for mentions and entities could be beneficial for entity linking. In this paper, we propose a type-guided semantic embedding approach to boost collective entity linking. We use Bidirectional Long Short-Term Memory (BiLSTM) and a dynamic convolutional neural network (DCNN) to model the mention and the entity, respectively. Then, we build a graph with the semantic relatedness of mentions and entities for collective entity linking. Finally, we evaluate our approach by comparing it with state-of-the-art entity linking approaches over a wide range of very different data sets, such as TAC-KBP from 2009 to 2013, AIDA, DBPediaSpotlight, N3-Reuters-128, and N3-RSS-500. We also evaluate our approach on a Chinese corpus. The experiments reveal that modeling entity types can be very beneficial to entity linking.

Weiming Lu, Yangfan Zhou, Haijiao Lu, Pengkun Ma, Zhenyu Zhang, Baogang Wei

Biomedical Domain-Oriented Word Embeddings via Small Background Texts for Biomedical Text Mining Tasks

Most word embedding methods are general-purpose: they take a word as the basic unit and learn embeddings from the words’ external contexts. However, in the field of biomedical text mining, there are many biomedical entities and syntactic chunks which can enrich the semantic meaning of word embeddings. Furthermore, large-scale background texts for training word embeddings are not available in some scenarios. Therefore, we propose a novel biomedical domain-specific word embedding model based on maximum-margin (BEMM), which trains word embeddings using a small set of background texts and incorporates biomedical domain information. Experimental results show that our word embeddings overall outperform other general-purpose word embeddings on some biomedical text mining tasks.

Lishuang Li, Jia Wan, Degen Huang

Homographic Puns Recognition Based on Latent Semantic Structures

Homographic puns have a long history in human writing, being a common source of humor in jokes and other comedic works. It remains a difficult challenge to construct computational models to discover the latent semantic structures behind homographic puns so as to recognize puns. In this work, we design several latent semantic structures of homographic puns based on relevant theory and design sets of effective features of each structure, and then we apply an effective computational approach to identify homographic puns. Results on the SemEval2017 Task7 and Pun of the Day datasets indicate that our proposed latent semantic structures and features have sufficient effectiveness to distinguish between homographic pun and non-homographic pun texts. We believe that our novel findings will facilitate and stimulate the booming field of computational pun research in the future.

Yufeng Diao, Liang Yang, Dongyu Zhang, Linhong Xu, Xiaochao Fan, Di Wu, Hongfei Lin

Short Papers


Constructing a Chinese Conversation Corpus for Sentiment Analysis

Sentiment analysis plays an important role in many applications. This paper introduces our ongoing work on sentiment analysis of Chinese conversation. The main purpose is to construct a Chinese conversation corpus for sentiment analysis and provide a benchmark result on this corpus. To explore the effectiveness of machine learning based approaches for sentiment analysis on Chinese conversation, we first collected conversational data from some online English learning websites and our instant messages, and manually annotated it with three sentiment polarities and 22 fine-grained emotion classes. Then we applied multiple representative classification methods to evaluate the corpus. The evaluation results provide good suggestions for future research. We will release the corpus with gold standards publicly for research purposes.

Yujun Zhou, Changliang Li, Bo Xu, Jiaming Xu, Lei Yang, Bo Xu

Improving Retrieval Quality Using PRF Mechanism from Event Perspective

Pseudo-relevance feedback (PRF) has proven to be an effective mechanism for improving retrieval quality. However, a general PRF mechanism usually performs poorly when the retrieval objective is an event. Intuitively, an event-oriented query often involves special properties of the event object, which cannot easily be expressed with a keyword-based event query, and this might cause the feedback documents to deviate from the target event. In this paper, we propose an original, simple yet effective event-oriented PRF mechanism (EO-PRF) that takes into account the drawbacks of the PRF mechanism from an event perspective to improve retrieval quality. The EO-PRF mechanism makes use of extra event knowledge by integrating target event information with the initial query. Empirical evaluations based on the TREC-TS 2015 dataset and standard benchmarks, namely a mainstream non-feedback retrieval method and state-of-the-art pseudo feedback methods, demonstrate the effectiveness of the proposed EO-PRF mechanism in event-oriented retrieval.

Pengming Wang, Peng Li, Rui Li, Bin Wang

An Information Retrieval-Based Approach to Table-Based Question Answering

We propose a simple yet effective information retrieval based approach to answer complex questions with open domain web tables. Specifically, given a question and a table, we rank all table cells based on their representations, and select the cells of the highest ranking score as the answer. To represent a cell, we design rich features which leverage both the semantic information of the question and the structure information of the table. The experiments are conducted on WIKITABLEQUESTIONS dataset in which the questions have complex semantics. Compared to a semantic parsing based method, our approach improves the accuracy score by 6.03 points.

Junwei Bao, Nan Duan, Ming Zhou, Tiejun Zhao

Building Emotional Conversation Systems Using Multi-task Seq2Seq Learning

This paper describes our system designed for the NLPCC 2017 shared task on emotional conversation generation. Our model adopts a multi-task Seq2Seq learning framework to capture the textual information of the post sequence and generate responses for each type of emotion simultaneously. Evaluation results suggest that our model is competitive on emotional generation, achieving 0.9658 on average emotion accuracy. We also observe the emotional interaction in human conversation, and try to explain it as empathy at the psychological level. Finally, our model achieves 325 on total score and 0.545 on average score, and takes fourth place on total score.

Rui Zhang, Zhenyu Wang, Dongcheng Mai

NLPCC 2017 Shared Task Social Media User Modeling Method Summary by DUTIR_923

User attribute classification plays an important role in Internet advertising and public opinion monitoring, while user point-of-interest prediction helps online social media services create more value. In this paper, aiming to solve the user attribute classification tasks, we combine feature engineering and deep learning methods to reach a higher rank. The user attribute classification task is divided into two sub-tasks. In sub-task 1, we use the user’s POI (point of interest) check-in history and popular POI location information to predict the next POI that the user may visit in the future. Sub-task 2 requires predicting the gender of the user. We use the Stacking method for feature fusion to complete the feature extraction: based on the output of the logistic regression model, the features are sent to an XGBoost model to perform the prediction. In addition, we use a convolutional neural network model to mine the information in user tweets. Here we replace the conventional Max Pooling method with Attention Pooling in order to minimize the information lost in neural network training. Finally, two methods are given to produce a more accurate result.

Dongzhen Wen, Liang Yang, HengChao Li, Kai Guo, Peng Fei, HongFei Lin
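The difference between max pooling and attention pooling mentioned in the abstract above can be shown with a minimal pure-Python sketch. The dot-product scoring against a `context` vector is an assumption made for illustration; the abstract does not say how the attention weights are computed, and in a real network the context vector would be learned.

```python
import math


def max_pool(features):
    """Keep only the largest value in each feature dimension."""
    return [max(column) for column in zip(*features)]


def attention_pool(features, context):
    """Softmax-weighted sum of feature vectors.

    features: list of equal-length feature vectors, one per position.
    context: hypothetical query vector used to score each position.
    """
    scores = [sum(f * c for f, c in zip(row, context)) for row in features]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dims = len(features[0])
    return [sum(w * row[d] for w, row in zip(weights, features))
            for d in range(dims)]


features = [[1.0, 4.0], [3.0, 0.0]]
pooled_max = max_pool(features)
pooled_att = attention_pool(features, [0.0, 0.0])
```

Max pooling discards every value except the largest per dimension, while attention pooling lets all positions contribute in proportion to their weights; with a zero context vector the weights are uniform and the result is the mean.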

Babbling - The HIT-SCIR System for Emotional Conversation Generation

This paper describes the HIT-SCIR emotional response agent “Babbling”, submitted to the NLPCC 2017 Shared Task 4 on emotional conversation generation. Babbling consists of two parts: a rule based model for picking generic responses and a neural network based model. For the latter part, we apply the encoder-decoder [1] framework to generate an emotional response given the post and the assigned emotion label. To improve content coherency, we use LTS [2] to acquire a better first word. To generate responses with consistent emotions, we employ emotion embeddings to guide the emotionalizing process. To produce more content-coherent and emotion-consistent responses, we include the attention mechanism [3] and its extension, multi-hop attention (MTA) [4]. The rule based part and the neural network based part are ranked second and fifth, respectively, according to the total score.

Jianhua Yuan, Huaipeng Zhao, Yanyan Zhao, Dawei Cong, Bing Qin, Ting Liu

Unsupervised Slot Filler Refinement via Entity Community Construction

Given an entity (query), slot filling aims to find and extract the values (slot fillers) of its specific attributes (slot types) from large-scale document collections. Most existing work on slot filling models slot fillers separately and only considers direct relations between slot fillers and the query, ignoring other slot fillers in context. In this paper we propose an unsupervised slot filler refinement approach via entity community construction to filter out the incorrect fillers collaboratively. The community-based framework mainly consists of (1) a filler community generated by point-wise mutual information-based hierarchical clustering, and (2) a query community constructed by a co-occurrence graph model.

Zengzhuang Xu, Rui Song, Bowei Zou, Yu Hong
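Point-wise mutual information, which the abstract above uses as the basis for clustering fillers, is the standard quantity PMI(x, y) = log(p(x, y) / (p(x) p(y))). The sketch below estimates it from document-level co-occurrence counts. Counting co-occurrence per document, the function name `pmi_table`, and the toy filler names are assumptions for illustration; the abstract does not describe how the paper estimates the probabilities.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(documents):
    """PMI for every pair of items that co-occur in at least one document.

    documents: list of lists of filler strings (hypothetical input format).
    Returns a dict keyed by alphabetically sorted pairs.
    """
    n_docs = len(documents)
    single = Counter()
    pair = Counter()
    for doc in documents:
        items = sorted(set(doc))  # count each item once per document
        single.update(items)
        pair.update(combinations(items, 2))
    return {
        (a, b): math.log((count / n_docs)
                         / ((single[a] / n_docs) * (single[b] / n_docs)))
        for (a, b), count in pair.items()
    }


docs = [["f1", "f2"], ["f1", "f2"], ["f1", "f3"], ["f3"]]
table = pmi_table(docs)
```

A positive value means two fillers co-occur more often than independence would predict, which is the kind of signal a hierarchical clustering step could use as a similarity measure.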

Relation Linking for Wikidata Using Bag of Distribution Representation

Knowledge graphs (KGs) are essential repositories of structured and semi-structured knowledge which benefit various NLP applications. To utilize the knowledge in KGs to help machines better understand plain texts, one needs to bridge the gap between knowledge and texts. In this paper, a Relation Linking System for Wikidata (RLSW) is proposed to link the relations in KGs to plain texts. The proposed system uses the knowledge in Wikidata as seeds and clusters relation mentions in text with a novel phrase similarity algorithm. To enhance the system’s ability to handle unseen expressions and to make use of the location information of words to reduce the false positive rate, a bag of distribution pattern modeling method is proposed. Experimental results show that the proposed approach improves on traditional methods, including word based pattern and syntax feature enriched systems such as OLLIE.

Xi Yang, Shiya Ren, Yuan Li, Ke Shen, Zhixing Li, Guoyin Wang

Neural Question Generation from Text: A Preliminary Study

Automatic question generation aims to generate questions from a text passage where the generated questions can be answered by certain sub-spans of the given passage. Traditional methods mainly use rigid heuristic rules to transform a sentence into related questions. In this work, we propose to apply the neural encoder-decoder model to generate meaningful and diverse questions from natural language sentences. The encoder reads the input text and the answer position, to produce an answer-aware input representation, which is fed to the decoder to generate an answer focused question. We conduct a preliminary study on neural question generation from text with the SQuAD dataset, and the experiment results show that our method can produce fluent and diverse questions.

Qingyu Zhou, Nan Yang, Furu Wei, Chuanqi Tan, Hangbo Bao, Ming Zhou

Answer Selection in Community Question Answering by Normalizing Support Answers

Answer selection in community question answering (cQA) is a common task in natural language processing. Recent progress focuses not only on pure question-answer (QA) matching but also on support answers [4]. In this paper, we argue that performance can drop dramatically if noisy support answers are selected. To tackle this issue, we propose a novel way to leverage the contributions of support answers: the match scores are first normalized by the correlations between the question and the corresponding similar questions, such that the negative effect of the noisy answers can be reduced. The model applies word-to-word attention to improve QA matching and employs cosine similarity as the normalization factor for support answers. Compared with previous work, experiments on the Yahoo! Answers L4 dataset show that our model achieves superior P@1 and MRR results.

Zhihui Zheng, Daohe Lu, Qingcai Chen, Haijun Yang, Yang Xiang, Youcheng Pan, Wei Zhong
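One way to read the normalization described in the abstract above is as a similarity-weighted average: each support answer's match score counts in proportion to the cosine similarity between the original question and the similar question it came from. The sketch below implements that reading. The weighted-average form, the clipping of negative similarities, and the name `support_score` are assumptions; the abstract does not give the exact formula.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def support_score(question_vec, supports):
    """Similarity-normalized contribution of the support answers.

    supports: list of (similar_question_vec, match_score) pairs, where
    match_score is how well that support answer matches the candidate.
    """
    # Assumed choice: negative similarities contribute nothing.
    weights = [max(cosine(question_vec, q_vec), 0.0) for q_vec, _ in supports]
    total = sum(weights)
    if total == 0.0:
        return 0.0
    return sum(w * score for w, (_, score) in zip(weights, supports)) / total


question = [1.0, 0.0]
clean = support_score(question, [([1.0, 0.0], 0.9), ([0.0, 1.0], 0.1)])
mixed = support_score(question, [([1.0, 0.0], 0.8), ([2.0, 0.0], 0.4)])
```

In the first call the support answer attached to an unrelated question receives zero weight, so its low match score does not drag the result down, which is the noise-reduction effect the abstract describes.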

An Empirical Study on Incorporating Prior Knowledge into BLSTM Framework in Answer Selection

Deep learning has become the state-of-the-art solution to answer selection. One distinguishing advantage of deep learning is that it avoids manual engineering via its end-to-end structure. But in the literature, substantial practices of introducing prior knowledge into the deep learning process are still observed to have a positive effect. Following this thread, this paper investigates the contribution of incorporating different prior knowledge into deep learning via an empirical study. Under a typical BLSTM framework, 27 features at 3 levels are jointly integrated into the answer selection task. Experimental results confirm that incorporating prior knowledge can enhance the model, and that different levels of linguistic features can improve the performance consistently.

Yahui Li, Muyun Yang, Tiejun Zhao, Dequan Zheng, Sheng Li

Enhanced Embedding Based Attentive Pooling Network for Answer Selection

Document-based Question Answering tries to rank the candidate answers for given questions, which requires evaluating the matching score between the question sentence and the answer sentence. Existing works usually utilize a convolutional neural network (CNN) to adaptively learn the latent matching pattern between the question/answer pair. However, a CNN can only perceive the order of a word within a local window, while the global order of the windows is ignored due to the window-sliding operation. In this report, we design an enhanced CNN with extended order information (e.g., overlapping position and global order) added to the input embedding; this rich representation makes it possible to learn order-aware matching in a CNN. Combined with standard convolutional paradigms such as attentive pooling, pair-wise training and dynamic negative sampling, this end-to-end CNN achieves good performance on the DBQA task of NLPCC 2017 without any extra features.

Zhan Su, Benyou Wang, Jiabin Niu, Shuchang Tao, Peng Zhang, Dawei Song

A Retrieval-Based Matching Approach to Open Domain Knowledge-Based Question Answering

In this paper, we propose a retrieval and knowledge-based question answering system for the competition task in NLPCC 2017. Regarding the question side, our system uses a ranking model to score candidate entities to detect a topic entity from questions. Then similarities between the question and candidate relation chains are computed, based on which candidate answer entities are ranked. By returning the highest scored answer entity, our system finally achieves the F1-score of 41.96% on test set of NLPCC 2017. Our current system focuses on solving single-relation questions, but it can be extended to answering multiple-relation questions.

Han Zhang, Muhua Zhu, Huizhen Wang

Improved Compare-Aggregate Model for Chinese Document-Based Question Answering

Document-based question answering (DBQA) is a sub-task of question answering. It aims to measure the matching relation between questions and answers, which can be regarded as a sentence matching problem. In this paper, we introduce a Compare-Aggregate architecture to handle word-level comparison and aggregation. To deal with the noisy information in the traditional attention mechanism, a k-top attention mechanism is proposed to filter out irrelevant words. Subsequently, we propose a combined model that merges the matching relation learned by the Compare-Aggregate model with shallow features to generate the final matching score. We evaluate our model on the Chinese Document-based Question Answering (DBQA) task. The experimental results show the effectiveness of our proposed improvements. Our final combined model achieves second place on the DBQA task of the NLPCC-ICCPOL 2017 Shared Task. The paper provides the technical details of the proposed algorithm.

Ziliang Wang, Weijie Bian, Si Li, Guang Chen, Zhiqing Lin
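The abstract above does not define the k-top attention mechanism in detail. One plausible reading, given here only as a minimal sketch, is a softmax restricted to the k highest-scoring words, with every other word receiving zero weight. The function names `top_k_attention` and `attend` and the input layout are assumptions made for illustration, not the authors' implementation.

```python
import math

def top_k_attention(scores, k):
    """Softmax over only the k highest scores; all other positions get weight 0.

    Illustrative assumption: the paper's exact filtering rule may differ.
    """
    if not 1 <= k <= len(scores):
        raise ValueError("k must be between 1 and len(scores)")
    # Indices of the k largest raw scores (earlier positions win ties).
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    max_kept = max(scores[i] for i in keep)  # subtracted for numerical stability
    exp = {i: math.exp(scores[i] - max_kept) for i in keep}
    total = sum(exp.values())
    return [exp[i] / total if i in exp else 0.0 for i in range(len(scores))]

def attend(weights, vectors):
    """Weighted sum of word vectors under the given attention weights."""
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
```

Words outside the top k contribute nothing to the attended vector, which is how such a filter would suppress irrelevant words.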

Transfer Deep Learning for Low-Resource Chinese Word Segmentation with a Novel Neural Network

Recent studies have shown the effectiveness of neural networks for Chinese word segmentation. However, these models rely on large-scale data and are less effective on low-resource datasets because of insufficient training data. We propose a transfer learning method to improve low-resource word segmentation by leveraging high-resource corpora. First, we train a teacher model on high-resource corpora and then use the learned knowledge to initialize a student model. Second, a weighted data similarity method is proposed to train the student model on low-resource data. Experimental results show that our method significantly improves performance on low-resource datasets, by 2.3% and 1.5% F-score on the PKU and CTB datasets, respectively. Furthermore, it achieves state-of-the-art results: 96.1% and 96.2% F-score on the PKU and CTB datasets.

Jingjing Xu, Shuming Ma, Yi Zhang, Bingzhen Wei, Xiaoyan Cai, Xu Sun

AHNN: An Attention-Based Hybrid Neural Network for Sentence Modeling

Deep neural networks (DNNs) are powerful models that have achieved excellent performance in many fields, especially in Natural Language Processing (NLP). Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are the two mainstream DNN architectures and are widely explored for NLP tasks. However, these two types of model work in totally different ways: a CNN is supposed to be good at capturing local features, while an RNN is considered able to summarize global information. In this paper, we combine the strengths of both architectures and propose a hybrid model, AHNN (Attention-based Hybrid Neural Network), and use it for sentence modeling. AHNN utilizes an attention-based bidirectional dynamic LSTM to obtain a better representation of global sentence information, then uses a parallel convolutional layer with three different filter sizes and a max pooling layer to obtain significant local information. Finally, the two results are fed together into an expert layer to obtain the output. Experiments show that AHNN is able to summarize the context of a sentence and capture its significant local features, both of which are important for sentence modeling. We evaluate AHNN on the NLPCC News Headline Categorization test set and achieve a test accuracy of 0.8098, which is competitive with the other teams in this task.

Xiaomin Zhang, Li Huang, Hong Qu

Improving Chinese-English Neural Machine Translation with Detected Usages of Function Words

One of the difficulties in Chinese-English machine translation is that the grammatical meaning expressed by morphology or syntax in target translations is usually determined by Chinese function words or word order. To address this issue, we develop classifiers that automatically detect usages of common Chinese function words based on the Chinese Function usage Knowledge Base (CFKB), and propose a function word usage embedding model to incorporate the detection results into neural machine translation (NMT). Experiments on the NIST Chinese-English translation task demonstrate that the proposed method obtains significant improvements in the quality of both translation and word alignment over the NMT baseline.

Kunli Zhang, Hongfei Xu, Deyi Xiong, Qiuhui Liu, Hongying Zan

Using NMT with Grammar Information and Self-taught Mechanism in Translating Chinese Symptom and Disease Terminologies

Neural Machine Translation (NMT) based on the encoder-decoder architecture is a proposed approach to machine translation, and has achieved promising results comparable to those of traditional approaches such as statistical machine translation. However, an NMT system usually needs a large amount of parallel data to train the model, which is difficult to obtain in some specific areas, e.g. symptom and disease terminologies. In this paper, we propose two approaches that make full use of source-side monolingual data to make up for the lack of parallel corpora. The first approach uses the part-of-speech tags of source-side symptom and disease terminologies to obtain their grammar information. The second approach employs a self-taught learning algorithm to obtain more synthetic parallel data. The proposed NMT model obtains significant improvements in translating symptom and disease terminologies from Chinese into English, gaining up to 2.13 BLEU points over the NMT baseline system.

Lu Zeng, Qi Wang, Lingfei Zhang

Learning Bilingual Lexicon for Low-Resource Language Pairs

Learning a bilingual lexicon from monolingual data is a novel idea in natural language processing that can benefit many low-resource language pairs. In this paper, we present an approach for obtaining a bilingual lexicon from monolingual data. Our method requires only a small seed bilingual lexicon, and we use Canonical Correlation Analysis to construct a shared latent space that explains how two monolingual embeddings are linked. Experimental results show that a bilingual lexicon of considerable precision and size can be learned from Chinese-Uyghur and Chinese-Kazakh monolingual data.

ShaoLin Zhu, Xiao Li, YaTing Yang, Lei Wang, ChengGang Mi

Exploring the Impact of Linguistic Features for Chinese Readability Assessment

Readability assessment plays an important role in selecting proper reading materials for language learners, and is applicable to many NLP tasks such as text simplification and document summarization. In this study, we designed 100 factors to systematically evaluate the impact of four levels of linguistic features (shallow, POS, syntactic, discourse) on predicting text difficulty for L1 Chinese learners. We further selected 22 significant features with regression. Our experimental results show that the 100-feature model and the 22-feature model both achieve the same predictive accuracy as the BOW baseline for the majority of the text difficulty levels, and significantly better accuracy than the baseline for the others. Using 18 of the 22 features, we derived one of the first readability formulas for the contemporary simplified Chinese language.

Xinying Qiu, Kebin Deng, Likun Qiu, Xin Wang

A Semantic-Context Ranking Approach for Community-Oriented English Lexical Simplification

Lexical simplification under a given vocabulary scope for specified communities would potentially benefit many applications such as second language learning and cognitive disabilities education. This paper proposes a new concise ranking strategy that incorporates semantics and context for lexical simplification within a restricted scope. Our approach utilizes WordNet-based similarity calculation for semantic expansion and ranking. It then uses Part-of-Speech tagging and the Google 1T 5-gram corpus for context-based ranking. Our experiments are based on publicly available data sets. Compared with baseline methods including Google Word2vec and the four-step method, our approach achieves the best F1 measure of 0.311 and Oot F1 measure of 0.522, demonstrating its effectiveness in combining semantics and context for English lexical simplification.

Tianyong Hao, Wenxiu Xie, John Lee

A Multiple Learning Model Based Voting System for News Headline Classification

This paper presents the framework and methodologies of the Soochow University team’s news headline classification system for NLPCC 2017 shared task 2. The submitted systems aim to automatically classify each Chinese news headline into one or more predefined categories. We develop a voting system based on convolutional neural networks (CNN), gated recurrent units (GRU), and support vector machines (SVM). Experimental results show that our method achieves a Macro-F1 score of about 81%, outperforming most strong competitors and ranking 6th among the 32 participants.

Fenhong Zhu, Xiaozheng Dong, Rui Song, Yu Hong, Qiaoming Zhu
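The abstract above does not say how the votes of the CNN, GRU and SVM models are combined. As a minimal sketch under that caveat, hard majority voting with a fixed tie-break order could look as follows; the function `majority_vote`, the model keys and the tie-break rule are hypothetical choices for illustration.

```python
from collections import Counter

def majority_vote(predictions, priority):
    """Return the label most models agree on; break ties by model priority.

    predictions: dict mapping a model name to its predicted label.
    priority: model names, most trusted first (a hypothetical tie-break rule).
    """
    counts = Counter(predictions.values())
    best = max(counts.values())
    tied = {label for label, n in counts.items() if n == best}
    for model in priority:
        if predictions.get(model) in tied:
            return predictions[model]
    raise ValueError("priority must name at least one model in predictions")
```

With three voters, a tie only occurs when all three models disagree, in which case the label of the highest-priority model is returned.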

Extractive Single Document Summarization via Multi-feature Combination and Sentence Compression

In this paper, we attempt to extract and generate a short summary of a news article within a length limit of 60 Chinese characters. First, we preprocess the news article by segmenting sentences and words, and then extract four kinds of central words to form a keyword dictionary based on the parsing tree. After that, four kinds of features, i.e. the sentence weight, the sentence similarity, the sentence position and the sentence length, are employed to measure the significance of each sentence. Finally, we extract two sentences in descending order of significance score and compress them to obtain the summary of each news article. This approach analyzes the grammatical elements of the original sentences in order to generate compression rules, and trims syntactic elements according to their parsing trees. The evaluation results show that our system is efficient in Chinese news summarization.

Maofu Liu, Yan Yu, Qiaosong Qi, Huijun Hu, Han Ren
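The sentence selection step described above can be sketched as a weighted combination of the four features followed by picking the two top-scoring sentences. The abstract does not state how the features are combined, so the linear combination, the feature names and the weights below are assumptions for illustration only.

```python
def significance(features, weights):
    """Linear combination of feature values (the combination rule is assumed)."""
    return sum(weights[name] * features[name] for name in weights)

def top_sentences(sentences, weights, n=2):
    """Return the texts of the n sentences with the highest significance score."""
    ranked = sorted(
        sentences,
        key=lambda s: significance(s["features"], weights),
        reverse=True,
    )
    return [s["text"] for s in ranked[:n]]
```

Each sentence is expected as a dict with a `text` string and a `features` dict keyed by the same names as `weights`; the selected sentences would then be passed on to the compression step.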

A News Headlines Classification Method Based on the Fusion of Related Words

Short text classification is challenging because each text contains only a few words, usually fewer than 20, which leads to feature sparsity. In this paper, we propose a method of extending short text to cope with the problem of data sparsity. Additionally, we combine the extension of the short text with the word vector of each word in the short text, trained by the word2vec model on a large-scale corpus, to form a new representation. Furthermore, the new representation works as input for the neural bag-of-words (NBOW) model. We evaluate this method on NLPCC 2017 Evaluation Task 2. The experimental results show that short text extension with the NBOW model outperforms the baselines and achieves excellent performance on the news headline classification task.

Yongguan Wang, Binjie Meng, Pengyuan Liu, Erhong Yang

Resolving Chinese Zero Pronoun with Word Embedding

Elliptical sentences are frequently seen in Chinese, especially in particular situations such as dialogues, which makes it challenging to understand their specific semantics. Chinese zero pronoun resolution, which recovers a noun phrase in the elliptical position, is an effective method to help machines understand natural language. Traditional methods use features that are manually extracted from syntactic parsing trees. However, the long running time and the inaccuracy of automatic parsing algorithms have a bad influence on practical applications. In this work, we propose a new method based on a long short-term memory network that calculates dense vector representations for mention pairs without using features from syntactic parsing trees. These representations, which capture significant semantics for zero pronoun resolution, are built on distributed representations of words in the surrounding contexts and candidate antecedents. Our method reduces the manual work of extracting features from parsing trees and improves the F1-score of the Chinese zero pronoun resolution system. Experimental results on the OntoNotes 5.0 Chinese dataset show that our method achieves better performance than the state-of-the-art method.

Bingquan Liu, Xinkai Du, Ming Liu, Chengjie Sun, Guidong Zheng, Chao Zou

Active Learning for Chinese Word Segmentation on Judgements

This paper aims to perform the task of Chinese Word Segmentation on judgements. For this task, the main challenge is the lack of an annotated corpus. To alleviate this challenge, this paper proposes an active learning approach. Specifically, starting from a few initial annotated samples, the approach first selects some informative characters and then selects the context around these characters for annotation. When selecting informative characters, the approach considers not only the uncertainty of a sample but also its redundancy. Furthermore, this paper adopts a local annotation strategy, which selects substrings around the informative characters rather than whole sentences and thus further reduces the annotation effort. The empirical study demonstrates that the proposed approach effectively reduces the annotation cost and performs better than other baseline sample selection strategies under the same scale of annotation.

Qian Yan, Limin Wang, Shoushan Li, Huan Liu, Guodong Zhou
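The selection idea in the abstract above (uncertainty combined with redundancy, then local annotation) can be illustrated with a small sketch. The abstract does not give the scoring formulas, so the entropy-based uncertainty, the geometric discount for repeated characters, and the names `select_informative` and `local_context` are all assumptions made for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted tag distribution (uncertainty proxy)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_informative(chars, tag_probs, budget, penalty=0.5):
    """Greedily pick positions with high entropy, discounting repeated characters.

    Each time a character has already been chosen, further occurrences of it
    are multiplied by `penalty` (an assumed way to model redundancy).
    """
    chosen, seen = [], {}
    candidates = list(range(len(chars)))
    for _ in range(min(budget, len(chars))):
        best = max(
            candidates,
            key=lambda i: entropy(tag_probs[i]) * penalty ** seen.get(chars[i], 0),
        )
        chosen.append(best)
        candidates.remove(best)
        seen[chars[best]] = seen.get(chars[best], 0) + 1
    return chosen

def local_context(chars, index, radius=2):
    """Substring around a selected character, instead of the whole sentence."""
    start = max(0, index - radius)
    return "".join(chars[start:index + radius + 1])
```

Only the substrings returned by `local_context` would be sent to annotators, which is what keeps the annotation cost below that of labeling full sentences.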

Study on the Chinese Word Semantic Relation Classification with Word Embedding

This paper describes our solution to the NLPCC 2017 shared task on Chinese word semantic relation classification. Our proposed method won second place in this task. On the test set, our method achieves 76.8% macro F1 on the four types of semantic relation, i.e., synonym, antonym, hyponym, and meronym. In our experiments, we try basic word embeddings, linear regression and convolutional neural networks (CNNs) with pre-trained word embeddings. The experimental results show that CNNs perform better than the other methods. We also find that the proposed method can achieve competitive results with a small training corpus.

E. Shijia, Shengbin Jia, Yang Xiang
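For readers unfamiliar with the macro F1 metric reported above, the following sketch shows the standard definition: per-class F1 scores averaged with equal weight per class. The function name and argument layout are illustrative, and the shared task's official scorer may handle edge cases differently.

```python
def macro_f1(gold, predicted, labels):
    """Unweighted mean of per-class F1 scores over the given labels."""
    scores = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)
```

Because every class counts equally, a system that does poorly on one of the four relation types is penalized even if that type is rare in the data.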

HDP-TUB Based Topic Mining Method for Chinese Micro-blogs

Topic models are important tools for mining the latent topics of text. However, most existing topic models are derived from latent Dirichlet allocation (LDA), which requires the number of topics to be specified in advance. In order to mine the topics of Chinese micro-blogs automatically, we propose a nonparametric Bayesian model, named the HDP-TUB model, which is derived from the hierarchical Dirichlet process (HDP). In this model, we assume non-exchangeability of data, and use temporal information, user information and theme tags (TUB) to solve the sparsity problem caused by short text. To construct the HDP-TUB model, the CRF (Chinese Restaurant Franchise) method is extended to integrate the temporal information, user information and topic tag information. Experiments show that the HDP-TUB model outperforms the LDA model and the HDP model in terms of perplexity and the difference between topics.

Yaorong Zhang, Bo Yang, Li Yi, Yi Liu, Yangsen Zhang

Detecting Deceptive Review Spam via Attention-Based Neural Networks

In recent years, the influence of deceptive review spam on purchasing decisions, election choices and product design has further strengthened, and detecting deceptive review spam has attracted more and more researchers. Existing work makes every effort to explore effective linguistic and behavioral features, and utilizes off-the-shelf classification algorithms to detect spam. However, such models usually yield compromised results when trained on the whole dataset: they fail to distinguish whether a review is linguistically suspicious, behaviorally suspicious, or both. In this paper, we propose an attention-based neural network that detects deceptive review spam by using linguistic and behavioral features in a differentiated way. Experimental results on real commercial public datasets show the effectiveness of our model over the state-of-the-art methods.

Xuepeng Wang, Kang Liu, Jun Zhao

A Tensor Factorization Based User Influence Analysis Method with Clustering and Temporal Constraint

User influence analysis in social media has attracted tremendous interest from both sociology and social data mining, and has recently become a hot topic. However, most approaches ignore the temporal characteristics hidden behind the comments and articles of users. In this paper, we introduce a Tensor Factorization based on User Cluster (TFUC) model to predict the ranking of users’ influence in micro blogs. Initially, TFUC obtains a cluster of influential users with a neural network clustering algorithm. Then, TFUC chooses influential users to construct the tensor model. Finally, a time matrix constrains the CP decomposition in TFUC, and users are ranked by the influence scores obtained from the predicted tensor. Our experimental results show that the MAP of TFUC is at least 3.4% higher than that of existing influence models.

Xiangwen Liao, Lingying Zhang, Lin Gui, Kam-Fai Wong, Guolong Chen

Cross-Lingual Entity Matching for Heterogeneous Online Wikis

Knowledge bases play an increasingly important role in many applications. However, many knowledge bases focus mainly on English knowledge, and contain only a little knowledge for low-resource languages (LLs). If we can map the entities in LLs to those in high-resource languages (HLs), much knowledge, such as relations between entities, can be transferred from HLs to LLs. In this paper, we propose an efficient and effective Cross-Lingual Entity Matching approach (CL-EM) that enriches the existing cross-lingual links through a learning-to-rank framework with learned language-independent features, including cross-lingual topic features and document embedding features. In the experiments, we verified our approach on the existing cross-lingual links between Chinese Wikipedia and English Wikipedia by comparing it with other state-of-the-art approaches. In addition, we discovered 141,754 new cross-lingual links between Baidu Baike and English Wikipedia, which almost doubles the number of existing cross-lingual links.

Weiming Lu, Peng Wang, Huan Wang, Jiahui Liu, Hao Dai, Baogang Wei

A Unified Probabilistic Model for Aspect-Level Sentiment Analysis

Aspect-level sentiment analysis aims to delve deep into opinionated text to discover sentiments expressed about specific aspects of the discussed topics. Aspect detection is often achieved by topic modelling. Probabilistic modelling has been one of the more popular approaches for both topic modelling and sentiment analysis. Incorporating Part-Of-Speech (POS) information and modelling the emphasis placed on each topic have been shown to improve the quality of such models. Previous approaches to aspect-level sentiment analysis typically model only some of these components or rely on external tools or resources to provide some of the information. In this paper, we develop a new, unified probabilistic model that can capture topics, topic weights, syntactic classes, and sentiment levels from unstructured text without relying on any external sources of information. Our solution builds on the ideas of the existing probabilistic models but generalizes them into a unified framework with some novel extensions.

Daniel Stantic, Fei Song

An Empirical Study on Learning Based Methods for User Consumption Intention Classification

Recently, a huge amount of text expressing user consumption intentions has been published on social media platforms such as Twitter and Weibo, and classifying the intentions of users has great value for both scientific research and commercial applications. User consumption analysis in social media concerns text content representation and intention classification, whose solutions mainly rely on traditional machine learning and emerging deep learning techniques. In this paper, we conduct a comprehensive empirical study of the user intention classification problem with learning-based techniques using different text representation methods. We compare different machine learning methods, deep learning methods and various combinations of them for tweet text representation and users’ consumption intention classification. The experimental results show that LSTM models with pre-trained word vector representations achieve the best classification performance.

Mingzhou Yang, Daling Wang, Shi Feng, Yifei Zhang

Overview of the NLPCC 2017 Shared Task: Chinese Word Semantic Relation Classification

Word semantic relation classification is a challenging task for natural language processing, so we organized a semantic campaign on this task at NLPCC 2017. The dataset covers four kinds of semantic relations (synonym, antonym, hyponym and meronym), with 500 word pairs per category. In total, 17 teams submitted their results. In this paper, we describe the data construction and experimental setting, analyze the evaluation results, and briefly introduce some of the participating systems.

Yunfang Wu, Minghua Zhang

Overview of the NLPCC 2017 Shared Task: Emotion Generation Challenge

It has been a long-term goal for AI to perceive and express emotions. Inspired by the Emotional Chatting Machine [1], we propose a challenge task to investigate how well a chatting machine can express emotion by generating a textual response to an input post. The task is defined as follows: given a post and a pre-specified emotion class for the generated response, the task is to generate a response that is appropriate in both topic and emotion. This challenge attracted more than 40 registered teams, of which 10 finally submitted results. In this overview paper, we report the details of this challenge, including task definition, data preparation, annotation schema, submission statistics, and evaluation results.

Minlie Huang, Zuoxian Ye, Hao Zhou

Overview of the NLPCC-ICCPOL 2017 Shared Task: Social Media User Modeling

In this paper, we give the overview of the social media user modeling shared task in the NLPCC-ICCPOL 2017. We first review the background of social media user modeling, and then describe two social media user modeling tasks in this year’s NLPCC-ICCPOL, including the construction of the benchmark datasets and the evaluation metrics. The evaluation results of submissions from participating teams are presented in the experimental part.

Fuzheng Zhang, Defu Lian, Xing Xie

Overview of the NLPCC 2017 Shared Task: Single Document Summarization

In this paper, we give an overview of the shared task at the 6th CCF Conference on Natural Language Processing & Chinese Computing (NLPCC 2017): single document summarization. Document summarization aims at conveying important information and generating significantly shorter summaries of original long documents. This task focused on summarizing news articles and released a large corpus, the TTNews corpus (available for download), which was collected for single document summarization in Chinese. In this paper, we introduce the task, the corpus, the participating teams and the evaluation results.

Lifeng Hua, Xiaojun Wan, Lei Li

Overview of the NLPCC 2017 Shared Task: Chinese News Headline Categorization

In this paper, we give an overview of the shared task at the CCF Conference on Natural Language Processing & Chinese Computing (NLPCC 2017): Chinese News Headline Categorization. The dataset of this shared task consists of 18 classes, with 12,000 short texts and their corresponding labels for each class. The dataset and example code can be accessed at

Xipeng Qiu, Jingjing Gong, Xuanjing Huang

Overview of the NLPCC 2017 Shared Task: Open Domain Chinese Question Answering

In this paper, we give the overview of the open domain Question Answering (or open domain QA) shared task in the NLPCC 2017. We first review the background of QA, and then describe two open domain Chinese QA tasks in this year’s NLPCC, including the construction of the benchmark datasets and the evaluation metrics. The evaluation results of submissions from participating teams are presented in the experimental part.

Nan Duan, Duyu Tang

