2018 | Book

Natural Language Processing and Chinese Computing

7th CCF International Conference, NLPCC 2018, Hohhot, China, August 26–30, 2018, Proceedings, Part II

Edited by: Min Zhang, Vincent Ng, Prof. Dongyan Zhao, Sujian Li, Prof. Dr. Hongying Zan

Publisher: Springer International Publishing

Book series: Lecture Notes in Computer Science

About this book

This two-volume set, LNAI 11108 and LNAI 11109, constitutes the refereed proceedings of the 7th CCF International Conference on Natural Language Processing and Chinese Computing, NLPCC 2018, held in Hohhot, China, in August 2018.

The 55 full papers and 31 short papers presented were carefully reviewed and selected from 308 submissions. The papers of the first volume are organized in the following topics: conversational Bot/QA/IR; knowledge graph/IE; machine learning for NLP; machine translation; and NLP applications. The papers of the second volume are organized as follows: NLP for social network; NLP fundamentals; text mining; and short papers.

Table of Contents

Frontmatter

NLP for Social Network

Frontmatter
A Fusion Model of Multi-data Sources for User Profiling in Social Media

User profiling in social media plays an important role in different applications. Most existing approaches to user profiling are based on user-generated messages, which are not sufficient for inferring user attributes. With the continuous accumulation of data in social media, integrating multiple data sources has become the inexorable trend for precise user profiling. In this paper, we take advantage of text messages, user metadata, followee information and network representations. In order to seamlessly integrate these multiple data sources, we propose a novel fusion model that effectively captures the complementarity and diversity of the different sources. In addition, we address the limitations of the friendship-based networks used in previous studies and introduce celebrity ties, which enrich the social network and boost the connectivity between different users. Experimental results show that our method outperforms several state-of-the-art methods on a real-world dataset.

Liming Zhang, Sihui Fu, Shengyi Jiang, Rui Bao, Yunfeng Zeng
First Place Solution for NLPCC 2018 Shared Task User Profiling and Recommendation

Social networking sites have been growing at an unprecedented rate in recent years. User profiling and personalized recommendation play an important role in social networking, for example in targeted advertising and personalized news feeds. NLPCC Task 8 comprises two subtasks. Subtask one is User Tags Prediction (UTP), i.e., predicting the tags related to a user. We consider UTP as a Multi-Label Classification (MLC) problem and propose a CNN-RNN framework to explicitly exploit the label dependencies. The proposed framework employs a CNN to obtain the user profile representation, while the RNN module captures the dependencies among labels. Subtask two, User Following Recommendation (UFR), is to recommend friends to users. There are mainly two approaches, Collaborative Filtering (CF) and Most Popular Friends (MPF), and we adopt a combination of both. Our experiments show that both of our methods yield clear improvements in F1@K compared to other algorithms, and we achieved first place in both subtasks.

Qiaojing Xie, Yuqian Wang, Zhenjing Xu, Kaidong Yu, Chen Wei, ZhiChen Yu
Summary++: Summarizing Chinese News Articles with Attention

We present Summary++, the model that competed in the NLPCC 2018 summarization task. In this paper, we describe the task, our model, the results and other aspects of our experiments in detail. The task is news article summarization in Chinese, where one sentence is generated per article. We use a neural encoder-decoder attention model with a pointer-generator network, and modify it to focus on words attended to rather than words predicted. Our model achieved second place in the task with a score of 0.285. The highlights of our model are that it runs at the character level, no extra features (e.g. part of speech, dependency structure) are used, and very little preprocessing is done.
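
As a reference point, the standard pointer-generator mixing step that such a model builds on combines the vocabulary distribution with the attention (copy) distribution. A minimal PyTorch-style sketch, with all tensor names illustrative rather than taken from the paper:

    import torch

    def final_distribution(p_vocab, attn, p_gen, src_ids):
        """Standard pointer-generator mixing (a sketch, not the authors' code).

        p_vocab: (batch, vocab_size) softmax over the output vocabulary
        attn:    (batch, src_len)    attention weights over source characters
        p_gen:   (batch, 1)          generation probability in [0, 1]
        src_ids: (batch, src_len)    vocabulary ids (int64) of source characters
        """
        gen = p_gen * p_vocab               # probability mass for generating
        copy = (1.0 - p_gen) * attn         # probability mass for copying
        # Scatter the copy mass back onto the vocabulary axis.
        return gen.scatter_add(1, src_ids, copy)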

Juan Zhao, Tong Lee Chung, Bin Xu, Minghu Jiang

NLP Fundamentals

Frontmatter
Paraphrase Identification Based on Weighted URAE, Unit Similarity and Context Correlation Feature

A deep learning model adaptive to both sentence-level and article-level paraphrase identification is proposed in this paper. It consists of a pairwise unit similarity feature and a semantic context correlation feature. In this model, sentences are represented by word and phrase embeddings, while articles are represented by sentence embeddings. The phrase and sentence embeddings are learned from parse trees through Weighted Unfolding Recursive Autoencoders (WURAE), an unsupervised learning algorithm. A unit similarity matrix is then calculated by matching the pairwise lists of embeddings, and is used to extract the pairwise unit similarity feature through CNN and k-max pooling layers. In addition, the semantic context correlation feature is taken into account, captured by a combination of CNN and LSTM: CNN layers learn collocation information between adjacent units, while the LSTM extracts long-term dependency features of the text based on the output of the CNN. The model is evaluated on a well-known English sentence paraphrase corpus, MSRPC, and a Chinese article paraphrase corpus. The results show that deep semantic features of text can be extracted based on WURAE, unit similarity and context correlation features. We release our WURAE code, the deep learning model for paraphrase identification, and pre-trained phrase and sentence embedding data for use by the community.

Jie Zhou, Gongshen Liu, Huanrong Sun
Which Embedding Level is Better for Semantic Representation? An Empirical Research on Chinese Phrases

Word embeddings have been used as popular features in various Natural Language Processing (NLP) tasks. To overcome the coverage problem of statistics, compositional models have been proposed, which embed basic units of a language and compose structures of higher hierarchy, such as idioms, phrases, and named entities. In that case, selecting the right level of basic-unit embedding to represent the semantics of a higher-hierarchy unit is crucial. This paper investigates this problem through a Chinese phrase representation task, in which characters and words are viewed as basic units. We define phrase representation evaluation tasks by utilizing Wikipedia. We propose four intuitive composition methods from basic embeddings to higher-level representations, and investigate the performance of the two kinds of basic units. Empirical results show that with all composition methods, word embeddings outperform character embeddings on both tasks, which indicates that the word level is more suitable for composing semantic representations.
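
The simplest such composition method is averaging the basic-unit embeddings. A minimal sketch, assuming toy lookup tables (the paper's four methods are not spelled out in the abstract):

    import numpy as np

    def compose_average(units, emb):
        """Average basic-unit embeddings (characters or words) into a
        phrase representation; a sketch of one intuitive composition."""
        vecs = [emb[u] for u in units if u in emb]
        return np.mean(vecs, axis=0) if vecs else None

    # Toy embeddings for illustration only.
    rng = np.random.default_rng(0)
    char_emb = {c: rng.normal(size=50) for c in "机器学习"}
    word_emb = {w: rng.normal(size=50) for w in ["机器", "学习"]}

    char_vec = compose_average(list("机器学习"), char_emb)  # character level
    word_vec = compose_average(["机器", "学习"], word_emb)  # word level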

Kunyuan Pang, Jintao Tang, Ting Wang
Improving Word Embeddings for Antonym Detection Using Thesauri and SentiWordNet

Word embedding is a distributed representation of words in a vector space. It involves a mathematical embedding from a space with one dimension per word to a continuous vector space of much lower dimension. It performs well on tasks such as synonym and hyponym detection by grouping similar words. However, most existing word embeddings are insensitive to antonyms, since they are trained on word distributions in large amounts of text data, where antonyms usually have similar contexts. To generate word embeddings that are capable of detecting antonyms, we first modify the objective function of the Skip-Gram model, and then utilize the supervised synonym and antonym information in thesauri as well as the sentiment information of each word in SentiWordNet. We conduct evaluations on three relevant tasks, namely GRE antonym detection, word similarity, and semantic textual similarity. The experimental results show that our antonym-sensitive embedding outperforms common word embeddings on these tasks, demonstrating the efficacy of our methods.

Zehao Dou, Wei Wei, Xiaojun Wan
Neural Chinese Word Segmentation with Dictionary Knowledge

Chinese word segmentation (CWS) is an important task for Chinese NLP. Recently, many neural network based methods have been proposed for CWS. However, these methods require a large number of labeled sentences for model training, and usually cannot utilize the useful information in Chinese dictionaries. In this paper, we propose two methods to exploit dictionary information for CWS. The first is based on pseudo-labeled data generation, and the second is based on multi-task learning. The experimental results on two benchmark datasets validate that our approach can effectively improve the performance of Chinese word segmentation, especially when training data is insufficient.

Junxin Liu, Fangzhao Wu, Chuhan Wu, Yongfeng Huang, Xing Xie
Recognizing Macro Chinese Discourse Structure on Label Degeneracy Combination Model

Discourse structure analysis is an important task in Natural Language Processing (NLP) and is helpful to many NLP applications, such as automatic summarization and information extraction. However, there has been little research on Chinese macro discourse structure analysis due to the lack of annotated corpora. In this paper, combining structure recognition with nuclearity recognition, we propose a Label Degeneracy Combination Model (LD-CM) to find the solution of structure recognition in the solution space of nuclearity recognition. Experimental results on the Macro Chinese Discourse TreeBank (MCDTB) show that our model improves accuracy by 1.21% compared with the baseline system.

Feng Jiang, Peifeng Li, Xiaomin Chu, Qiaoming Zhu, Guodong Zhou
LM Enhanced BiRNN-CRF for Joint Chinese Word Segmentation and POS Tagging

Word segmentation and part-of-speech tagging are two preliminary but fundamental components of Chinese natural language processing. With the upsurge of deep learning, end-to-end models are built without handcrafted features. In this work, we model Chinese word segmentation and part-of-speech tagging jointly on the basis of state-of-the-art BiRNN-CRF architecture. LSTM is adopted as the basic recurrent unit. Apart from utilizing pre-trained character embeddings and trigram features, we incorporate neural language model and conduct multi-task training. Highway layers are applied to tackle the discordance issue of the naive co-training. Experimental results on CTB5, CTB7, and PPD datasets show the effectiveness of the proposed method.

Jianhu Zhang, Gongshen Liu, Jie Zhou, Cheng Zhou, Huanrong Sun
Chinese Grammatical Error Correction Using Statistical and Neural Models

This paper introduces the Alibaba NLP team’s system for the NLPCC 2018 shared task of Chinese Grammatical Error Correction (GEC). Chinese as a Second Language (CSL) learners can use this system to correct grammatical errors in texts they write. We propose a method that combines statistical and neural models for the GEC task. It consists of two modules: the correction module and the combination module. In the correction module, two statistical models and one neural model generate correction candidates for each input sentence; the statistical models are a rule-based model and a statistical machine translation (SMT)-based model, while the neural model is a neural machine translation (NMT)-based model. The combination module is implemented in a hierarchical manner: we first combine models at a lower level, i.e., we train several models with different configurations and combine them, and we then combine the statistical and neural models at a higher level. Our system reached second place on the official leaderboard.

Junpei Zhou, Chen Li, Hengyou Liu, Zuyi Bao, Guangwei Xu, Linlin Li

Text Mining

Frontmatter
Multi-turn Inference Matching Network for Natural Language Inference

Natural Language Inference (NLI) is a fundamental and challenging task in Natural Language Processing (NLP). Most existing methods apply only a one-pass inference process on a mixed matching feature, which is a concatenation of different matching features between a premise and a hypothesis. In this paper, we propose a new model called the Multi-turn Inference Matching Network (MIMN) to perform multi-turn inference on the different matching features. In each turn, the model focuses on one particular matching feature instead of the mixed one. To enhance the interaction between different matching features, a memory component is employed to store the inference history. The inference of each turn is performed on the current matching feature and the memory. We conduct experiments on three different NLI datasets. The experimental results show that our model outperforms or matches the state-of-the-art performance on all three datasets.

Chunhua Liu, Shan Jiang, Hainan Yu, Dong Yu
From Humour to Hatred: A Computational Analysis of Off-Colour Humour

Off-colour humour is a category of humour which is considered by many to be in poor taste or overly vulgar. Most commonly, off-colour humour contains remarks on particular ethnic groups or genders, violence, domestic abuse, acts concerned with sex, or excessive swearing and profanity. Blue humour, dark humour and insult humour are types of off-colour humour. Blue and dark humour, unlike insult humour, are not outrightly insulting in nature, but are often misclassified because of the presence of insults and harmful speech. As the primary contributions of this paper, we provide an original dataset consisting of nearly 15,000 instances and a novel approach to separating dark and blue humour from offensive humour, which is essential so that free speech on the internet is not curtailed. Our experiments show that deep learning methods outperform n-gram-based approaches such as SVMs, Naive Bayes and Logistic Regression by a large margin.

Vikram Ahuja, Radhika Mamidi, Navjyoti Singh
Classification of the Structure of Square Hmong Characters and Analysis of Its Statistical Properties

Analysis of character structure characteristics can lay an information foundation for the intelligent processing of square Hmong characters. Combined with an analysis of character structure characteristics, this paper presents a definition of the linearization of square Hmong characters and a definition of the equivalence-class division of their structure, and proposes a decision algorithm for structure equivalence classes. According to this algorithm, the structure of square Hmong characters is divided into eight equivalence classes. Analysis of statistical properties, including the cumulative probability distribution, complexity, and information entropy of square Hmong characters appearing in practical documents, shows that: first, more than 90% of square Hmong characters appearing in practical documents are composed of two components, and more than 80% of these characters possess a left-right, top-bottom, or lower-left-enclosed structure; second, the mean number of components in a square Hmong character is slightly greater than 2; third, the information entropy of the structure of square Hmong characters lies within the interval (1.19, 2.16). The results reveal that square Hmong characters appearing frequently in practical documents follow the principle of simple structure orientation.
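
The structure entropy reported above is the Shannon entropy of the distribution over the eight equivalence classes. A small sketch of that computation (the class counts below are hypothetical):

    import math

    def shannon_entropy(counts):
        """Shannon entropy (in bits) of a frequency distribution."""
        total = sum(counts)
        return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

    # Hypothetical frequencies of the eight structure equivalence classes.
    counts = [5200, 3100, 900, 400, 200, 100, 60, 40]
    print(f"structure entropy: {shannon_entropy(counts):.2f} bits")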

Li-Ping Mo, Kai-Qing Zhou, Liang-Bin Cao, Wei Jiang
Stock Market Trend Prediction Using Recurrent Convolutional Neural Networks

Short-term prediction of stock market trend has potential application for personal investment without high-frequency-trading infrastructure. Existing studies on stock market trend prediction have introduced machine learning methods with handcrafted features. However, manual labor spent on handcrafting features is expensive. To reduce manual labor, we propose a novel recurrent convolutional neural network for predicting stock market trend. Our network can automatically capture useful information from news on stock market without any handcrafted feature. In our network, we first introduce an entity embedding layer to automatically learn entity embedding using financial news. We then use a convolutional layer to extract key information affecting stock market trend, and use a long short-term memory neural network to learn context-dependent relations in financial news for stock market trend prediction. Experimental results show that our model can achieve significant improvement in terms of both overall prediction and individual stock predictions, compared with the state-of-the-art baseline methods.

Bo Xu, Dongyu Zhang, Shaowu Zhang, Hengchao Li, Hongfei Lin
Ensemble of Binary Classification for the Emotion Detection in Code-Switching Text

This paper describes the methods of the DeepIntell team, which participated in Task 1 at NLPCC 2018. Task 1 is to label the emotions in code-switching text. Note that there may be more than one emotion in a post in this task; hence, this is a multi-label classification task. At the same time, a post contains more than one language, and emotion can be expressed in either monolingual or bilingual form. In this paper, we propose a novel method that converts the multi-label classification into a binary classification task and applies ensemble learning to code-switching text with sampling and an emotion lexicon. Experiments show that the proposed method achieves better performance on the code-switching text task.
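
Converting multi-label emotion detection into per-label binary tasks (often called binary relevance) can be sketched as follows; the classifier and threshold are placeholders, not the authors' ensemble:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def train_binary_relevance(X, Y):
        """Train one binary classifier per emotion label (a sketch).
        X: (n_samples, n_features); Y: (n_samples, n_labels) in {0, 1}."""
        return [LogisticRegression(max_iter=1000).fit(X, Y[:, j])
                for j in range(Y.shape[1])]

    def predict_labels(models, X, threshold=0.5):
        probs = np.column_stack([m.predict_proba(X)[:, 1] for m in models])
        return (probs >= threshold).astype(int)  # a post may get several labels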

Xinghua Zhang, Chunyue Zhang, Huaxing Shi
A Multi-emotion Classification Method Based on BLSTM-MC in Code-Switching Text

Most previous emotion classification work is based on binary or ternary classification, where the final classification result contains only one type of emotion. There is little research on multi-emotion coexistence, which limits the restoration of humans’ true emotions. Aiming at these deficiencies, this paper proposes a Bidirectional Long Short-Term Memory Multiple Classifiers (BLSTM-MC) model to study the five-class problem in code-switching text, obtaining textual context relations through the BLSTM-MC model. It fully considers the relationships between different emotions in a single post, and an attention mechanism is introduced to weigh the importance of different features and predict all emotions expressed by each post. The model achieved third place among all submissions in NLPCC 2018 Task 1.

Tingwei Wang, Xiaohua Yang, Chunping Ouyang, Aodong Guo, Yongbin Liu, Zhixing Li

Short Papers

Frontmatter
Improved Character-Based Chinese Dependency Parsing by Using Stack-Tree LSTM

Almost all state-of-the-art methods for character-based Chinese dependency parsing ignore the complete dependency subtree information built during the parsing process, which is crucial for parsing the rest of the sentence. In this paper, we introduce a novel neural network architecture to capture dependency subtree features. We extend and improve recent work on neural joint models for Chinese word segmentation, POS tagging and dependency parsing, and adopt bidirectional LSTMs to learn n-gram feature representations and context information. The neural network and the bidirectional LSTMs are trained jointly with the parser objective, resulting in very effective feature extractors for parsing. Finally, we conduct experiments on Penn Chinese Treebank 5 and demonstrate the effectiveness of the approach by applying it to a greedy transition-based parser. The results show that our model outperforms state-of-the-art neural joint models in Chinese word segmentation, POS tagging and dependency parsing.

Hang Liu, Mingtong Liu, Yujie Zhang, Jinan Xu, Yufeng Chen
Neural Question Generation with Semantics of Question Type

This paper focuses on automatic question generation (QG), which transforms a narrative sentence into an interrogative sentence. Recently, neural networks have been used for this task due to their extraordinary ability to encode and decode semantics. We propose an approach which incorporates the semantics of the possible question type. We utilize a Convolutional Neural Network (CNN) for predicting the question type of the answer phrases in the narrative sentence. In order to incorporate the question type semantics into the generation process, we classify the question type that the answer phrases refer to. In addition, we use a Bidirectional Long Short-Term Memory (Bi-LSTM) network to construct the question generation model. The experimental results show that our method outperforms the baseline system with an improvement of 1.7% in BLEU-4 score and exceeds the state-of-the-art.

Xiaozheng Dong, Yu Hong, Xin Chen, Weikang Li, Min Zhang, Qiaoming Zhu
A Feature-Enriched Method for User Intent Classification by Leveraging Semantic Tag Expansion

User intent identification and classification has become a vital topic in query understanding for human-computer dialogue applications. Identifying users’ intent is especially crucial for helping a system understand users’ queries and classify them accurately, so as to improve user satisfaction. Since posted queries are usually short and lack context, conventional methods relying heavily on query n-grams or other common features are not sufficient. This paper proposes a compact yet effective user intent classification method named ST-UIC, based on a constructed semantic tag repository. The method uses a combination of four kinds of features: characters, non-key-noun part-of-speech tags, target words, and semantic tags. The experiments are based on a widely applied dataset provided by the First Evaluation of Chinese Human-Computer Dialogue Technology. The results show that the method achieved an F1 score of 0.945, exceeding a range of baseline methods and demonstrating its effectiveness for user intent classification.

Wenxiu Xie, Dongfa Gao, Ruoyao Ding, Tianyong Hao
Event Detection via Recurrent Neural Network and Argument Prediction

This paper tackles the task of event detection, which involves identifying and categorizing events. Event detection remains challenging due to the difficulty of encoding event semantics in complicated contexts. The core semantics of an event may derive from its trigger and arguments. However, most previous studies failed to capture argument semantics in event detection. To address this issue, this paper first provides a rule-based method to predict candidate arguments for the possible event types, and then proposes a recurrent neural network model, RNN-ARG, with an attention mechanism for event detection to capture meaningful semantic regularities from these predicted candidate arguments. The experimental results on the ACE 2005 English corpus show that our approach achieves competitive results compared with previous work.

Wentao Wu, Xiaoxu Zhu, Jiaming Tao, Peifeng Li
Employing Multiple Decomposable Attention Networks to Resolve Event Coreference

Event coreference resolution is a challenging NLP task because it requires understanding the semantics of events. Unlike most previous studies, which used probability-based or graph-based models, this paper introduces a novel neural network, MDAN (Multiple Decomposable Attention Networks), to resolve document-level event coreference from different views, i.e., event mention, event arguments and trigger context. Moreover, it applies a document-level global inference mechanism to further resolve the coreference chains. The experimental results on two popular datasets, ACE and TAC-KBP, show that our model outperforms two state-of-the-art baselines.

Jie Fang, Peifeng Li, Guodong Zhou
Cross-Scenario Inference Based Event-Event Relation Detection

Event-Event Relation Detection (RD$$_{2e}$$) aims to detect the relations between a pair of news events, such as the Causal relation between Criminal and Penal events. In general, RD$$_{2e}$$ is a challenging task due to the lack of explicit linguistic features signaling the relations. We propose a cross-scenario inference method for RD$$_{2e}$$. By utilizing conceptualized scenario expressions and graph-based semantic distance perception, we retrieve semantically similar historical events from Gigaword. Based on the explicit relations of the historical events, we infer implicit relations of target events by means of transfer learning. Experiments on 10 relation types show that our method outperforms supervised models.

Yu Hong, Jingli Zhang, Rui Song, Jianmin Yao
SeRI: A Dataset for Sub-event Relation Inference from an Encyclopedia

Mining sub-event relations of major events is an important research problem, useful for building event taxonomies, constructing event knowledge bases, and natural language understanding. To advance the study of this problem, this paper presents a novel dataset called SeRI (Sub-event Relation Inference). SeRI includes 3,917 event articles from English Wikipedia together with annotations of their sub-events. It can be used for training or evaluating a model that mines sub-event relations from encyclopedia-style texts. Based on this dataset, we formally define the task of sub-event relation inference from an encyclopedia, propose an experimental setting and evaluation metrics, and evaluate some baseline approaches on this dataset.

Tao Ge, Lei Cui, Baobao Chang, Zhifang Sui, Furu Wei, Ming Zhou
Densely Connected Bidirectional LSTM with Applications to Sentence Classification

Deep neural networks have recently been shown to achieve highly competitive performance in many computer vision tasks due to their ability to explore a much larger hypothesis space. However, since most deep architectures like stacked RNNs tend to suffer from the vanishing-gradient and overfitting problems, their effects are still understudied in many NLP tasks. Inspired by this, we propose a novel multi-layer RNN model called densely connected bidirectional long short-term memory (DC-Bi-LSTM), which essentially represents each layer by the concatenation of its hidden state and all preceding layers’ hidden states, followed by recursively passing each layer’s representation to all subsequent layers. We evaluate the proposed model on five benchmark datasets for sentence classification. DC-Bi-LSTM with depth up to 20 can be trained successfully and obtains significant improvements over the traditional Bi-LSTM with the same or even fewer parameters. Moreover, our model shows promising performance compared with state-of-the-art approaches.
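
The dense connectivity described above, where each layer reads the concatenation of the input and all preceding layers' hidden states, can be sketched in PyTorch; dimensions and depth are illustrative:

    import torch
    from torch import nn

    class DCBiLSTM(nn.Module):
        """Sketch of a densely connected BiLSTM: each layer consumes the
        original input concatenated with all previous layers' outputs."""
        def __init__(self, input_dim, hidden_dim, depth):
            super().__init__()
            self.layers = nn.ModuleList()
            in_dim = input_dim
            for _ in range(depth):
                self.layers.append(nn.LSTM(in_dim, hidden_dim,
                                           batch_first=True, bidirectional=True))
                in_dim += 2 * hidden_dim  # dense connections grow the input

        def forward(self, x):  # x: (batch, seq_len, input_dim)
            features = x
            for lstm in self.layers:
                out, _ = lstm(features)
                features = torch.cat([features, out], dim=-1)
            return features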

Zixiang Ding, Rui Xia, Jianfei Yu, Xiang Li, Jian Yang
An End-to-End Scalable Iterative Sequence Tagging with Multi-Task Learning

Multi-task learning (MTL) models, which pool examples arising from several tasks, have achieved remarkable results in language processing. However, multi-task learning is not always effective compared with single-task methods in sequence tagging. One possible reason is that existing approaches to multi-task sequence tagging often rely on lower-layer parameter sharing to connect different tasks; the lack of interaction between tasks results in limited performance improvement. In this paper, we propose a novel multi-task learning architecture which iteratively and explicitly utilizes the prediction results of each task. We train our model on part-of-speech (POS) tagging, chunking and named entity recognition (NER) simultaneously. Experimental results show that without any task-specific features, our model obtains state-of-the-art performance on both chunking and NER.

Lin Gui, Jiachen Du, Zhishan Zhao, Yulan He, Ruifeng Xu, Chuang Fan
A Comparable Study on Model Averaging, Ensembling and Reranking in NMT

Neural machine translation has become a benchmark method in machine translation. Many novel structures and methods have been proposed to improve translation quality, but models remain difficult to train and tune. In this paper, we focus on decoding techniques that boost translation performance by utilizing existing models. We address the problem from three aspects, namely the parameter, word and sentence levels, corresponding to checkpoint averaging, model ensembling and candidate reranking, none of which requires retraining the model. Experimental results show that the proposed decoding approaches can significantly improve performance over the baseline model.
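
Checkpoint averaging, the parameter-level technique above, simply averages the weights of the last few saved checkpoints of one training run. A PyTorch-flavoured sketch with hypothetical paths:

    import torch

    def average_checkpoints(paths):
        """Average the parameters of several checkpoints of the same model."""
        avg = None
        for p in paths:
            state = torch.load(p, map_location="cpu")
            if avg is None:
                avg = {k: v.clone().float() for k, v in state.items()}
            else:
                for k, v in state.items():
                    avg[k] += v.float()
        return {k: v / len(paths) for k, v in avg.items()}

    # Hypothetical usage:
    # model.load_state_dict(average_checkpoints(["ckpt_18.pt", "ckpt_19.pt", "ckpt_20.pt"]))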

Yuchen Liu, Long Zhou, Yining Wang, Yang Zhao, Jiajun Zhang, Chengqing Zong
Building Corpus with Emoticons for Sentiment Analysis

Corpora are an essential resource for data-driven natural language processing systems, especially for sentiment analysis. In recent years, people have increasingly used emoticons on social media to express their emotions, attitudes or preferences. We believe that emoticons are a non-negligible feature for sentiment analysis tasks. However, few existing works have focused on sentiment analysis with emoticons, and there are few related corpora containing emoticons. In this paper, we create a large-scale Chinese Emoticon Sentiment Corpus of Movies (CESCM). Unlike other corpora, it contains a wide variety of emoticons. In addition, we conduct some baseline sentiment analysis work on CESCM. Experimental results show that emoticons do play an important role in sentiment analysis. Our goal is to make the corpus widely available, and we believe it will offer great support to sentiment analysis research and emoticon research.

Changliang Li, Yongguan Wang, Changsong Li, Ji Qi, Pengyuan Liu
Construction of a Multi-dimensional Vectorized Affective Lexicon

Affective analysis has received growing attention from both the research community and industry. However, previous works either cannot express the complex and compound states of human feelings or rely heavily on manual intervention. In this paper, by adopting Plutchik’s wheel of emotions, we propose a low-cost construction method that utilizes word embeddings and high-quality small seed sets of affective words to generate multi-dimensional affective vectors automatically. As a verification, we construct a large-scale affective lexicon that maps each word to a vector in the affective space. The construction procedure uses little supervision or manual intervention and can learn affective knowledge from huge amounts of raw corpus automatically. Experimental results on an affective classification task and a contextual polarity disambiguation task demonstrate that the proposed affective lexicon outperforms other state-of-the-art affective lexicons.

Yang Wang, Chong Feng, Qian Liu
Convolution Neural Network with Active Learning for Information Extraction of Enterprise Announcements

We propose using a convolutional neural network (CNN) with active learning for information extraction from enterprise announcements. Training a supervised deep learning model usually requires a large amount of training data with high-quality reference samples. Human production of such samples is tedious and, since inter-labeler agreement is low, unreliable. Active learning helps assuage this problem by automatically selecting a small number of unlabeled samples for humans to hand correct: it chooses a selective set of samples to be labeled, and the CNN is then trained on the labeled data iteratively until the expected performance is achieved. We propose three sample selection methods based on a certainty criterion. We also build an enterprise announcements dataset for the experiments, containing 10,410 samples in total. Our experimental results show that the amount of labeled data needed for a given extraction accuracy can be reduced by more than 45.79% compared to training without active learning.
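
One common certainty criterion is least confidence: score each unlabeled sample by the model's top class probability and send the least certain ones to annotators. A minimal sketch (the authors' three criteria are not spelled out in the abstract):

    import numpy as np

    def select_least_confident(probs, k):
        """Pick the k unlabeled samples whose top class probability is lowest.
        probs: (n_samples, n_classes) predicted probabilities from the CNN."""
        certainty = probs.max(axis=1)
        return np.argsort(certainty)[:k]  # indices to hand-correct next

    # The active-learning loop labels these samples, retrains the CNN, and repeats.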

Lei Fu, Zhaoxia Yin, Yi Liu, Jun Zhang
Research on Construction Method of Chinese NT Clause Based on Attention-LSTM

The correct definition and recognition of sentences is the basis of NLP. To capture the characteristics of Chinese text structure, the theory of NT clauses was proposed from the perspective of micro topics. Based on this theory, this paper proposes a novel method for constructing NT clauses. First, we propose a neural network model based on attention and LSTM (Attention-LSTM) that can identify the location of a missing Naming, and train it on a manually annotated corpus. Second, during the construction of NT clauses, the trained Attention-LSTM is used to identify the location of the missing Naming, after which the NT clause can be constructed. The accuracy of the experimental result is 81.74% (+4.5%). This work can support text understanding tasks such as machine translation, information extraction, and man-machine dialogue.

Teng Mao, Yuyao Zhang, Yuru Jiang, Yangsen Zhang
The Research and Construction of Complaint Orders Classification Corpus in Mobile Customer Service

Complaint orders in mobile customer service are records of complaint descriptions, in which professional knowledge and information about customers’ complaint intentions are kept. Complaint order classification is necessary for further mining and analysis and for improving the quality of customer service, and a constructed corpus is the basis of such research. The lack of a complaint orders classification corpus (COCC) in mobile customer service has limited research on complaint order classification. This paper first employs the K-means algorithm together with professional knowledge to determine the classification labels for complaint orders. We then craft annotation rules for complaint orders and construct a complaint orders classification corpus consisting of 130,044 annotated complaint orders. Finally, we statistically analyze the constructed corpus; the agreement for each class reaches over 91%, indicating that the corpus can provide strong support for complaint order classification and specialized analysis.

Junli Xu, Jiangjiang Zhao, Ning Zhao, Chao Xue, Linbo Fan, Zechuan Qi, Qiang Wei
The Algorithm of Automatic Text Summarization Based on Network Representation Learning

Graph models are an important method in automatic text summarization. However, mapping text to a graph suffers from vector sparseness and information redundancy. In this paper, we propose a graph clustering summarization algorithm based on network representation learning. The sentence graph is constructed using TF-IDF, with the number of edges controlled by a threshold. Node2Vec is used to embed the graph, and the sentences are clustered by k-means. Finally, modularity is used to control the number of clusters and generate a brief summary of the document. Experiments on MultiLing 2013 show that the proposed algorithm improves the F-score on ROUGE-1 and ROUGE-2.
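
The first step of this pipeline, a TF-IDF sentence graph with threshold-pruned edges, might look like the sketch below (sentences are assumed pre-tokenized into space-separated words; the Node2Vec, k-means and modularity steps are omitted):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def sentence_graph(sentences, threshold=0.2):
        """Adjacency matrix over sentences: TF-IDF vectors, cosine
        similarity, and edges kept only above a threshold (a sketch)."""
        tfidf = TfidfVectorizer().fit_transform(sentences)
        sim = cosine_similarity(tfidf)
        adj = np.where(sim >= threshold, sim, 0.0)
        np.fill_diagonal(adj, 0.0)  # no self-loops
        return adj                  # Node2Vec embedding would follow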

Xinghao Song, Chunming Yang, Hui Zhang, Xujian Zhao
Semi-supervised Sentiment Classification Based on Auxiliary Task Learning

Sentiment classification is an important task in the Natural Language Processing community. The task aims to determine the sentiment category of a piece of text. One challenge is that it is difficult to obtain a large number of labeled samples, so many studies have focused on semi-supervised learning, i.e., learning from unlabeled samples. However, one disadvantage of previous methods is that the unlabeled and labeled samples are studied in different models, with no interaction between them. This paper therefore tackles the problem by proposing a semi-supervised sentiment classification method based on auxiliary task learning, namely Aux-LSTM, which assists learning the sentiment classification task with a small amount of human-annotated samples by training on auto-annotated samples. Specifically, the two tasks share an auxiliary LSTM layer, and the auxiliary representation obtained by that layer is used to assist the main task. Empirical studies demonstrate that the proposed method can effectively improve performance.
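
The shared-layer arrangement can be sketched in PyTorch as one LSTM feeding two task heads; layer sizes and the exact way the auxiliary representation assists the main task are assumptions:

    import torch
    from torch import nn

    class AuxLSTM(nn.Module):
        """Sketch: a shared LSTM serves both the main (human-annotated)
        task and the auxiliary (auto-annotated) task via separate heads."""
        def __init__(self, vocab_size, emb_dim=100, hidden=128, n_classes=2):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_dim)
            self.shared = nn.LSTM(emb_dim, hidden, batch_first=True)
            self.main_head = nn.Linear(hidden, n_classes)
            self.aux_head = nn.Linear(hidden, n_classes)

        def forward(self, ids, task="main"):
            _, (h, _) = self.shared(self.emb(ids))  # h: (1, batch, hidden)
            head = self.main_head if task == "main" else self.aux_head
            return head(h[-1])

    # Training alternates batches of the two tasks; gradients from both
    # update the shared layer, which is how the auxiliary data helps.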

Huan Liu, Jingjing Wang, Shoushan Li, Junhui Li, Guodong Zhou
A Normalized Encoder-Decoder Model for Abstractive Summarization Using Focal Loss

Abstractive summarization based on the seq2seq model is a popular research topic today, and pre-trained word embeddings are a common unsupervised method for improving a deep learning model’s performance in NLP. However, when applying this method directly to the seq2seq model, we find it does not achieve the good results seen in other fields because of an overtraining problem. In this paper, we propose a normalized encoder-decoder structure to address this, which prevents the semantic structure of the pre-trained word embeddings from being destroyed during training. Moreover, we use a novel focal loss function to help our model focus on examples with low scores to obtain better performance. We conduct experiments on NLPCC 2018 shared task 3, single document summarization. Results show that these two mechanisms are extremely useful, helping our model achieve state-of-the-art ROUGE scores and rank first in this task.

Yunsheng Shi, Jun Meng, Jian Wang, Hongfei Lin, Yumeng Li
A Relateness-Based Ranking Method for Knowledge-Based Question Answering

In this paper, we report the technical details of our approach for the NLPCC 2018 shared task on knowledge-based question answering. Our system uses a word-based maximum matching method to find entity candidates. Then, we combine edit distance, character overlap and word2vec cosine similarity to rank the SRO triples of each entity candidate. Finally, the object of the top-scoring SRO is selected as the answer to the question. Our system achieves 62.94% answer exact match on the test set.
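
The triple-ranking score combines three relatedness signals; a hedged sketch with equal weights (the actual weighting and normalization are not given in the abstract):

    import numpy as np

    def edit_distance(a, b):
        """Levenshtein distance via a single-row dynamic program."""
        dp = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            prev, dp[0] = dp[0], i
            for j, cb in enumerate(b, 1):
                prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                         prev + (ca != cb))
        return dp[-1]

    def relatedness(question, relation, q_vec, r_vec, w=(1 / 3, 1 / 3, 1 / 3)):
        """Combine normalized edit distance, character overlap and
        word2vec cosine similarity (weights are illustrative)."""
        ed = 1 - edit_distance(question, relation) / max(len(question), len(relation), 1)
        qs, rs = set(question), set(relation)
        overlap = len(qs & rs) / max(len(qs | rs), 1)
        cos = float(np.dot(q_vec, r_vec)
                    / (np.linalg.norm(q_vec) * np.linalg.norm(r_vec) + 1e-8))
        return w[0] * ed + w[1] * overlap + w[2] * cos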

Han Ni, Liansheng Lin, Ge Xu
A Sequence to Sequence Learning for Chinese Grammatical Error Correction

Grammatical Error Correction (GEC) is an important task in natural language processing. In this paper, we introduce our system for NLPCC 2018 Shared Task 2, Grammatical Error Correction. The task is to detect and correct grammatical errors in Chinese essays written by non-native speakers of Mandarin Chinese. Our system is mainly based on the convolutional sequence-to-sequence model: we regard GEC as a translation task from the language of “bad” Chinese to the language of “good” Chinese. We describe the building process of the model in detail. On the test data of NLPCC 2018 Shared Task 2, our system achieves the best precision score, with an $$F_{0.5}$$ score of 29.02, and our final results ranked third among the participants.

Hongkai Ren, Liner Yang, Endong Xun
Ensemble of Neural Networks with Sentiment Words Translation for Code-Switching Emotion Detection

Emotion detection in code-switching texts aims to identify the emotion labels of text that contains more than one language. The difficulties of this task include bridging the gap between languages and capturing the semantic information crucial for classification. To address these issues, we propose an ensemble model with sentiment-word translation to build a powerful system. Our system first constructs an English-Chinese sentiment dictionary to make a connection between the two languages. We then separately train several models, including CNN, RCNN and attention-based LSTM models, and combine their classification results to improve performance. The experimental results show that our method is effective, achieving second place among nineteen systems.

Tianchi Yue, Chen Chen, Shaowu Zhang, Hongfei Lin, Liang Yang
NLPCC 2018 Shared Task User Profiling and Recommendation Method Summary by DUTIR_9148

User profiling and personalized recommendation play an important role in many business applications such as precision marketing and targeted advertising. Since user data is heterogeneous, leveraging heterogeneous information for user profiling and personalized recommendation is still a challenge. In this paper, we propose effective methods for the two subtasks of user profiling and recommendation. Subtask one is to predict users’ tags. We treat this as a binary classification task: we combine users’ profile vectors and social Large-scale Information Network Embedding (LINE) vectors as user features, use tag information as tag features, and then apply a deep learning approach to predict which tags are related to a user. Subtask two is to predict the users a given user would like to follow in the future; we adopt social-based collaborative filtering (CF) to solve it. Our results achieve second place in both subtasks.

Xiaoyu Chen, Jian Wang, Yuqi Ren, Tong Liu, Hongfei Lin
Overview of NLPCC 2018 Shared Task 1: Emotion Detection in Code-Switching Text

This paper presents an overview of the shared task on emotion detection in code-switching text at NLPCC 2018. The submitted systems are expected to automatically determine the emotions in Chinese-English code-switching text. Different from monolingual text, code-switching text contains more than one language, and emotion can be expressed in either monolingual or bilingual form. Hence, the challenges are how to integrate both monolingual and bilingual forms to detect emotion, and how to bridge the gap between the two languages. Our shared task had 19 participating teams, and the highest F-score was 0.515. In this paper, we introduce the task, the corpus, the participating teams, and the evaluation results.

Zhongqing Wang, Shoushan Li, Fan Wu, Qingying Sun, Guodong Zhou
Overview of the NLPCC 2018 Shared Task: Automatic Tagging of Zhihu Questions

In this paper, we give an overview of the shared task at the CCF Conference on Natural Language Processing & Chinese Computing (NLPCC 2018): Automatic Tagging of Zhihu Questions. The dataset is collected from the Chinese question-answering web site Zhihu and consists of 25,551 tags and 721,608 training samples. This is a multi-label text classification task, and each question can have as many as five relevant tags. The dataset can be accessed at http://tcci.ccf.org.cn/conference/2018/taskdata.php.

Bo Huang, Zhenyu Zhao
Overview of the NLPCC 2018 Shared Task: Grammatical Error Correction

In this paper, we present an overview of the Grammatical Error Correction task in the NLPCC 2018 shared tasks. We give detailed descriptions of the task definition and the data for training as well as evaluation. We also summarize the approaches investigated by the participants of this task. These approaches demonstrate the state-of-the-art of Grammatical Error Correction for Mandarin Chinese. The data set and evaluation tool used in this task are available at https://github.com/zhaoyyoo/NLPCC2018_GEC.

Yuanyuan Zhao, Nan Jiang, Weiwei Sun, Xiaojun Wan
Overview of the NLPCC 2018 Shared Task: Multi-turn Human-Computer Conversations

In this paper, we give an overview of the multi-turn human-computer conversation shared task at NLPCC 2018. The task consists of two sub-tasks: conversation generation and retrieval with a given context. Datasets for both training and testing are collected from Weibo, comprising 5 million conversation sessions for training and 40,000 non-overlapping conversation sessions for evaluation. Details of the shared task, the evaluation metric, and the submitted models are given in turn.

Juntao Li, Rui Yan
Overview of the NLPCC 2018 Shared Task: Open Domain QA

We give an overview of the open-domain QA shared task at NLPCC 2018. This year, we released three sub-tasks: a Chinese knowledge-based question answering (KBQA) task, a Chinese knowledge-based question generation (KBQG) task, and an English knowledge-based question understanding (KBQU) task. The evaluation results of the final submissions from participating teams are presented in the experimental part.

Nan Duan
Overview of the NLPCC 2018 Shared Task: Single Document Summarization

In this report, we give an overview of the shared task about single document summarization at the seventh CCF Conference on Natural Language Processing and Chinese Computing (NLPCC 2018). Short summaries for articles are consumed frequently on mobile news apps. Because of the limited display space on the mobile phone screen, it is required to create concise text for the main idea of an article. This task aims at promoting technology development for single document summarization. We describe the task, the corpus, the participating teams and their results.

Lei Li, Xiaojun Wan
Overview of the NLPCC 2018 Shared Task: Social Media User Modeling

In this paper, we give an overview of the social media user modeling shared task at NLPCC 2018. We first review the background of social media user modeling, and then describe this year’s two social media user modeling tasks, including the construction of the benchmark datasets and the evaluation metrics. The evaluation results of submissions from participating teams are presented in the experimental part.

Fuzheng Zhang, Xing Xie
Overview of the NLPCC 2018 Shared Task: Spoken Language Understanding in Task-Oriented Dialog Systems

This paper presents the overview of the shared task at the 7th CCF Conference on Natural Language Processing & Chinese Computing (NLPCC 2018): Spoken Language Understanding (SLU) in Task-oriented Dialog Systems. SLU usually consists of two parts, namely intent identification and slot filling. The shared task made publicly available a Chinese dataset of over 5.8K sessions, a sample of the real query log from a commercial task-oriented dialog system, comprising 26K utterances. The contexts within a session are taken into consideration when a query within the session is annotated. To help participating systems correct ASR errors in slot values, the task also provides a dictionary of values for each enumerable slot type. 16 teams entered the task and submitted a total of 40 SLU results. In this paper, we review the task, the corpus, and the evaluation results.

Xuemin Zhao, Yunbo Cao
WiseTag: An Ensemble Method for Multi-label Topic Classification

Multi-label topic classification aims to assign one or more relevant topic labels to a text. This paper presents the WiseTag system, which performs multi-label topic classification based on an ensemble of four single models: a KNN-based model, an information gain-based model, a keyword matching-based model and a deep learning-based model. These single models are carefully designed to be diverse enough to improve the performance of the ensemble. In NLPCC 2018 shared task 6, “Automatic Tagging of Zhihu Questions”, the proposed WiseTag system achieves an F1 score of 0.4863 on the test set and ranks 4th among all teams.

Guanqing Liang, Hsiaohsien Kao, Cane Wing-Ki Leung, Chao He
Backmatter
Metadata
Title
Natural Language Processing and Chinese Computing
Edited by
Min Zhang
Vincent Ng
Prof. Dongyan Zhao
Sujian Li
Prof. Dr. Hongying Zan
Copyright Year
2018
Electronic ISBN
978-3-319-99501-4
Print ISBN
978-3-319-99500-7
DOI
https://doi.org/10.1007/978-3-319-99501-4
