## About this book

This two-volume set of LNAI 13028 and LNAI 13029 constitutes the refereed proceedings of the 10th CCF Conference on Natural Language Processing and Chinese Computing, NLPCC 2021, held in Qingdao, China, in October 2021.

The 66 full papers, 23 poster papers, and 27 workshop papers presented were carefully reviewed and selected from 446 submissions. They are organized in the following areas: Fundamentals of NLP; Machine Translation and Multilinguality; Machine Learning for NLP; Information Extraction and Knowledge Graph; Summarization and Generation; Question Answering; Dialogue Systems; Social Media and Sentiment Analysis; NLP Applications and Text Mining; and Multimodality and Explainability.

## Table of contents

### Coreference Resolution: Are the Eliminated Spans Totally Worthless?

To date, various neural methods have been proposed for joint mention span detection and coreference resolution. However, existing studies on coreference resolution mainly depend on mention representations, while the remaining spans in the text are largely ignored and directly eliminated. In this paper, we investigate whether those eliminated spans are totally worthless, or to what extent they can help improve the performance of coreference resolution. To achieve this goal, we propose to refine the representation of mentions by leveraging global spans, including the eliminated ones. On this basis, we further introduce an additional loss term to encourage diversity between different entity clusters. Experimental results on the document-level CoNLL-2012 Shared Task English dataset show that the eliminated spans are indeed useful and that our proposed approaches achieve promising results in coreference resolution.

Xin Tan, Longyin Zhang, Guodong Zhou

### Chinese Macro Discourse Parsing on Dependency Graph Convolutional Network

The macro-level discourse parsing, as a fundamental task of macro discourse analysis, mainly focuses on converting a document into a hierarchical discourse tree at paragraph level. Most existing methods follow micro-level studies and suffer from the issues of semantic representation and the semantic interaction of the larger macro discourse units. Therefore, we propose a macro-level discourse parser based on the dependency graph convolutional network to enhance the semantic representation of the large discourse unit and the semantic interaction between those large discourse units. Experimental results on both the Chinese MCDTB and English RST-DT show that our model outperforms several state-of-the-art baselines.

Yaxin Fan, Feng Jiang, Xiaomin Chu, Peifeng Li, Qiaoming Zhu

### Predicting Categorial Sememe for English-Chinese Word Pairs via Representations in Explainable Sememe Space

A sememe is the minimum unambiguous semantic unit in human language. Sememe knowledge bases (SKBs) have been proven to be effective in many NLP tasks. The categorial sememe, which indicates the basic category of a word sense and bridges the lexicon and semantics, is indispensable in an SKB. However, manual categorial sememe annotation is costly. This paper proposes a new task for automatically building SKBs: English-Chinese Word Pair Categorial Sememe Prediction. Bilingual information is utilized to resolve the ambiguity challenge. Our method proposes the sememe space, in which sememes, words, and word senses are represented as vectors with interpretable semantics, to bridge the semantic gap between sememes and words. Extensive experiments and analyses validate the effectiveness of the proposed method. Using this method, we predict categorial sememes for 113,014 new word senses, with a prediction MAP of 85.8%. Furthermore, we conduct expert annotation based on the prediction results and expand HowNet by nearly 50%. We will publish all the data and code.

Baoju Liu, Lei Hou, Xin Lv, Juanzi Li, Jinghui Xiao

### Multi-level Cohesion Information Modeling for Better Written and Dialogue Discourse Parsing

Discourse parsing has attracted more and more attention due to its importance for Natural Language Understanding. Accordingly, various neural models have been proposed and have achieved certain success. However, due to the limited scale of the available corpora, outstanding performance still depends on additional features. Unlike previous neural studies that employ a simple flat word-level EDU (Elementary Discourse Unit) representation, we improve the performance of discourse parsing by employing an EDU representation enhanced with cohesion information (in this paper, we regard lexical chains and coreference chains as cohesion information). In particular, we first use WordNet and a coreference resolution model to automatically extract lexical and coreference chains, respectively. Second, we construct an EDU-level graph based on the extracted chains. Finally, using a Graph Attention Network, we incorporate the obtained cohesion information into the EDU representation to improve discourse parsing. Experiments on RST-DT, CDTB and STAC show that our proposed cohesion-enhanced EDU representation benefits both written and dialogue discourse parsing, compared with the baseline model we reproduced.

Jinfeng Wang, Longyin Zhang, Fang Kong

### ProPC: A Dataset for In-Domain and Cross-Domain Proposition Classification Tasks

Correctly identifying the types of propositions helps to understand the logical relationship between sentences, and is of great significance to natural language understanding, reasoning and generation. However, previous studies have three limitations: 1) they consider only explicit propositions, while most propositions in texts are implicit; 2) they only detect whether a sentence is a proposition, although it is more meaningful to identify which proposition type it belongs to; 3) they are restricted to the encyclopedia domain, whereas propositions exist widely in various domains. We present ProPC, a dataset for in-domain and cross-domain proposition classification. It consists of 15,000 sentences in 4 different classes across 5 different domains. We define two new tasks: 1) in-domain proposition classification, which is to identify the proposition type of a given sentence (not limited to explicit propositions); 2) cross-domain proposition classification, which takes encyclopedia as the source domain and the other 4 domains as the target domains. We use Matching, BERT and RoBERTa as our baseline methods and run experiments on each task. The results show that machines can indeed learn the characteristics of various types of propositions from explicit propositions and classify implicit propositions, but the ability to generalize across domains still needs to be strengthened. Our dataset, ProPC, is publicly available at https://github.com/NLUSoCo/ProPC .

Mengyang Hu, Pengyuan Liu, Lin Bo, Yuting Mao, Ke Xu, Wentao Su

### CTRD: A Chinese Theme-Rheme Discourse Dataset

Discourse topic structure is the key to the cohesion of the discourse and reflects the essence of the text. Current Chinese discourse corpora are constructed mainly based on rhetorical and semantic relations, and thus ignore the functional information in discourse. To alleviate this problem, we introduce a new Chinese discourse analysis dataset called CTRD, which stands for Chinese Theme-Rheme Discourse dataset. Unlike previous discourse banks, CTRD was annotated according to a novel discourse annotation scheme based on the Chinese theme-rheme theory and thematic progression patterns from Halliday’s systemic functional grammar. As a result, we manually annotated 525 news documents from OntoNotes 4.0 with a Kappa value greater than 0.6. Preliminary experiments on this corpus verify the computability of CTRD. Finally, we make CTRD available at https://github.com/ydc/ctrd .

Biao Fu, Yiqi Tong, Dawei Tian, Yidong Chen, Xiaodong Shi, Ming Zhu

### Learning to Select Relevant Knowledge for Neural Machine Translation

Most memory-based methods use encoded retrieved pairs as the translation memory (TM) to provide external guidance, but there still exist some noisy words in the retrieved pairs. In this paper, we propose a simple and effective end-to-end model to select useful sentence words from the encoded memory and incorporate them into the NMT model. Our model uses a novel memory selection mechanism to avoid the noise from similar sentences and provide external guidance simultaneously. To verify the positive influence of selected retrieved words, we evaluate our model on the single-domain dataset namely JRC-Acquis and multi-domain dataset comprised of existing benchmarks including WMT, IWSLT, JRC-Acquis, and OpenSubtitles. Experimental results demonstrate our method can improve the translation quality under different scenarios.

Jian Yang, Juncheng Wan, Shuming Ma, Haoyang Huang, Dongdong Zhang, Yong Yu, Zhoujun Li, Furu Wei

### Contrastive Learning for Machine Translation Quality Estimation

Machine translation quality estimation (QE) aims to evaluate the result of translation without reference. Existing approaches require large amounts of training data or model-related features, leading to impractical applications in real world. In this work, we propose a contrastive learning framework to train QE model with limited parallel data. Concretely, we use denoising autoencoder to create negative samples based on sentence reconstruction. Then the QE model is trained to distinguish the golden pair from the negative samples in a contrastive manner. To this end, we propose two contrastive learning architectures, namely Contrastive Classification and Contrastive Ranking. Experiments on four language pairs of MLQE dataset show that our method achieves strong results in both zero-shot and supervised settings. To the best of our knowledge, this is the first trial of contrastive learning on QE.

Hui Huang, Hui Di, Jian Liu, Yufeng Chen, Kazushige Ouchi, Jinan Xu
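
The ranking idea in the abstract above (score the golden pair above negative samples) can be illustrated with a margin-based hinge loss. This is only a minimal sketch under assumptions: the abstract does not give the loss used in the paper, and the function name `contrastive_ranking_loss`, the `margin` value, and the scores below are hypothetical.

```python
def contrastive_ranking_loss(gold_score, negative_scores, margin=0.1):
    """Mean hinge loss that pushes the gold pair's score above each negative's.

    Assumption: a margin ranking loss stands in for the paper's objective,
    which the abstract does not specify.
    """
    if not negative_scores:
        raise ValueError("need at least one negative sample")
    # A negative contributes loss only if it is not at least `margin` below gold.
    losses = [max(0.0, margin - (gold_score - neg)) for neg in negative_scores]
    return sum(losses) / len(losses)


# Hypothetical QE scores: one golden pair and three reconstructed negatives.
loss_separated = contrastive_ranking_loss(0.9, [0.2, 0.5, 0.7])
loss_violated = contrastive_ranking_loss(0.4, [0.2, 0.5, 0.7])
```

When the golden pair already outscores every negative by at least the margin, the loss vanishes; negatives that score too close to or above the golden pair contribute a positive penalty.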

### Sentence-State LSTMs For Sequence-to-Sequence Learning

The Transformer is currently the dominant method for sequence-to-sequence problems. In contrast, RNNs have become less popular due to their lack of parallelization capabilities and relatively lower performance. In this paper, we propose to use a parallelizable variant of bi-directional LSTMs (BiLSTMs), namely sentence-state LSTMs (S-LSTM), as an encoder for sequence-to-sequence tasks. The complexity of S-LSTM is only $\mathcal{O}(n)$, compared to $\mathcal{O}(n^2)$ for the Transformer. On four neural machine translation benchmarks, we empirically find that S-LSTM can achieve significantly better performance than BiLSTMs and convolutional neural networks (CNNs). When compared to the Transformer, our model gives competitive performance while being 1.6 times faster during inference.

Xuefeng Bai, Yafu Li, Zhirui Zhang, Mingzhou Xu, Boxing Chen, Weihua Luo, Derek Wong, Yue Zhang

### Guwen-UNILM: Machine Translation Between Ancient and Modern Chinese Based on Pre-Trained Models

Ancient Chinese literatures are not only the unique cultural heritage of China but also the treasures of world civilization. Nevertheless, it has become quite difficult for modern people to comprehend or even create ancient works with the evolution of language in the long history. Translation is therefore playing a key role in bridging the two eras. This paper is to develop an automatic translation method between ancient and modern Chinese literature. To start with, an open sourced sentence level parallel corpus of ancient-modern Chinese is established since there is no available parallel corpus open for use. As the seq2seq-based machine translation models do not work well on this task, the pre-trained model UNILM is then applied in our method considering the monolingual characteristics of this task. Furthermore, the ancient Chinese pre-trained model - Guwen-BERT is utilized to further improve the performance of the method. The quality of translation is evaluated by both Human Evaluation and two automatic metrics: a) case-sensitive BLEU scores and b) Imagery Conservation (I.C), which is first developed in this paper. The experimental results under all metrics show that our method can generate higher quality of translation.

Zinong Yang, Ke-jia Chen, Jingqiang Chen

### Adaptive Transformer for Multilingual Neural Machine Translation

Multilingual neural machine translation (MNMT) with a single encoder-decoder model has attracted much interest due to its simple deployment and low training cost. However, the all-shared translation model often yields degraded performance due to the modeling capacity limitations and language diversity. Moreover, it has been revealed in recent studies that the shared parameters lead to negative language interference although they may also facilitate knowledge transfer across languages. In this work, we propose an adaptive architecture for multilingual modeling, which divides the parameters in MNMT sub-layers into shared and language-specific ones. We train the model to learn and balance the shared and unique features with different degrees of parameter sharing. We evaluate our model on one-to-many and many-to-one translation tasks. Experiments on IWSLT dataset show that our proposed model remarkably outperforms the multilingual baseline model and achieves comparable or even better performance compared with the bilingual model.

Junpeng Liu, Kaiyu Huang, Jiuyi Li, Huan Liu, Degen Huang

### Improving Non-autoregressive Machine Translation with Soft-Masking

In recent years, non-autoregressive machine translation has achieved great success due to its promising inference speedup. Non-autoregressive machine translation reduces the decoding latency by generating the target words in a single pass. However, there is a considerable gap in accuracy between non-autoregressive and autoregressive machine translation. Because it removes the dependencies between the target words, non-autoregressive machine translation tends to generate repetitive or wrong words, which lead to low performance. In this paper, we introduce a soft-masking method to alleviate this issue. Specifically, we introduce an autoregressive discriminator, which outputs probabilities indicating which embeddings are correct. Then, according to these probabilities, we add a mask to the copied representations, which enables the model to consider which words are easy to predict. We evaluated our method on three benchmarks: WMT14 EN→DE, WMT16 EN→RO, and IWSLT14 DE→EN. The experimental results demonstrate that our method can outperform the baseline by a large margin at the cost of a small reduction in speed.

Shuheng Wang, Shumin Shi, Heyan Huang
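
One way to picture the soft-masking step in the abstract above is as an interpolation between a copied representation and a mask embedding, weighted by the discriminator's probability that the representation is correct. This is a sketch under assumptions: the interpolation form, the function name `soft_mask`, and the toy vectors are illustrative and not taken from the paper.

```python
def soft_mask(embedding, mask_embedding, p_correct):
    """Blend an embedding with a mask embedding by its correctness probability.

    Assumption: soft-masking is modeled as p * e + (1 - p) * m, so a
    representation judged likely wrong is pushed toward the mask embedding.
    """
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must be a probability in [0, 1]")
    if len(embedding) != len(mask_embedding):
        raise ValueError("embeddings must have the same dimension")
    return [p_correct * e + (1.0 - p_correct) * m
            for e, m in zip(embedding, mask_embedding)]


# Hypothetical 3-dimensional representation and an all-zero mask embedding.
token = [1.0, 2.0, 3.0]
mask = [0.0, 0.0, 0.0]
kept = soft_mask(token, mask, 1.0)    # judged correct: unchanged
half = soft_mask(token, mask, 0.5)    # uncertain: halfway to the mask
masked = soft_mask(token, mask, 0.0)  # judged wrong: fully masked
```

Unlike a hard mask, this keeps part of the original signal for uncertain positions instead of discarding it entirely.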

### AutoNLU: Architecture Search for Sentence and Cross-sentence Attention Modeling with Re-designed Search Space

The rise of BERT-style pre-trained models has significantly improved performance on natural language understanding (NLU) tasks. However, for industrial usage, we still have to rely on more traditional models for efficiency. Thus, in this paper, we present AutoNLU, which is designed for modeling sentence representation and cross-sentence attention in an automatic network architecture search (NAS) manner. We have two main contributions. First, we design a novel and comprehensive search space that consists of encoder operations, aggregator operations, and important design choices. Second, aiming at sentence-pair tasks, we use NAS to automatically model how the representations of two sentences interact with and attend to each other. A reinforcement learning (RL) based search algorithm is enhanced by cross-operation and cross-layer parameter sharing for efficient and reliable search. Model training is done by distilling knowledge from BERT models. By experimenting on SST-2, RTE, Sci-Tail and CoNLL 2003, we verify that our learned models are better at learning from BERT teachers than other baseline models. Ablation studies on Sci-Tail show that our search space design is valid, and that our proposed strategies are helpful for improving the search results. (The source code will be made publicly available.)

Wei Zhu

### AutoTrans: Automating Transformer Design via Reinforced Architecture Search

Though transformer architectures have shown dominance in many natural language understanding tasks, there are still unsolved issues in the training of transformer models, especially the need for a principled way of warm-up, which has been shown to be important for stable training of a transformer, as well as whether the task at hand prefers to scale the attention product or not. In this paper, we empirically explore automating the design choices in the transformer model, i.e., how to set layer-norm, whether to scale, number of layers, number of heads, activation function, etc., so that one can obtain a transformer architecture that better suits the task at hand. RL is employed to navigate the search space, and special parameter sharing strategies are designed to accelerate the search. It is shown that sampling a proportion of the training data per epoch during search helps to improve the search quality. Experiments on CoNLL03, Multi-30k and WMT-14 show that the searched transformer model can outperform standard transformers. In particular, we show that our learned model can be trained more robustly with large learning rates without warm-up.

Wei Zhu, Xiaoling Wang, Yuan Ni, Guotong Xie

### A Word-Level Method for Generating Adversarial Examples Using Whole-Sentence Information

Yufei Liu, Dongmei Zhang, Chunhua Wu, Wei Liu

### RAST: A Reward Augmented Model for Fine-Grained Sentiment Transfer

In this paper, we propose a novel model RAST (Reward Augmented Sentiment Transfer) for fine-grained sentiment transfer. Existing methods usually suffer from two major drawbacks, i.e., blurred sentiment distinction and unsatisfactory content preservation. Considering the above issues, we design two kinds of rewards to better control sentiment and content. Specifically, we develop a pairwise comparative discriminator that enforces the generation of sentences with clear distinctions between different sentiment intensities. Moreover, we utilize an effective sampling strategy to obtain pseudo-parallel sentences with minor changes to the input sentence to enhance content preservation. Experiments on a benchmark dataset show that the proposed model outperforms several competitive approaches.

Xiaoxuan Hu, Hengtong Zhang, Wayne Xin Zhao, Yaliang Li, Jing Gao, Ji-Rong Wen

### Pre-trained Language Models for Tagalog with Multi-source Data

Pre-trained language models (PLMs) for Tagalog can be categorized into two kinds: monolingual models and multilingual models. However, existing monolingual models are only trained in small-scale Wikipedia corpus and multilingual models fail to deal with Tagalog-specific knowledge needed for various downstream tasks. We train three existing models on a much larger corpus: BERT-uncased-base, ELECTRA-uncased-base and RoBERTa-base. At the pre-training stage, we construct a large-scale news text corpus for pre-training in addition to the existing open-source corpora. Experimental results show that our pre-trained models achieve consistently competitive results in various Tagalog-specific natural language processing (NLP) tasks including part-of-speech (POS) tagging, hate speech classification, dengue classification and natural language inference (NLI). Among them, POS tagging dataset is a self-constructed dataset aiming to alleviate the insufficient labeled resource for Tagalog. We will release all pre-trained models and datasets to the community, hoping to facilitate the future development of Tagalog NLP applications.

Shengyi Jiang, Yingwen Fu, Xiaotian Lin, Nankai Lin

### Accelerating Pretrained Language Model Inference Using Weighted Ensemble Self-distillation

Pretrained language models (PLMs) have achieved remarkable results in various natural language processing tasks. However, as model performance increases, so do computational consumption and inference time, which makes deploying PLMs on edge devices for low-latency applications challenging. To address this issue, recent studies have recommended applying either model compression or early-exiting techniques to accelerate inference. However, model compression permanently discards modules of the model, leading to a decline in performance. Early-exiting strategies, in turn, train the PLM backbone and the early-exiting classifiers separately, which not only brings extra training cost but also loses semantic information from higher layers, resulting in unreliable decisions by the early-exiting classifiers. In this study, a weighted ensemble self-distillation method is proposed to improve the early-exiting strategy and to balance performance and inference time. It enables early-exiting classifiers to obtain rich semantic information from different layers through an attention mechanism, according to the contribution of each layer to the final prediction. Furthermore, it simultaneously performs weighted ensemble self-distillation and fine-tuning of the PLM backbone, so that the PLM can be fine-tuned during the training of the early-exiting classifiers to preserve performance as much as possible. The experimental results show that the proposed model accelerates inference with minimal performance loss, outperforming previous early-exiting models. The code is available at: https://github.com/JunKong5/WestBERT .

Jun Kong, Jin Wang, Xuejie Zhang
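
For readers unfamiliar with early exiting, the sketch below shows the general inference-time idea: stop at the first intermediate classifier whose prediction is confident enough. The entropy-based criterion, the function names, and the per-layer probabilities are assumptions for illustration; the abstract above does not state which exit criterion the paper uses, and this is not the code from the linked repository.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def early_exit(layer_probs, threshold):
    """Return (layer_index, predicted_class) for the first confident layer.

    Assumption: a layer is "confident" when its prediction entropy falls
    below `threshold`. If no layer qualifies, the last layer is used.
    """
    if not layer_probs:
        raise ValueError("need at least one layer output")
    for index, probs in enumerate(layer_probs):
        if entropy(probs) < threshold:
            return index, max(range(len(probs)), key=probs.__getitem__)
    last = layer_probs[-1]
    return len(layer_probs) - 1, max(range(len(last)), key=last.__getitem__)


# Hypothetical class distributions from four stacked exit classifiers.
layer_probs = [[0.5, 0.5], [0.7, 0.3], [0.95, 0.05], [0.99, 0.01]]
exit_layer, label = early_exit(layer_probs, threshold=0.3)
```

A lower threshold trades speed for caution: with `threshold=0.0` no intermediate layer qualifies and the full model is always run.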

### Employing Sentence Compression to Improve Event Coreference Resolution

Most previous studies on event coreference resolution focused on measuring the similarity between two event sentences. However, a sentence may contain more than one event, and the redundant event information will interfere with the calculation of event similarity. To address this issue, this paper proposes an event coreference resolution framework based on an event sentence compression mechanism, which uses an AutoEncoder-based model to compress the extracted event sentences based on the event triggers. Meanwhile, the information interaction between the compressed sentences and their original event sentences is used to supplement important information missing from the compressed sentences. Experimental results on both the KBP 2016 and KBP 2017 datasets show that our proposed model outperforms several state-of-the-art baselines.

Xinyu Chen, Sheng Xu, Peifeng Li, Qiaoming Zhu

### BRCEA: Bootstrapping Relation-Aware Cross-Lingual Entity Alignment

Entity alignment aims to align entities referring to the same identity in the real world across different Knowledge Graphs (KGs), which is a fundamental task of KG construction and KG fusion. Recent works focus on embedding-based approaches. With the pre-aligned entity pairs, these approaches mainly embed entities based on relation triples to capture structural information and then try to refine the entity embeddings using the self-characteristics contained in attribute triples. However, insufficient training data, diverse expressions of attributes, and the differing importance of self-characteristics and structural information in different KGs are three obstacles to entity embedding. To tackle these problems, we propose a novel Bootstrapping Relation-aware model for Cross-lingual Entity Alignment (BRCEA) that uses both relation triples and attribute triples of KGs. First, given the base prior alignments, it separately embeds entities from two aspects, namely self-characteristics and structural information. Then, a bootstrapping component discovers two sets of new alignments. Finally, the two sets are used to construct new training data for the next iteration to overcome the sparsity of training data. We evaluated our model on several real-world datasets, and the results show that it outperforms the state-of-the-art models for cross-lingual entity alignment.

Yujing Zhang, Feng Zhou, Xiaoyong Y. Li

### Employing Multi-granularity Features to Extract Entity Relation in Dialogue

Extracting relational triples from unstructured text is essential for the construction of large-scale knowledge graphs, QA and other downstream tasks. The purpose of dialogue relation extraction is to extract the relations between entities from the multi-person dialogue texts. The existing dialogue relation extraction models only focused on coarse-grained global information and ignored fine-grained local information. In this paper, we propose a dialogue relation extraction model BERT-MG to capture the features on different granularity at different BERT layers to take advantage of the fine-grained dialogue features. Moreover, we design a type-confidence mechanism to use the entity type information to assist relation inference. Experimental results on the DialogRE dataset prove that our proposed model BERT-MG outperforms the SOTA baselines.

Qiqi Wang, Peifeng Li

### Attention Based Reinforcement Learning with Reward Shaping for Knowledge Graph Reasoning

Knowledge graph reasoning aims at solving certain tasks by finding reasoning paths, and has attracted extensive attention. Recently, solutions for path reasoning that incorporate reinforcement learning have made considerable progress. However, these studies mainly focus on the agent’s choice of relation and ignore the importance of entities, which causes the agent to select randomly when 1-N/N-N relations occur. Thus, we propose a reinforcement learning based path reasoning model, which addresses this problem at the topological and semantic levels. First, an attention mechanism is introduced in our model, which extracts hidden features from neighbor entities and helps the policy network make a suitable choice, rather than a random one, among actions with the same relation. Then, we introduce a convolutional neural network into our model to judge the rationality of a path from its semantic features. To mitigate the negative impact of terminal rewards, we use a potential-based reward shaping function, which takes the potential gap between agent states as the reward and requires no pre-training. Finally, we compare our model with the state-of-the-art baselines on two benchmark datasets; the results of extensive comparison experiments validate the effectiveness of the proposed method.

Sheng Wang, Xiaoying Chen, Shengwu Xiong
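
The "potential gap between agent states" in the abstract above matches the standard form of potential-based reward shaping, where the shaping term for a transition from state s to s' is γΦ(s') − Φ(s). The sketch below uses that textbook form; the potential function used in the paper is not given in the abstract, so the potentials, the discount factor, and the function names are hypothetical.

```python
def shaping_reward(phi_state, phi_next_state, gamma):
    """Standard potential-based shaping term: gamma * Phi(s') - Phi(s)."""
    return gamma * phi_next_state - phi_state


def discounted_shaping_return(potentials, gamma):
    """Discounted sum of shaping terms along one path of states."""
    total = 0.0
    for step in range(len(potentials) - 1):
        total += gamma ** step * shaping_reward(
            potentials[step], potentials[step + 1], gamma)
    return total


# Hypothetical potentials Phi(s_0..s_3) of the states visited on one path.
potentials = [0.1, 0.4, 0.2, 0.9]
gamma = 0.9
total = discounted_shaping_return(potentials, gamma)

# The sum telescopes: only the first and last potentials remain.
telescoped = gamma ** 3 * potentials[-1] - potentials[0]
```

Because the discounted sum telescopes, the shaping terms give the agent dense intermediate feedback without changing which paths are optimal, which is the usual motivation for this form of shaping.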

### Entity-Aware Relation Representation Learning for Open Relation Extraction

Open relation extraction aims at extracting novel relations from open-domain corpora. However, most recent works typically treat entities and tokens equally while encoding sentences, without taking full advantage of the guiding role of entities in representation learning. In this work, we propose the Entity-Aware Relation Representation learning framework for open relation extraction and establish the new state-of-the-art on standard benchmarks. It gives more attention to entities when learning representations by leveraging an entity-aware attention mechanism. And we further propose a pair-wise contrastive loss to learn relation representations effectively in terms of alignment and uniformity. Extensive experimental results show that our framework achieves significant improvements compared to state-of-the-art models.

Zihao Liu, Yan Zhang, Huizhen Wang, Jingbo Zhu
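
Alignment and uniformity, mentioned in the abstract above, are commonly measured as follows: alignment is the mean squared distance between positive pairs, and uniformity is the log of the mean Gaussian kernel over all embedding pairs. The sketch assumes this common formalization and is not confirmed by the abstract; the temperature `t=2.0`, the function names, and the toy embeddings are illustrative.

```python
import math
from itertools import combinations


def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def alignment(positive_pairs):
    """Mean squared distance between positive pairs (lower is better)."""
    return sum(squared_distance(u, v) for u, v in positive_pairs) / len(positive_pairs)


def uniformity(embeddings, t=2.0):
    """Log of the mean Gaussian kernel over all pairs (lower is better)."""
    pairs = list(combinations(embeddings, 2))
    mean_kernel = sum(math.exp(-t * squared_distance(u, v)) for u, v in pairs) / len(pairs)
    return math.log(mean_kernel)


# Hypothetical unit-norm relation embeddings: spread out vs. collapsed.
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
collapsed = [[1.0, 0.0]] * 4

uniform_spread = uniformity(spread)
uniform_collapsed = uniformity(collapsed)
perfect_alignment = alignment([([1.0, 0.0], [1.0, 0.0])])
```

Collapsed embeddings align perfectly but score worst on uniformity, which is why the two quantities are usually considered together.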

### ReMERT: Relational Memory-Based Extraction for Relational Triples

Relational triple extraction aims to detect entity pairs (subjects, objects) along with their relations. Previous work failed to deal with complex relational triples, such as overlapping triples and nested entities, and lacked semantic representation in the process of extracting entity pairs and relations. To mitigate these issues, we propose a joint extraction model called ReMERT, which first decomposes the joint extraction task into three interrelated subtasks, namely RSE (Relation-specific Subject Extraction), RM (Relational Memory) module construction and OE (Object Extraction). The first subtask is to distinguish all subjects that may be involved with target relations, the second is to retrieve the target relational representation from the RM module, and the last is to identify the corresponding objects for each specific (s, r) pair. Additionally, the RSE and OE subtasks are further decomposed into sequence labeling problems based on the proposed hierarchical binary tagging scheme. Owing to this decomposition strategy, the proposed model can fully capture the semantic interdependency between different subtasks, as well as reduce noise from irrelevant entity pairs. Experimental results show that the proposed method outperforms previous work by 0.8% (F1 score), achieving a new state-of-the-art on Chinese DuIE datasets. We also conduct extensive experiments and obtain promising results on both the public English NYT and Chinese DuIE datasets.

Chongshuai Zhao, Xudong Dai, Lin Feng, Peng Liu

### Recognition of Nested Entity with Dependency Information

Named entity recognition (NER) is a basic task in natural language processing. However, most existing models struggle to detect entities with a nested structure, in which an entity contains one or more other entities. In this paper, we propose a boundary-aware approach for nested NER. First, word information is incorporated in the same dimension via a lexicon, in which characters are fed into an LSTM to learn the internal structure of words and obtain character representations. To augment the word representation, a Graph Convolutional Network (GCN) is applied to extract dependency information between entities. Second, our model detects boundaries to locate entities by using Star-Transformer, which is suitable for small-scale corpora and unstructured texts because of its star structure. Based on the predicted boundaries, our model utilizes boundary-aware regions to predict entity categorical labels, which reduces the number of candidate entities and decreases computation cost. In our experiments, the model shows an impressive improvement on a forum corpus and performs well on a small-scale corpus.

Yu Xia, Fang Kong

### HAIN: Hierarchical Aggregation and Inference Network for Document-Level Relation Extraction

Document-level relation extraction (RE) aims to extract relations between entities within a document. Unlike sentence-level RE, it requires integrating evidence across multiple sentences. However, current models still lack the ability to effectively obtain relevant evidence for relation inference from multi-granularity information in the document. In this paper, we propose the Hierarchical Aggregation and Inference Network (HAIN), which enables the model to effectively predict relations by using global and local information from the document. Specifically, HAIN first constructs a meta dependency graph (mDG) to capture rich long-distance global dependency information across the document. It also constructs a mention interaction graph (MG) to model complex local interactions among different mentions. Finally, it creates an entity inference graph (EG), based on which we design a novel hybrid attention mechanism to integrate relevant global and local information for entities. Experimental results demonstrate that our model achieves superior performance on a large-scale document-level dataset (DocRED). Extensive analyses also show that the model is particularly effective in extracting relations between entities across multiple sentences and mentions.

Nan Hu, Taolin Zhang, Shuangji Yang, Wei Nong, Xiaofeng He

### Incorporate Lexicon into Self-training: A Distantly Supervised Chinese Medical NER

Medical named entity recognition (NER) tasks usually lack sufficient annotated data. Distant supervision, which can quickly and automatically generate annotated training datasets through dictionaries, is often used to alleviate this problem. However, the current distantly supervised method suffers from noisy labeling due to the limited coverage of the dictionary, which leaves a large number of entities unlabeled. We call this phenomenon the incomplete annotation problem. To tackle it, we propose a novel distantly supervised method for Chinese medical NER. Specifically, we propose a high-recall self-training mechanism to recall potential unlabeled entities in the distant supervision dataset. To reduce errors in the high-recall self-training, we propose a fine-grained lexicon-enhanced scoring and ranking mechanism. Our method achieves improvements of 3.2% and 5.03% over the baseline models on the dataset we proposed and on a benchmark dataset for Chinese medical NER.

Zhen Gan, Zhucong Li, Baoli Zhang, Jing Wan, Yubo Chen, Kang Liu, Jun Zhao, Yafei Shi, Shengping Liu
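
The incomplete annotation problem described in this abstract can be illustrated with a minimal sketch of dictionary-based distant labeling. This is not the authors' method: the function `distant_label`, the greedy longest-match strategy, the BIO tag scheme, and the two-entry lexicon are all assumptions chosen for illustration.

```python
def distant_label(text, lexicon):
    """Assign character-level BIO tags by greedy longest match against a lexicon.

    `lexicon` maps an entity string to its type. Anything the lexicon does not
    cover is tagged "O", even if it is a real entity.
    """
    max_len = max((len(entry) for entry in lexicon), default=0)
    labels = ["O"] * len(text)
    i = 0
    while i < len(text):
        # Try the longest candidate span starting at i first.
        for j in range(min(len(text), i + max_len), i, -1):
            entity_type = lexicon.get(text[i:j])
            if entity_type is not None:
                labels[i] = "B-" + entity_type
                for k in range(i + 1, j):
                    labels[k] = "I-" + entity_type
                i = j
                break
        else:
            i += 1  # no lexicon entry starts here
    return labels


# Hypothetical lexicon: it covers a disease and one drug, but not the second drug.
lexicon = {"糖尿病": "DISEASE", "胰岛素": "DRUG"}
text = "糖尿病患者使用胰岛素和二甲双胍"
labels = distant_label(text, lexicon)
for char, label in zip(text, labels):
    print(char, label)
```

In this example the drug name 二甲双胍 is absent from the lexicon, so its four characters are tagged "O". Such false negatives are the kind of unlabeled entities that a high-recall self-training step would aim to recover.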

### Diversified Paraphrase Generation with Commonsense Knowledge Graph

Paraphrases are texts that convey the same meaning with different expressions, and paraphrase generation is usually modeled as a sequence-to-sequence (Seq2Seq) learning problem. Traditional Seq2Seq models mainly concentrate on fidelity while ignoring the diversity of paraphrases. Although recent studies have begun to focus on the diversity of generated paraphrases, they either adopt inflexible control mechanisms or are restricted to synonyms and topic knowledge. In this paper, we propose the KnowledgE-Enhanced Paraphraser (KEEP) for diversified paraphrase generation, which leverages a commonsense knowledge graph to explicitly enrich the expressions of paraphrases. Specifically, KEEP retrieves word-level and phrase-level knowledge from an external knowledge graph, and learns to choose the more relevant items using a graph attention mechanism. Extensive experiments on paraphrase generation benchmarks show the strengths of our proposed model, especially in diversity, compared with several strong baselines.

Xinyao Shen, Jiangjie Chen, Yanghua Xiao

### Explore Coarse-Grained Structures for Syntactically Controllable Paraphrase Generation

Syntactically controlled paraphrase generation can produce diverse paraphrases by exposing syntactic control, where semantic preservation and syntactic variation are both important factors. Previous works mainly focus on using fine-grained syntactic structures (e.g., a full parse tree) as syntactic control. While these methods can achieve excellent syntactic controllability, they often fail to preserve the semantics of the input sentence. The main reason is that it is difficult to retrieve syntactic structures that are perfectly compatible with the input sentences. In this paper, we explore coarse-grained syntactic structures to trade off semantic preservation and syntactic variation. Furthermore, to improve semantic preservation and syntactic controllability, we propose a Syntax Attention-Guided Paraphrase (SAGP) model that can correctly select syntactic information according to the current state for surface realization. Experimental results show that SAGP outperforms the previous state-of-the-art method under the same setting. Additionally, we validate that using coarse-grained structures can generate more semantically reasonable text without affecting syntactic controllability.

Erguang Yang, Mingtong Liu, Deyi Xiong, Yujie Zhang, Yao Meng, Changjian Hu, Jinan Xu, Yufeng Chen

### Chinese Poetry Generation with Metrical Constraints

Poetry is a kind of literary art that conveys emotion through aesthetic expression. Automatic poetry generation is challenging because a poem must satisfy both semantic requirements (content) and metrical constraints (form). Most previous work lacks effective use of metrical information, so the generated poems may break these constraints, which are essential for poetry. In this paper, we formulate the poetry generation task as a constrained text generation problem. A Transformer-based dual-encoder model is then proposed to condition poetry generation on both the writing intention and the metrical patterns. We conduct experiments on three popular genres of Chinese classical poetry: quatrains, regulated verse, and Song iambics. Both automatic and human evaluation results confirm that our method (poetry generation with metrical constraints, MCPG) significantly improves the metrical compliance of generated poems while maintaining coherence and fluency.

Yingfeng Luo, Changliang Li, Canan Huang, Chen Xu, Xin Zeng, Binghao Wei, Tong Xiao, Jingbo Zhu

### CNewSum: A Large-Scale Summarization Dataset with Human-Annotated Adequacy and Deducibility Level

Automatic text summarization aims to produce a brief but crucial summary of the input documents. Both extractive and abstractive methods have achieved great success on English datasets in recent years. However, there has been minimal exploration of text summarization in other languages, limited by the lack of large-scale datasets. In this paper, we present a large-scale Chinese news summarization dataset, CNewSum, which consists of 304,307 documents and human-written summaries for the news feed. It has long documents with highly abstractive summaries, which encourages document-level understanding and generation for current summarization models. An additional distinguishing feature of CNewSum is that its test set includes adequacy and deducibility annotations for the summaries. The adequacy level measures the degree of summary information covered by the document, and the deducibility indicates the reasoning ability the model needs to generate the summary. These annotations help researchers target their model performance bottleneck. We examine recent methods on CNewSum and will release our dataset after the anonymous period to provide a solid testbed for automatic Chinese summarization research.

Danqing Wang, Jiaze Chen, Xianze Wu, Hao Zhou, Lei Li

### Question Generation from Code Snippets and Programming Error Messages

For some inexperienced developers, extracting key information from code snippets and programming error messages and turning it into a highly readable question can help them better understand, locate, and search for the cause of errors. This paper proposes a copy mechanism guided transformer with pre-trained programming and natural language representations (CMPPN) to automatically generate highly readable questions from code snippets and programming error messages. CMPPN is a transformer-based model that is pre-trained on a large-scale code corpus with a code summarization task and incorporates a copy mechanism in the fine-tuning phase. To evaluate our proposed model, we create a new dataset based on Stack Overflow posts, which contains code snippets, programming error messages, and corresponding question headlines in three programming languages (Java, C#, and Python). Extensive experimental results on this dataset verify the effectiveness of CMPPN compared to baseline methods. Both the dataset and the model are available at https://github.com/YuiTH/CEMS-SO.

Bolun Yao, Wei Chen, Yeyun Gong, Bartuer Zhou, Jin Xie, Zhongyu Wei, Biao Cheng, Nan Duan

### Extractive Summarization of Chinese Judgment Documents via Sentence Embedding and Memory Network

A rapidly rising number of open judgment documents has increased the need for automatic summarization. Since Chinese judgment documents are characterized by a lengthy and logical structure, extractive summarization is an effective method for them. However, existing extractive models generally cannot capture information between sentences. To enable the model to obtain long-term information in the judgment documents, this paper proposes an extractive model using sentence embeddings and a two-layer memory network. A pre-trained language model is used to encode sentences in judgment documents. Then a whitening operation is applied to obtain isotropic sentence embeddings, which makes the subsequent classification more accurate. These embeddings are fed into a unidirectional memory network to fuse previous sentence embeddings, followed by a bidirectional memory network that introduces position information of sentences. The experimental results show that our proposed model outperforms the baseline methods on the SFZY dataset from CAIL2020.

Yan Gao, Zhengtao Liu, Juan Li, Jin Tang

### ThinkTwice: A Two-Stage Method for Long-Text Machine Reading Comprehension

Long-text machine reading comprehension (LT-MRC) requires a machine to answer questions based on a lengthy text. Although transformer-based models achieve promising results, most of them cannot handle long sequences because processing them is too time-consuming. A typical solution uses a sliding window to split the passage into equally spaced fragments and then predicts the answer from each fragment separately, without considering other contextual fragments. However, this approach lacks long-distance dependencies, which severely damages performance. To address this issue, we propose ThinkTwice, a two-stage method for LT-MRC. ThinkTwice casts the process of LT-MRC into two main steps: 1) it first retrieves several fragments in which the final answer is most likely to lie; 2) it then extracts the answer span from these fragments instead of from the lengthy document. We conduct experiments on NewsQA. The experimental results demonstrate that ThinkTwice can capture the most informative fragments from a long text. Meanwhile, ThinkTwice achieves considerable improvements compared to all existing baselines. All code has been released on GitHub ( https://github.com/Walle1493/ThinkTwice ).

Mengxing Dong, Bowei Zou, Jin Qian, Rongtao Huang, Yu Hong
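
The sliding-window baseline mentioned in this abstract can be sketched as follows. This is a minimal illustration rather than the ThinkTwice implementation: the function name `sliding_windows` and the `window` and `stride` parameters are assumptions, and integers stand in for tokens.

```python
def sliding_windows(tokens, window, stride):
    """Split a token list into equally spaced, possibly overlapping fragments.

    Returns a list of (start_offset, fragment) pairs. The last fragment may be
    shorter than `window` when the text does not divide evenly.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    fragments = []
    start = 0
    while True:
        fragments.append((start, tokens[start:start + window]))
        if start + window >= len(tokens):
            break  # this fragment already reaches the end of the text
        start += stride
    return fragments


tokens = list(range(10))  # stand-in for a tokenized passage
frags = sliding_windows(tokens, window=4, stride=3)
for start, fragment in frags:
    print(start, fragment)
```

Each fragment is processed independently in this scheme, which is why an answer whose supporting context falls in a different fragment can be missed.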

### EviDR: Evidence-Emphasized Discrete Reasoning for Reasoning Machine Reading Comprehension

Reasoning machine reading comprehension (R-MRC) aims to answer complex questions that require discrete reasoning based on text. To support discrete reasoning, evidence (typically the concise textual fragments that describe question-related facts, including topic entities and attribute values) provides crucial clues from question to answer. However, previous end-to-end methods that achieve state-of-the-art performance rarely place enough emphasis on the modeling of evidence, missing the opportunity to further improve the model’s reasoning ability for R-MRC. To alleviate this issue, we propose an Evidence-emphasized Discrete Reasoning approach (EviDR), in which sentence- and clause-level evidence is first detected based on distant supervision and then used to drive a reasoning module, implemented with a relational heterogeneous graph convolutional network, to derive answers. Extensive experiments are conducted on the DROP (discrete reasoning over paragraphs) dataset, and the results demonstrate the effectiveness of our proposed approach. In addition, qualitative analysis verifies the capability of the proposed evidence-emphasized discrete reasoning for R-MRC (code is released at https://github.com/JD-AI-Research-NLP/EviDR ).

Yongwei Zhou, Junwei Bao, Haipeng Sun, Jiahui Liang, Youzheng Wu, Xiaodong He, Bowen Zhou, Tiejun Zhao

### Knowledge-Grounded Dialogue with Reward-Driven Knowledge Selection

Knowledge-grounded dialogue is the task of generating a fluent and informative response based on both the conversation context and a collection of external knowledge. Knowledge selection plays an important role in this task and attracts more and more research interest. However, most existing models either select only one piece of knowledge or use all knowledge for response generation. The former may lose valuable information in the discarded knowledge, while the latter may introduce a lot of noise. At the same time, many approaches need to train the knowledge selector with knowledge labels that indicate the ground-truth knowledge, but these labels are difficult to obtain and require a large amount of manual annotation. Motivated by these issues, we propose Knoformer, a dialogue response generation model based on reinforcement learning, which can automatically select one or more relevant pieces of knowledge from the knowledge pool and does not need knowledge labels during training. Knoformer is evaluated on two knowledge-guided conversation datasets and achieves state-of-the-art performance.

Shilei Liu, Xiaofeng Zhao, Bochao Li, Feiliang Ren

### Multi-intent Attention and Top-k Network with Interactive Framework for Joint Multiple Intent Detection and Slot Filling

Multiple intent detection and slot filling are essential components of spoken language understanding. Existing methods treat multiple intent detection as a multi-label classification task. However, multi-label classification methods focus on the correlation between different intents and set a threshold to select the high-probability intents. This can cause the model to miss some of the correct intents. In this paper, to address this issue, we introduce the Multi-Intent Attention and Top-k Network with Interactive Framework (MIATIF) for joint multiple intent detection and slot filling. In particular, we model multi-intent attention to obtain the relation between the utterance and the intents. Meanwhile, we propose the top-k network to encode the distribution of different intents and accurately predict the number of intents. Experimental results on two publicly available multiple intent datasets show substantial improvement. In addition, our model saves 64%–72% of training time compared to the current state-of-the-art graph-based model.

Xu Jia, Jiaxin Pan, Youliang Yuan, Min Peng
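
The difference between threshold-based selection and selecting a predicted number of intents can be sketched in a few lines. This is a toy illustration, not the MIATIF model: the intent names and probabilities are invented, and `k` is passed in by hand, whereas the abstract describes a network that predicts the number of intents.

```python
def select_by_threshold(probs, threshold=0.5):
    """Keep every intent whose probability reaches the threshold."""
    return [intent for intent, p in probs.items() if p >= threshold]


def select_top_k(probs, k):
    """Keep the k most probable intents, regardless of any threshold."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    return ranked[:k]


# Hypothetical scores for an utterance that expresses two intents.
probs = {"book_flight": 0.81, "rent_car": 0.46, "play_music": 0.05}

by_threshold = select_by_threshold(probs)
by_top_k = select_top_k(probs, k=2)
print(by_threshold)
print(by_top_k)
```

With a fixed threshold of 0.5, the second intent is dropped because its score is just below the cut-off. If the number of intents is known or predicted to be two, the top-k rule keeps it.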

### Enhancing Long-Distance Dialogue History Modeling for Better Dialogue Ellipsis and Coreference Resolution

Previous work on dialogue-specific ellipsis and coreference resolution usually concatenates all dialogue history utterances into a single sequence. When the dialogue history is long, this may mislead the model into attending to inappropriate parts and copying from the wrong utterances. In this paper, we aim to model dialogue history at multiple granularities and take a deep look into the semantic connection between the dialogue history and the omitted or coreferred expressions. To achieve this, we propose a speaker highlight dialogue history encoder and a top-down hierarchical copy mechanism to generate the complete utterances. We conduct dozens of experiments on the CamRest676 dataset, and the experimental results show that our methods excel at long-distance dialogue history modeling and can significantly improve the performance of ellipsis and coreference resolution in the dialogue task.

Zixin Ni, Fang Kong

### Exploiting Explicit and Inferred Implicit Personas for Multi-turn Dialogue Generation

Learning and utilizing personas in open-domain dialogue has become a hotspot in recent years. Existing methods that only use predefined explicit personas enhance the personality to some extent; however, they cannot easily avoid persona inconsistency and weakly diverse responses. To address these problems, this paper proposes an effective model called Exploiting Explicit and Inferred Implicit Personas for Multi-turn Dialogue Generation (EIPD). Specifically, 1) an explicit persona extractor is designed to improve persona consistency; 2) taking advantage of the von Mises-Fisher (vMF) distribution in modeling directional data (e.g., the different persona states), we introduce implicit persona inference to increase diversity; 3) during generation, the persona response generator fuses the explicit and implicit personas in the response. The experimental results on the ConvAI2 persona-chat dataset demonstrate that our model performs better than commonly used baselines. Further analysis of the ablation experiments shows that EIPD can generate more persona-consistent and diverse responses.

Ruifang Wang, Ruifang He, Longbiao Wang, Yuke Si, Huanyu Liu, Haocheng Wang, Jianwu Dang

### Few-Shot NLU with Vector Projection Distance and Abstract Triangular CRF

The data sparsity problem is a key challenge for Natural Language Understanding (NLU), especially in a new target domain. By training an NLU model in source domains and applying the model to an arbitrary target domain directly (even without fine-tuning), few-shot NLU becomes crucial to mitigating the data scarcity issue. In this paper, we propose to improve prototypical networks with a vector projection distance and an abstract triangular Conditional Random Field (CRF) for few-shot NLU. The vector projection distance exploits projections of contextual word embeddings onto label vectors as word-label similarities, which is equivalent to a normalized linear model. The abstract triangular CRF learns domain-agnostic label transitions for joint intent classification and slot filling tasks. Extensive experiments demonstrate that our proposed methods can significantly surpass strong baselines. Specifically, our approach achieves a new state of the art on two few-shot NLU benchmarks (Few-Joint and SNIPS) in Chinese and English without fine-tuning on target domains.

Su Zhu, Lu Chen, Ruisheng Cao, Zhi Chen, Qingliang Miao, Kai Yu
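
The idea of using a projection as a word-label similarity can be sketched as below. This is a simplified reading of the abstract, not the authors' exact formulation: it computes only the scalar projection of an embedding onto a label vector (any bias or threshold term the paper may use is omitted), and the two-dimensional vectors and label names are invented for illustration.

```python
import math


def projection_similarity(x, label_vec):
    """Scalar projection of embedding x onto label_vec: (x . c) / ||c||.

    Dividing by the label vector's norm makes the score a linear function of x
    with a unit-length weight vector, i.e. a normalized linear model.
    """
    norm = math.sqrt(sum(c * c for c in label_vec))
    if norm == 0.0:
        raise ValueError("label vector must be non-zero")
    return sum(a * b for a, b in zip(x, label_vec)) / norm


x = [3.0, 4.0]  # hypothetical contextual word embedding
label_vectors = {"B-city": [1.0, 0.0], "O": [0.0, 2.0]}  # hypothetical label vectors

scores = {name: projection_similarity(x, vec) for name, vec in label_vectors.items()}
print(scores)
```

Because the label vector is normalized, rescaling it (for example from `[0.0, 2.0]` to `[0.0, 20.0]`) leaves the score unchanged. Only the direction of the label vector matters.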

### Cross-domain Slot Filling with Distinct Slot Entity and Type Prediction

Supervised learning approaches have been proven effective in slot filling, but they need massive labeled training data, which is expensive and time-consuming to obtain in a given domain. Recent models for cross-domain slot filling adopt a transfer learning framework to cope with the data scarcity problem. However, these cross-domain slot filling models rely on the same encoder representation in different stages for the slot entity task and the slot type task, which decreases the performance of both tasks. Besides, these models treat different source domains equally and ignore the shared slot-related information in different domains, which may damage the performance of cross-domain learning. In this paper, we present a pipeline approach for cross-domain slot filling (PCD) that learns distinct contextual representations for slot entity identification and slot type alignment, and fuses slot entity information at the input layer of the slot type alignment model to incorporate global context. Moreover, we add a simple yet effective instance weighting scheme ($\mathbf{Iw}$) to our approach to better capture the slot entities in the cross-domain setting. Experiments on multiple domains show that our approach achieves state-of-the-art performance in cross-domain slot filling. Ablation analysis and further experiments also demonstrate the effectiveness of each part of our model, especially in the identification of slot entities.

Shudong Liu, Peijie Huang, Zhanbiao Zhu, Hualin Zhang, Jianying Tan

### Semantic Enhanced Dual-Channel Graph Communication Network for Aspect-Based Sentiment Analysis

Aspect-based sentiment analysis (ABSA) is a fine-grained task that aims to determine the sentiment polarity of a given aspect in a sentence. Its main challenge is to model the relation between an aspect and its opinion words. Since analysis based on the dependency tree has its deficiencies, a Semantic Enhanced Dual-channel Graph Communication Network is proposed to address them. In our model, semantic information is captured to supplement syntactic features, while a communication mechanism and a hierarchical attention module are employed to obtain the word representation. The proposed model is evaluated on publicly available datasets. Experimental results reveal that our model significantly outperforms the baseline methods and achieves advanced results in ABSA tasks.

Zehao Yan, Shiguan Pang, Yun Xue

### Highway-Based Local Graph Convolution Network for Aspect Based Sentiment Analysis

Aspect-level sentiment analysis is a fine-grained task in sentiment analysis whose target is to identify the sentiment polarity of a specific aspect in a sentence. Due to the complexity of human language, the widely applied syntax-based neural network methods have deficiencies in precisely capturing the relation between aspects and opinion words, which results in misinterpreting the sentiment. To address this issue, we focus on optimizing the encoding of syntactic information. To start with, sub-dependency trees are constructed from the basic dependency tree according to syntactic distance. Further, we propose a novel Highway-Based Local Graph Convolution Network (HL-GCN) to capture the more relevant information and thus facilitate sentiment classification. Substantial experiments on a variety of datasets are performed. Compared to state-of-the-art methods, the proposed model is effective in eliminating noise from the dependency tree, which results in higher classification accuracy.

Shiguan Pang, Zehao Yan, Weihao Huang, Bixia Tang, Anan Dai, Yun Xue

### Dual Adversarial Network Based on BERT for Cross-domain Sentiment Classification

Cross-domain sentiment classification uses useful information from the source domain to improve classification accuracy in the target domain. Although previous approaches consider the effects of aspect information in the sentences, they lack a mechanism for syntactic constraints and may therefore mistakenly assign irrelevant words to aspects. In this paper, we propose a Dual Adversarial Network based on BERT (DAN-BERT), which can better transfer sentiment across domains by jointly learning the representation of sentences and aspect-based syntax. Specifically, DAN-BERT extracts the common features at the sentence level and the aspect-based syntax level by adversarial training. We learn the features of aspect-based syntax by building a Graph Convolutional Network over the dependency tree of a sentence. Experiments on four datasets show that DAN-BERT outperforms state-of-the-art methods.

Shaokang Zhang, Xu Bai, Lei Jiang, Huailiang Peng

### Syntax and Sentiment Enhanced BERT for Earliest Rumor Detection

With the rapid development of social media, rumors are becoming an increasingly significant problem. Although quite a few methods have been proposed recently, most of them rely on contextual information or the propagation pattern of reply posts. For some threatening rumors, we need to interrupt their transmission at the very beginning. To solve this problem, we propose Syntax and Sentiment Enhanced BERT (SSE-BERT), which can achieve superior performance based only on the source post. SSE-BERT can learn extra syntax and sentiment features from additional linguistic knowledge. Experimental results on two real-world datasets show that our method outperforms some state-of-the-art methods on earliest rumor detection. Furthermore, to alleviate the shortage of Chinese datasets, we collect a new rumor detection dataset, Weibo20 (the experimental resource is available at https://github.com/SeanMiao95/SSE-BERT ).

Xin Miao, Dongning Rao, Zhihua Jiang

### Aspect-Sentiment-Multiple-Opinion Triplet Extraction

Aspect Sentiment Triplet Extraction (ASTE) aims to extract aspect term (aspect), sentiment, and opinion term (opinion) triplets from sentences, so that each triplet tells a complete story: the discussed aspect, the sentiment toward the aspect, and the cause of the sentiment. ASTE is an appealing task. However, one triplet extracted by ASTE includes only one opinion of the aspect, whereas an aspect in a sentence may have multiple corresponding opinions, and one opinion provides only part of the reason why the aspect has this sentiment. As a consequence, some triplets extracted by ASTE are hard to understand and provide erroneous information for downstream tasks. In this paper, we introduce a new task named Aspect Sentiment Multiple Opinions Triplet Extraction (ASMOTE). ASMOTE aims to extract aspect, sentiment, and multiple opinions triplets. Specifically, one triplet extracted by ASMOTE contains all opinions about the aspect and can tell the exact reason that the aspect has the sentiment. We propose an Aspect-Guided Framework (AGF) to address this task. AGF first extracts aspects, then predicts their opinions and sentiments. Moreover, with the help of the proposed Sequence Labeling Attention (SLA), AGF improves the performance of sentiment classification using the extracted opinions. Experimental results on multiple datasets demonstrate the effectiveness of our approach (data and code can be found at https://github.com/l294265421/ASMOTE ).

Fang Wang, Yuncong Li, Sheng-hua Zhong, Cunxiang Yin, Yancheng He

### Locate and Combine: A Two-Stage Framework for Aspect-Category Sentiment Analysis

Aspect category sentiment classification aims at predicting the sentiment polarity of a given aspect category. Since the aspect category may not occur in the sentence, it is hard for the model to directly find the appropriate sentiment words for the aspect category and disregard unrelated ones. To address this, previous works have explored implicitly leveraging the information of the aspect term in the sentence and demonstrated the effectiveness of such information. Inspired by this conclusion, we propose a two-stage strategy named Locate-Combine (LC) to utilize the aspect term in a more straightforward way: it first locates the aspect term and then takes it as the bridge to find the related sentiment words. Specifically, in the “Locate” stage, we locate the aspect term corresponding to the given aspect category in the sentence, which can crystallize the target and further enable our model to focus on the target-related words. In the “Combine” stage, we first apply a graph convolutional network (GCN) over the dependency tree of the sentence to combine the information of the aspect term and related sentiment words, and then take the output representation corresponding to the located aspect term to predict the sentiment polarity. The experimental results on public datasets show that the proposed two-stage strategy is effective and achieves state-of-the-art performance. Furthermore, our model can output explainable intermediate results for model analysis. (Code can be found at https://github.com/SCIR-MSA-Team/LC-ACSA )

Yang Wu, Zhenyu Zhang, Yanyan Zhao, Bing Qin

### Emotion Classification with Explicit and Implicit Syntactic Information

Emotion classification has become a hot research topic in natural language processing due to its wide application. Existing studies suffer from the error propagation problem when using syntax information in emotion classification, since the parser cannot produce perfect syntax trees. To address this problem, we propose a new approach that compares and combines different levels of syntactic information to make full use of it and alleviate error propagation. First, we propose to use graph convolutional networks (GCN) to encode dependency trees, in which the probability matrix of all dependency arcs (an edge-weighted graph) is treated as the GCN adjacency matrix. Next, we extract the hidden representations of the dependency parser encoder as implicit syntactic representations, which directly avoids the error propagation problem. Finally, we fuse the two kinds of syntax-aware information and inject them into our baseline model as extra inputs. Further experimental results show that the explicit and implicit syntactic information can improve the performance of a BERT-based system, which is much stronger than the baseline. In addition, we find that the syntactic knowledge that BERT can express is limited and that the syntactic information of our model brings additional contributions, which makes our model consistently outperform BERT on different sentence lengths.

Nan Chen, Qingrong Xia, Xiabing Zhou, Wenliang Chen, Min Zhang

### MUMOR: A Multimodal Dataset for Humor Detection in Conversations

Humor detection attracts increasing attention in natural language processing for its potential applications. Prior work focuses on analyzing humor in isolated, textual data, but humor usually comes from the interaction among speakers in a multimodal way. In this paper, we propose a novel dataset named MUMOR, which consists of multimodal dialogues in both English and Chinese. It contains a total of 29,585 utterances belonging to 1,298 dialogues from two TV sitcoms. We manually annotated each utterance with humor, emotion, and sentiment labels. To the best of our knowledge, this is the first corpus containing Chinese conversations for humor detection. This dataset could be used for research on humor detection, humor generation, and multi-task learning on emotion and humor analysis. We have released this dataset publicly.

Jiaming Wu, Hongfei Lin, Liang Yang, Bo Xu

### NLP Applications and Text Mining

#### Frontmatter

The rapid development of social media has brought prosperity to the online economy. Recently, product promotion in social networks has become an essential form of online marketing. As one of the most common marketing means, Content Marketing (CM) inserts advertisements into regular articles in a roundabout and covert way. However, the values and characteristics of products are often exaggerated to attract users’ attention. This can cause severe economic losses to users and damage the creditworthiness of the platforms. In this paper, we model the problem of advertisement extraction from CM articles as a sentence classification task. We propose a topic-enhanced deep neural network to encode the semantic information of a sentence for classification. Motivated by the characteristics of CM articles, we develop a segment-aware optimization method that considers the label transitions of sentences in different segments of an article to improve the performance of the classifier. Experimental results based on real-world datasets demonstrate the superiority of the proposed method over state-of-the-art approaches.

Xiaoming Fan, Chenxu Wang

### Leveraging Lexical Common-Sense Knowledge for Boosting Bayesian Modeling

Recent research has shown that, since the ultimate goal of the Bayesian perspective is to infer a posterior distribution, it is demonstrably more direct to impose domain knowledge on the posterior distribution itself. This paper presents a model that imposes lexical common-sense knowledge as constraints on the posterior distribution, under the conventional regularized Bayesian framework. We then improve latent topic modeling with the help of this model, and experimental results show that combining lexical common-sense knowledge and Bayesian modeling is beneficial for prediction.

Yashen Wang

### Aggregating Inter-viewpoint Relationships of User’s Review for Accurate Recommendation

User reviews contain rich information about user interests in items. Recently, many deep learning methods have attempted to integrate review contents into user preference rating prediction, helping to solve data sparseness problems. However, existing methods suffer from an inherent limitation: many reviews are noisy and contain non-consecutive viewpoints, and the methods are insufficient to capture inter-viewpoint relationships. Incorporating useful information is helpful for more accurate recommendations. In this paper, we propose a neural recommendation approach with a Diversity Penalty mechanism and Capsule Networks, named DPCN. Specifically, the diversity penalty component employs weight distributed matrices with a penalization term to capture different viewpoints in textual reviews. The capsule networks are designed to aggregate individual viewpoint vectors to form high-level feature representations for feature interaction. Then we combine the review feature representations with the user and item ID embeddings for final rating prediction. Extensive experiments on five real-world datasets validate the effectiveness of our approach.

Xingchen He, Yidong Chen, Guocheng Zhang, Xuling Zheng

### A Residual Dynamic Graph Convolutional Network for Multi-label Text Classification

Recent studies often utilize the Graph Convolutional Network (GCN) to learn label dependency features for the multi-label text classification (MLTC) task. However, constructing a static label graph according to pairwise co-occurrence in the training datasets may degrade the generalizability of the model. In addition, GCN-based methods suffer from the problem of over-smoothing. To this end, we propose a Residual Dynamic Graph Convolutional Network Model (RDGCN) ( https://github.com/ilove-Moretz/RDGCN.git ), which adopts a label attention mechanism to learn label-specific representations and then constructs a dynamic label graph for each given instance. Furthermore, we devise a residual connection to alleviate the over-smoothing problem. To verify the effectiveness of our model, we conduct comprehensive experiments on two benchmark datasets. The experimental results show the superiority of our proposed model.

Bingquan Wang, Jie Liu, Shaowei Chen, Xiao Ling, Shanpeng Wang, Wenzheng Zhang, Liyi Chen, Jiaxin Zhang
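To make the idea of a residual connection in a graph convolution concrete, here is a minimal NumPy sketch of a generic GCN layer with a skip connection. It is not the RDGCN implementation: the normalization choice, the ReLU activation, and the names `normalize_adjacency` and `residual_gcn_layer` are all assumptions made for illustration.

```python
import numpy as np

def normalize_adjacency(adj):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} with self-loops."""
    adj_hat = adj + np.eye(adj.shape[0])
    degree = adj_hat.sum(axis=1)          # >= 1 because of the self-loops
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    return adj_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def residual_gcn_layer(features, adj, weight):
    """One graph convolution followed by a residual (skip) connection.

    features: (num_labels, dim) label representations
    adj:      (num_labels, num_labels) non-negative label graph
    weight:   (dim, dim) square, so input and output can be added
    """
    propagated = normalize_adjacency(adj) @ features @ weight
    # Adding the input back keeps node-specific information that
    # repeated neighborhood averaging would otherwise smooth away.
    return features + np.maximum(propagated, 0.0)

rng = np.random.default_rng(0)
label_features = rng.normal(size=(4, 8))   # hypothetical: 4 labels, 8 dims
label_graph = rng.random(size=(4, 4))      # hypothetical per-instance graph
layer_weight = rng.normal(size=(8, 8)) * 0.1
print(residual_gcn_layer(label_features, label_graph, layer_weight).shape)
```

Because the input is added to the output, stacking several such layers does not force all label representations toward the same vector, which is the usual motivation for residual connections against over-smoothing.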

### Sentence Ordering by Context-Enhanced Pairwise Comparison

Sentence ordering is the task of arranging a given unordered text into the correct order. A feasible approach is to use neural networks to predict the relative order of all sentence pairs and then organize the sentences into a coherent paragraph with topological sort. However, current methods rarely utilize the context information, which is essential for deciding the relative order of a sentence pair. Based on this observation, we propose an efficient approach, the context-enhanced pairwise comparison network (CPCN), that leverages both the context and sentence pair information in a post-fusion manner to order a sentence pair. To obtain the paragraph context embedding, CPCN first utilizes BERT to encode all sentences, then aggregates them using a Transformer followed by an average pooling layer. Finally, CPCN predicts the relative order of the sentence pair from the concatenation of the paragraph embedding and the sentence pair embedding. Our experiments on three benchmark datasets, SIND, NIPS and AAN, show that our model significantly outperforms all existing models and achieves new state-of-the-art performance, which demonstrates the effectiveness of incorporating context information.

Haowei Du, Jizhi Tang, Dongyan Zhao
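The second stage mentioned in the abstract above, turning pairwise order predictions into a full ordering with topological sort, can be sketched as follows. This is a generic Kahn's-algorithm sketch, not the authors' code; the function name `order_sentences` and the example predictions are hypothetical, and the pairwise predictions are assumed to be given as `(i, j)` pairs meaning "sentence `i` comes before sentence `j`".

```python
from collections import defaultdict, deque

def order_sentences(num_sentences, precedes):
    """Topologically sort sentence indices with Kahn's algorithm.

    precedes: iterable of (before, after) index pairs, e.g. the
    thresholded output of a pairwise ordering classifier.
    """
    successors = defaultdict(list)
    in_degree = [0] * num_sentences
    for before, after in precedes:
        successors[before].append(after)
        in_degree[after] += 1

    # Start with sentences that nothing is predicted to precede.
    ready = deque(i for i in range(num_sentences) if in_degree[i] == 0)
    ordering = []
    while ready:
        current = ready.popleft()
        ordering.append(current)
        for nxt in successors[current]:
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:
                ready.append(nxt)

    if len(ordering) != num_sentences:
        # Inconsistent predictions (a cycle) leave some sentences unplaced.
        raise ValueError("pairwise predictions contain a cycle")
    return ordering

# Hypothetical predictions for three shuffled sentences.
pairs = [(2, 0), (0, 1), (2, 1)]
print(order_sentences(3, pairs))  # [2, 0, 1]
```

A real pairwise classifier can produce contradictory predictions, so a practical system needs some way to break cycles; this sketch simply raises an error in that case.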

### A Dual-Attention Neural Network for Pun Location and Using Pun-Gloss Pairs for Interpretation

Pun location is to identify the punning word (usually a word or a phrase that makes the text ambiguous) in a given short text, and pun interpretation is to find the two different meanings of the punning word. Most previous studies adopt limited word senses obtained by a WSD (Word Sense Disambiguation) technique, or pronunciation information in isolation, to address pun location. For the task of pun interpretation, related work pays attention to various WSD algorithms. In this paper, a model called DANN (Dual-Attentive Neural Network) is proposed for pun location; it effectively integrates word senses and pronunciation with context information to address two kinds of pun at the same time. Furthermore, we treat pun interpretation as a classification task and construct pun-gloss pairs as processing data to solve this task. Experiments on the two benchmark datasets show that our proposed methods achieve new state-of-the-art results. Our source code is available in the public code repository (https://github.com/LawsonAbs/pun).

Shen Liu, Meirong Ma, Hao Yuan, Jianchao Zhu, Yuanbin Wu, Man Lan

### A Simple Baseline for Cross-Domain Few-Shot Text Classification

Few-shot text classification has been largely explored due to its remarkable few-shot generalization ability to in-domain novel classes. Yet, the generalization ability of existing models to cross-domain novel classes has seldom been studied. To fill the gap, we investigate a new task, called cross-domain few-shot text classification (XFew), and present a simple baseline that exhibits an appealing cross-domain generalization capability while retaining a good in-domain generalization capability. Experiments are conducted on two datasets under both in-domain and cross-domain settings. The results show that current few-shot text classification models lack a mechanism to account for potential domain shift in the XFew task. In contrast, our proposed simple baseline achieves surprisingly superior results in comparison with other models in cross-domain scenarios, confirming the need for further research on the XFew task and providing insights for possible directions. (The code and datasets are available at https://github.com/GeneZC/XFew.)

Chen Zhang, Dawei Song

### Shared Component Cross Punctuation Clauses Recognition in Chinese

The NT (Naming-telling) Clause Complex Framework defines clause complex structures through component sharing and logic-semantic relationships. In this paper, we formalize component sharing recognition as a multi-span extraction problem in machine learning, and we propose a model with a mask strategy to recognize the shared components of cross-punctuation clauses based on pre-trained models. Furthermore, we present a Chinese Long-distance Shared Component Recognition Dataset (LSCR) covering four domains, including 43k texts and 156k shared components that need to be predicted. Experimental results and analysis show that our model outperforms previous methods by a large margin. All the code and the dataset are available at https://github.com/smiletm/LSCR.

Xiang Liu, Ruifang Han, Shuxin Li, Yujiao Han, Mingming Zhang, Zhilin Zhao, Zhiyong Luo

### BERT-KG: A Short Text Classification Model Based on Knowledge Graph and Deep Semantics

Chinese short text classification is one of the increasingly significant tasks in Natural Language Processing (NLP). Unlike documents and paragraphs, short text faces the problems of shortness, sparseness, non-standardization, etc., which bring enormous challenges for traditional classification methods. In this paper, we propose a novel model named BERT-KG, which can classify Chinese short text promptly and accurately and overcome the difficulty of short text classification. BERT-KG enriches short text features by obtaining background knowledge from the knowledge graph and further embeds the three-tuple information of the target entity into a BERT-based model. Then we fuse the dynamic word vector with the knowledge of the short text to form a feature vector for the short text. Finally, the learned feature vector is input into the Softmax classifier to obtain a target label for the short text. Extensive experiments conducted on two real-world datasets demonstrate that BERT-KG significantly improves the classification performance compared with state-of-the-art baselines.

Yuyanzhen Zhong, Zhiyang Zhang, Weiqi Zhang, Juyi Zhu

### Uncertainty-Aware Self-paced Learning for Grammatical Error Correction

Recently, pre-trained language models have made dramatic progress on the grammatical error correction (GEC) task by fine-tuning on a small amount of annotated data. However, the current approaches ignore two problems. On the one hand, the GEC datasets suffer from annotation errors, which may impair the performance of the model. On the other hand, the correction difficulty varies across sentences, and the generation difficulty of each token within a sentence is inconsistent as well. Therefore, hard and easy samples in the GEC task should be treated differently. To address these issues, we propose an uncertainty-aware self-paced learning framework for the GEC task. We leverage Bayesian deep learning to mine and filter noisy samples in the training set. Besides, we design a confidence-based self-paced learning strategy to dynamically adjust the loss weights of hard and easy samples. Specifically, we measure the confidence score of the model on the samples at the token level and the sentence level, and schedule the training procedure according to the confidence scores. Extensive experiments demonstrate that the proposed approach surpasses the baseline model by 2.0+ points of $$F_{0.5}$$ score on several GEC datasets, demonstrating the effectiveness of our approach.

Kai Dang, Jiaying Xie, Jie Liu, Shaowei Chen
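For readers unfamiliar with the metric reported in the abstract above, $$F_{0.5}$$ is the standard $$F_\beta$$ score with $$\beta = 0.5$$, which weights precision $$P$$ more heavily than recall $$R$$. The general definition and its specialization are given below; the numbers in the worked example are hypothetical and not taken from the paper.

$$F_\beta = (1 + \beta^2)\,\frac{P \cdot R}{\beta^2 P + R}, \qquad F_{0.5} = 1.25\,\frac{P \cdot R}{0.25\,P + R}$$

For example, with $$P = 0.6$$ and $$R = 0.4$$, $$F_{0.5} = 1.25 \cdot 0.24 / (0.15 + 0.4) = 0.30 / 0.55 \approx 0.545$$, which lies closer to the precision than the balanced $$F_1 = 0.48$$ does.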

### Metaphor Recognition and Analysis via Data Augmentation

Metaphoric expression is widespread and frequently used to convey emotions. When it comes to metaphor recognition and analysis, there are still not enough samples for these tasks. In this study, we aim to recognize verb metaphors and analyze their emotions via data augmentation. To this end, we first propose a sentence reconstruction method that prunes the dependency parse tree and thus alleviates the disturbances caused by noisy information. Then, data augmentation strategies are proposed based on a Seq2Seq model and the reconstructed sentences, which generate sufficient candidate samples after an effective quality evaluation. Finally, the proposed model is trained with the extended dataset, and it performs recognition and emotion analysis for metaphors. Experiments are conducted on Chinese and English metaphor corpora respectively, and the results show that our proposed model performs best compared with the baseline methods.

Liang Yang, Jingjie Zeng, Shuqun Li, Zhexu Shen, Yansong Sun, Hongfei Lin

### Exploring Generalization Ability of Pretrained Language Models on Arithmetic and Logical Reasoning

To quantitatively and intuitively explore the generalization ability of pre-trained language models (PLMs), we have designed several tasks of arithmetic and logical reasoning. We analyse how well PLMs generalize both when the test data is in the same distribution as the training data and when it is different. For the latter analysis, we have also designed a cross-distribution test set in addition to the in-distribution test set. We conduct experiments on one of the most advanced publicly released generative PLMs, BART. Our research finds that PLMs can easily generalize when the distribution is the same; however, it is still difficult for them to generalize out of distribution.

Cunxiang Wang, Boyuan Zheng, Yuchen Niu, Yue Zhang

### Skeleton-Based Sign Language Recognition with Attention-Enhanced Graph Convolutional Networks

The natural language processing of sign language is an important task in the field of artificial intelligence and information processing. In this paper, we propose attention-enhanced graph convolutional networks (AEGCNs) for sign language recognition (SLR). First, there are four kinds of adaptive graphs for graph convolution, and each graph topology can be either uniformly or individually learned based on the skeleton data in an end-to-end manner. In addition, we employ spatial-temporal-channel attention mechanisms to give higher weight to the relatively important joints, frames and features, and a higher-order connection with Chebyshev polynomial approximation to enlarge the receptive field of graph convolution. Meanwhile, the information of both the joints and bones is modeled simultaneously in one framework, which further improves the representation of hand and finger movement. Finally, experiments on the DEVISIGN-D, DSL50 and ASL20 datasets show that the top-1 accuracies on the three datasets reach 82.96%, 95.09% and 90.23% respectively, and the top-5 accuracies reach 96.07%, 99.18% and 100% respectively. Compared with ST-GCN and BHOF, AEGCNs obtain significant accuracy improvements of +33.5% and +5.32% respectively on the ASL20 dataset, which demonstrates the effectiveness of our method on SLR.

Wuyan Liang, Xiaolong Xu
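The top-1 and top-5 accuracies quoted in the abstract above follow the usual definition: a sample counts as correct if its true class is among the k highest-scoring predictions. A minimal NumPy sketch of that metric is shown below; the function name `top_k_accuracy` and the toy scores are hypothetical and unrelated to the reported results.

```python
import numpy as np

def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k top-scoring classes.

    scores: (num_samples, num_classes) classifier scores
    labels: (num_samples,) integer class ids
    """
    # Sort class indices by descending score and keep the first k per row.
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = (top_k == labels[:, None]).any(axis=1)
    return float(hits.mean())

# Toy example with three samples and three sign classes.
scores = np.array([
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
])
labels = np.array([1, 1, 0])

print(top_k_accuracy(scores, labels, k=1))  # 1 of 3 samples correct
print(top_k_accuracy(scores, labels, k=2))  # 2 of 3 samples correct
```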

### XGPT: Cross-modal Generative Pre-Training for Image Captioning

In this paper, we propose XGPT, a new method of Cross-modal Generative Pre-Training for Image Captioning that is designed to pre-train text-to-image caption generators through four novel generation tasks, including Adversarial Image Captioning (AIC), Image-conditioned Masked Language Modeling (IMLM), Image-conditioned Denoising Autoencoding (IDA), and Text-conditioned Image Feature Generation (TIFG). As a result, the pre-trained XGPT can obtain new state-of-the-art results on the benchmark datasets, including COCO Captions and Flickr30k Captions. We also use XGPT to generate image captions as data augmentation for the image retrieval task and achieve significant improvement on all recall metrics.

Qiaolin Xia, Haoyang Huang, Nan Duan, Dongdong Zhang, Lei Ji, Zhifang Sui, Edward Cui, Taroon Bharti, Ming Zhou

### An Object-Extensible Training Framework for Image Captioning

Recent years have witnessed great progress in image captioning based on deep learning. However, most previous methods are limited to the original training dataset, which contains only a fraction of the objects in the real world. They lack the ability to describe other objects that are not in the original training dataset. In this paper, we propose an object-extensible training framework that enables a widely used captioning paradigm to describe objects beyond the original training dataset (i.e., extended objects) by automatically generating high-quality training data for these objects. Specifically, we design a general replacement mechanism, which replaces the object (an object includes the object region in the image and the corresponding object word in the caption) in the original training dataset with the extended object to generate new training data. The key challenge in the proposed replacement mechanism is that it should be context-aware in order to produce a meaningful result that complies with common knowledge. We introduce a multi-modal context embedding to ensure that the generated object representation is coherent in the visual context and the generated caption is smooth and fluent in the linguistic context. Extensive experiments show that our method improves significantly over the state-of-the-art methods on the held-out MSCOCO in both automatic and human evaluation.

Yike Wu, Ying Zhang, Xiaojie Yuan

### Relation-Aware Multi-hop Reasoning for Visual Dialog

Visual dialog is a multi-modal task that requires a dialog agent to answer a series of progressive questions grounded in an image. In this paper, we propose a Relation-aware Multi-hop Reasoning Network (R2N) for visual dialog tasks, which performs multi-hop reasoning during the visual co-reference resolution process in a recurrent way. At each hop, in order to fully understand the visual scene in the image, a Relation-aware Graph Attention Network is used, which encodes each image into graphs with multi-type inter-object relations via a graph attention mechanism. Moreover, we find that an auxiliary clustering mechanism on answer candidates is conducive to the model's performance. Experimental results on the VisDial v1.0 dataset demonstrate that the proposed model is effective and outperforms the compared models.

Yao Zhao, Lu Chen, Kai Yu

### Multi-modal Sarcasm Detection Based on Contrastive Attention Mechanism

In the past decade, sarcasm detection has been intensively studied in textual scenarios. With the popularization of video communication, analysis in multi-modal scenarios has received much attention in recent years. Therefore, multi-modal sarcasm detection, which aims at detecting sarcasm in video conversations, has become increasingly popular in both the natural language processing community and the multi-modal analysis community. In this paper, considering that sarcasm is often conveyed through incongruity between modalities (e.g., the text expresses a compliment while the acoustic tone indicates a grumble), we construct a Contrastive-Attention-based Sarcasm Detection (ConAttSD) model, which uses an inter-modality contrastive attention mechanism to extract several contrastive features for an utterance. A contrastive feature represents the incongruity of information between two modalities. Our experiments on MUStARD, a benchmark multi-modal sarcasm dataset, demonstrate the effectiveness of the proposed ConAttSD model.

Xiaoqiang Zhang, Ying Chen, Guangyuan Li

### Backmatter
