2022 | Book

Natural Language Processing and Chinese Computing

11th CCF International Conference, NLPCC 2022, Guilin, China, September 24–25, 2022, Proceedings, Part II

About this Book

This two-volume set of LNAI 13551 and 13552 constitutes the refereed proceedings of the 11th CCF International Conference on Natural Language Processing and Chinese Computing, NLPCC 2022, held in Guilin, China, in September 2022.

The 62 full papers, 21 poster papers, and 27 workshop papers presented were carefully reviewed and selected from 327 submissions. They are organized in the following areas: Fundamentals of NLP; Machine Translation and Multilinguality; Machine Learning for NLP; Information Extraction and Knowledge Graph; Summarization and Generation; Question Answering; Dialogue Systems; Social Media and Sentiment Analysis; NLP Applications and Text Mining; and Multimodality and Explainability.

Table of Contents

Frontmatter

Question Answering (Poster)

Frontmatter
Faster and Better Grammar-Based Text-to-SQL Parsing via Clause-Level Parallel Decoding and Alignment Loss

As a mainstream approach, grammar-based models have achieved high performance in the text-to-SQL parsing task, but they suffer from low decoding efficiency, since the number of actions for building SQL trees is much larger than the number of tokens in SQL queries. Meanwhile, incorporating alignment information between SQL clauses and question segments is intuitively beneficial for parsing performance. This paper proposes clause-level parallel decoding and an alignment loss to enhance two high-performance grammar-based parsers, i.e., RATSQL and LGESQL. Experiments on the Spider dataset show that our approach improves the decoding speed of RATSQL and LGESQL by 18.9% and 35.5% respectively, and also achieves consistent improvements in parsing accuracy, especially on complex questions.

Kun Wu, Lijie Wang, Zhenghua Li, Xinyan Xiao
Two-Stage Query Graph Selection for Knowledge Base Question Answering

Finding the best answer to a question in Knowledge Base Question Answering (KBQA) is always challenging due to the enormous search space and the interactive performance requirements. A typical solution is to retrieve the answer by finding the optimal query graph, a sub-graph of the knowledge graph. However, existing methods usually generate a considerable number of sub-graph candidates and then fail to find the optimal one effectively, resulting in a significant gap between top-1 performance and the oracle score over all graph candidates. To address this issue, this paper presents a novel two-stage method based on the idea of first reducing the candidates to a shortlist, and then selecting the optimal one from it. Before selection, we generate many, often hundreds of, candidates for each question. In the first stage, we sort the candidates and select a small set of query graphs (top-k); in the second stage, we rerank them to select the final answer. We evaluate our system on both English and Chinese data, and the results show that our proposed two-stage method achieves competitive performance on all datasets. (Our code is publicly available at https://github.com/EnernityTwinkle/KBQA-QueryGraphSelection.)

Yonghui Jia, Chuanyuan Tan, Yuehe Chen, Muhua Zhu, Pingfu Chao, Wenliang Chen
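
The select-then-rerank idea above fits in a few lines. Below is a minimal sketch under stated assumptions: `rank_score` and `rerank_score` are hypothetical placeholders for the paper's trained stage-1 ranker and stage-2 reranker.

```python
# Sketch of two-stage query graph selection (scorers are placeholders,
# not the paper's actual models).
from typing import Callable, List

def two_stage_select(
    candidates: List[str],
    rank_score: Callable[[str], float],    # cheap stage-1 scorer
    rerank_score: Callable[[str], float],  # stronger stage-2 scorer
    k: int = 10,
) -> str:
    # Stage 1: sort the (often hundreds of) candidates and keep the top-k.
    shortlist = sorted(candidates, key=rank_score, reverse=True)[:k]
    # Stage 2: rerank the shortlist and return the final query graph.
    return max(shortlist, key=rerank_score)
```

The cheap stage-1 scorer prunes hundreds of candidates down to k, so the expensive reranker only ever sees a shortlist.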
Plug-and-Play Module for Commonsense Reasoning in Machine Reading Comprehension

Conventional Machine Reading Comprehension (MRC) has been well addressed by pattern matching, but commonsense reasoning ability remains a gap between humans and machines. Previous methods tackle this problem by enriching word representations via pre-trained Knowledge Graph Embeddings (KGE). However, they make limited use of the large number of connections between nodes in Knowledge Graphs (KG), which can be pivotal cues for building commonsense reasoning chains. In this paper, we propose a Plug-and-play module to IncorporatE Connection information for commonsEnse Reasoning (PIECER). Beyond enriching word representations with knowledge embeddings, PIECER constructs a joint query-passage graph to explicitly guide commonsense reasoning via the knowledge-oriented connections between words. Further, PIECER has high generalizability, since it can be plugged into any MRC model. Experimental results on ReCoRD, a large-scale public MRC dataset requiring commonsense reasoning, show that PIECER introduces stable performance improvements for four representative base MRC models, especially in low-resource settings. (The code is available at https://github.com/Hunter-DDM/piecer.)

Damai Dai, Hua Zheng, Zhifang Sui, Baobao Chang

Social Media and Sentiment Analysis (Poster)

Frontmatter
FuDFEND: Fuzzy-Domain for Multi-domain Fake News Detection

On the Internet, fake news exists in various domains (e.g., education, health). Since news in different domains has different features, researchers have recently begun to use single-domain labels for fake news detection. Existing works show that using a single-domain label can improve the accuracy of fake news detection models. However, there are two problems in previous works. First, they ignore that a piece of news may have features from several domains; a single-domain label focuses only on the features of one domain, which may reduce the performance of the model. Second, their models cannot transfer domain knowledge to other datasets without domain labels. In this paper, we propose a novel model, FuDFEND, which addresses these limitations by introducing a fuzzy inference mechanism. Specifically, FuDFEND utilizes a neural network to fit the fuzzy inference process, which constructs a fuzzy domain label for each news item. The feature extraction module then uses the fuzzy domain label to extract the multi-domain features of the news and obtain the total feature representation. Finally, the discriminator module uses the total feature representation to discriminate whether the news item is fake. The results on Weibo21 show that our model works better than models using only single-domain labels. In addition, our model transfers domain knowledge better to the Thu dataset, which has no domain labels.

Chaoqi Liang, Yu Zhang, Xinyuan Li, Jinyu Zhang, Yongqi Yu
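
For intuition, a fuzzy domain label can be pictured as a soft distribution over domains produced by a small gating network, in contrast to a one-hot single-domain label. The PyTorch sketch below is an illustrative assumption on our part: the module name `FuzzyDomainGate`, the single linear layer, and the sizes are not the paper's actual architecture.

```python
# Illustrative sketch: a gate that turns a news embedding into a fuzzy
# domain label (a distribution over domains), rather than a hard label.
import torch
import torch.nn as nn

class FuzzyDomainGate(nn.Module):
    def __init__(self, hidden: int = 768, n_domains: int = 9):
        super().__init__()
        self.gate = nn.Linear(hidden, n_domains)

    def forward(self, news_emb: torch.Tensor) -> torch.Tensor:
        # Softmax yields fuzzy domain membership: each row sums to 1,
        # so one news item can belong to several domains at once.
        return torch.softmax(self.gate(news_emb), dim=-1)

gate = FuzzyDomainGate()
fuzzy_label = gate(torch.randn(4, 768))  # shape (batch, n_domains)
```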

NLP Applications and Text Mining (Poster)

Frontmatter
Continuous Prompt Enhanced Biomedical Entity Normalization

Biomedical entity normalization (BEN) aims to link entity mentions in a biomedical text to referent entities in a knowledge base. Recently, the paradigm of large-scale language model pre-training and fine-tuning has achieved superior performance on the BEN task. However, pre-trained language models like SAPBERT [21] typically contain hundreds of millions of parameters, and fine-tuning all of them is computationally expensive. Recent research on prompting techniques aims to reduce the number of parameters tuned during model training. Therefore, we propose Prompt-BEN, a framework using continuous Prompts to enhance BEN, which only needs to fine-tune the few parameters of the prompt. Our method employs embeddings with a continuous prefix prompt to capture the semantic similarity between mentions and terms. We also design a contrastive loss with a synonym marginalization strategy for the BEN task. Experimental results on three benchmark datasets demonstrate that our method achieves competitive or even better linking accuracy than state-of-the-art fine-tuning-based models, while having about 600 times fewer tuned parameters.

Zhaohong Lai, Biao Fu, Shangfei Wei, Xiaodong Shi
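
A continuous prefix prompt of this kind is typically a handful of trainable vectors prepended to a frozen encoder's input embeddings, so only those vectors are updated during training. The following is a generic sketch of that pattern, not the paper's exact implementation.

```python
# Generic continuous (prefix) prompt sketch: only `self.prefix` is
# trainable; the encoder producing `token_embeds` stays frozen.
import torch
import torch.nn as nn

class PrefixPrompt(nn.Module):
    def __init__(self, prefix_len: int = 10, hidden: int = 768):
        super().__init__()
        self.prefix = nn.Parameter(torch.randn(prefix_len, hidden) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq_len, hidden) from the frozen encoder's
        # embedding layer; the prefix is broadcast across the batch.
        batch = token_embeds.size(0)
        prefix = self.prefix.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prefix, token_embeds], dim=1)
```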
Bidirectional Multi-channel Semantic Interaction Model of Labels and Texts for Text Classification

Text classification, which aims to discover corresponding relationships between labels and texts, is a pivotal task in Natural Language Processing (NLP). Existing joint text-label models help input texts establish early global category-semantic awareness via label embedding techniques, but they cannot simultaneously capture literal and semantic relationships between texts and labels. This may lead models to ignore obvious clues or semantic relations on different cognitive levels. In this paper, we propose a Bidirectional Multi-channel semantic Interaction model (BMI) to handle both explicit and implicit category semantics in texts for text classification. On the explicit semantic level, BMI designs a word-representation similarity match channel for shallow interaction to avoid semantic mismatch, based on the assumption that words can have different meanings under the same context. On the implicit semantic level, BMI provides a novel attended attention mechanism over texts and labels for deep interaction, modeling bidirectional text explanation for labels and label guidance for texts. Furthermore, a gated residual mechanism is employed to obtain the core information of labels to improve efficiency. Experiments on benchmark datasets show that BMI achieves competitive results over 15 strong baseline methods, especially in the case of short texts.

Yuan Wang, Yubo Zhou, Peng Hu, Maoling Xu, Tingting Zhao, Yarui Chen
Exploiting Dynamic and Fine-grained Semantic Scope for Extreme Multi-label Text Classification

Extreme multi-label text classification (XMTC) refers to the problem of tagging a given text with the most relevant subset of labels from a large label set. Due to the large label dimensionality in XMTC, a majority of labels have only a few training instances. To address this data sparsity issue, most existing XMTC methods rely on fixed label clusters obtained at an early stage to balance performance on tail labels and head labels. However, such label clusters provide a static and coarse-grained semantic scope for every text, which ignores the distinct characteristics of different texts and has difficulty modeling an accurate semantic scope for texts with tail labels. In this paper, we propose a novel framework, TReaderXML, for XMTC, which adopts a dynamic and fine-grained semantic scope from teacher knowledge for each individual text to optimize the text's conditional prior category semantic ranges. TReaderXML dynamically obtains teacher knowledge for each text from similar texts and hierarchical label information in the training set, releasing the ability to model a distinctly fine-grained label-oriented semantic scope. TReaderXML then benefits from a novel dual cooperative network that first learns features of a text and its corresponding label-oriented semantic scope through parallel Encoding and Reading Modules, then combines the two parts via an Interaction Module to regularize the text's representation by its dynamic and fine-grained label-oriented semantic scope, and finally finds target labels through a Prediction Module. Experimental results on three XMTC benchmark datasets show that our method achieves new state-of-the-art results and performs especially well on severely imbalanced and sparse datasets.

Yuan Wang, Huiling Song, Peng Huo, Tao Xu, Jucheng Yang, Yarui Chen, Tingting Zhao
MGEDR: A Molecular Graph Encoder for Drug Recommendation

Recently, drug recommendation tasks have been widely adopted in intelligent healthcare. Most existing methods utilize patients' electronic health records (EHRs) for medical prediction. However, existing algorithms neglect the description of the patient's health status, which makes it difficult to adapt to patients' dynamic conditions, and they ignore the intrinsic encoding of drug molecular structure, resulting in weak drug recommendation performance. To fill this gap, we propose a molecular graph encoder for drug recommendation, named MGEDR, to capture the genuine health status of patients. Furthermore, we encode the drug molecular graph and functional groups separately to obtain subtle drug representations, and we design a degree encoder and a functional-group encoder to capture the intrinsic features of the molecule effectively. Our experimental results show that our proposed MGEDR framework performs significantly better than state-of-the-art baseline methods.

Kaiyuan Shi, Shaowu Zhang, Haifeng Liu, Yijia Zhang, Hongfei Lin

Student Workshop (Poster)

Frontmatter
Semi-supervised Protein-Protein Interactions Extraction Method Based on Label Propagation and Sentence Embedding

Protein-protein interaction (PPI) plays an extremely vital role in almost all life activities, and the study of PPI has always been an important issue in biomedicine. Extracting PPI information from the literature can provide meaningful references for related research. Building an automated PPI extraction system requires labeled corpora; however, labeled corpora are very limited, and annotating them is a time-consuming, labor-intensive, and costly task. In contrast, unlabeled data is abundant and easy to obtain, so applying semi-supervised learning to PPI extraction is of great significance. Existing semi-supervised methods have two limitations: 1) they cannot make full use of the information in unlabeled data; 2) they rely on text augmentation methods such as back-translation. Therefore, this work proposes a semi-supervised PPI extraction method based on label propagation and sentence embedding. It represents text as numerical features through sentence embedding, and then assigns pseudo-labels to unlabeled data through label propagation, thereby completing semi-supervised training. Experiments on public datasets for PPI extraction show that the proposed method achieves competitive performance with only a small amount of labeled data. Specifically, with 250 labeled examples, it achieves F1 scores of 36.8% and 53.4% on AIMed and BioInfer, respectively.

Zhan Tang, Xuchao Guo, Lei Diao, Zhao Bai, Longhe Wang, Lin Li
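
The core recipe (sentence embeddings plus label propagation) can be reproduced in miniature with scikit-learn, marking unlabeled sentences with -1. In the sketch below the random features merely stand in for real sentence embeddings, and the kernel settings are illustrative.

```python
# Minimal semi-supervised sketch: pseudo-label unlabeled sentence
# embeddings with scikit-learn's LabelPropagation (-1 = unlabeled).
import numpy as np
from sklearn.semi_supervised import LabelPropagation

def propagate(embeddings: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """embeddings: (n, d) sentence vectors; labels: (n,), -1 = unlabeled."""
    model = LabelPropagation(kernel="rbf", gamma=0.25)
    model.fit(embeddings, labels)
    return model.transduction_  # pseudo-labels for every sentence

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 32))        # stand-in for sentence embeddings
y = np.full(100, -1)
y[:10] = rng.integers(0, 2, size=10)  # only 10 labeled examples
pseudo = propagate(X, y)              # labels spread to the other 90
```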
Construction and Application of a Large-Scale Chinese Abstractness Lexicon Based on Word Similarity

As an important semantic feature, abstractness has been widely studied in linguistics, psychology, the cognitive sciences, and other fields. Many languages have abstractness lexicons, but there has never been a large-scale, high-quality abstractness lexicon for Chinese. Since manual construction is time-consuming and costly, we use existing resources with human abstractness scores as seed data and adopt a word similarity-based approach to automatically construct a large-scale Chinese abstractness lexicon. We evaluate the quality of the constructed lexicon by comparing it with expert knowledge and previous work, verifying that the lexicon is roughly consistent with human cognition and can provide reliable abstractness ratings for words. Finally, the performance of the lexicon on two research tasks, cross-language comparison and automatic Chinese text readability evaluation, shows that word abstractness is an important feature for investigating cognitive differences and text complexity. The large-scale Chinese abstractness lexicon constructed in this paper thus has important application value.

Huidan Xu, Lijiao Yang
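
One common similarity-based construction, sketched below, estimates an unrated word's abstractness as the similarity-weighted average of the human ratings of its most similar rated words. The weighting scheme and neighborhood size k here are our illustrative assumptions, not necessarily the paper's exact formula.

```python
# Sketch: propagate human abstractness ratings to an unrated word via
# cosine similarity over pre-trained word vectors.
import numpy as np

def estimate_abstractness(
    word_vec: np.ndarray,    # vector of the unrated word, shape (d,)
    seed_vecs: np.ndarray,   # (n_seeds, d) vectors of human-rated words
    seed_scores: np.ndarray, # (n_seeds,) human abstractness ratings
    k: int = 10,
) -> float:
    # Cosine similarity to every rated seed word.
    sims = seed_vecs @ word_vec / (
        np.linalg.norm(seed_vecs, axis=1) * np.linalg.norm(word_vec) + 1e-9
    )
    top = np.argsort(sims)[-k:]        # k most similar rated words
    w = np.clip(sims[top], 0, None)    # keep non-negative weights
    return float(w @ seed_scores[top] / (w.sum() + 1e-9))
```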
Stepwise Masking: A Masking Strategy Based on Stepwise Regression for Pre-training

Recently, capturing task-specific and domain-specific patterns during pre-training has been shown to help models better adapt to downstream tasks. Existing methods usually use a large-scale domain corpus and downstream supervised data to further pre-train language models, which brings a large computational burden, and such data are difficult to obtain in most cases. To address these issues, we propose a pre-training method with a novel masking strategy called stepwise masking. The method employs stepwise masking to mine tokens related to the downstream task in mid-scale in-domain data and masks them. The model is then trained on these annotated data. In this stage, task-guided pre-training enables the model to learn task-specific and domain-specific patterns simultaneously and efficiently. Experimental results on sentiment analysis tasks show that our method effectively improves model performance.

Jie Pan, Shuxia Ren, Dongzhang Rao, Zongxian Zhao, Wenshi Xue
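
As a rough illustration of task-guided masking, suppose we already have a per-token relevance score for the downstream task (the paper derives relevance via stepwise regression; here `relevance` is just a placeholder): the most task-relevant tokens are the ones replaced with [MASK] when building pre-training data.

```python
# Sketch: mask the tokens most relevant to the downstream task, so the
# model must predict exactly the task-bearing content during pre-training.
from typing import Callable, List

def mask_task_relevant(
    tokens: List[str],
    relevance: Callable[[str], float],  # stand-in for stepwise-regression scores
    ratio: float = 0.15,
    mask_token: str = "[MASK]",
) -> List[str]:
    n_mask = max(1, int(len(tokens) * ratio))
    # Indices sorted by ascending relevance; take the most relevant tail.
    ranked = sorted(range(len(tokens)), key=lambda i: relevance(tokens[i]))
    to_mask = set(ranked[-n_mask:])
    return [mask_token if i in to_mask else t for i, t in enumerate(tokens)]
```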

Evaluation Workshop (Poster)

Frontmatter
Context Enhanced and Data Augmented System for Named Entity Recognition

This paper describes the system proposed by the YSF2022 team for NLPCC 2022 shared task 5 [3] on Named Entity Recognition Model for English Scientific Literature. The task requires participants to develop a named entity recognition (NER) model for domain-specific texts, based on state-of-the-art NLP and deep learning techniques, using labeled domain-specific sentences covering seven entity types. Without the luxury of abundant training data, we proposed two methods to improve performance: capturing document-level features and performing data augmentation with entity replacement. Besides, instead of using a traditional sequence labeling model, we model NER as word-word relation classification. We additionally apply an Entity Confidence Filter (ECF) and Result Ensemble (RE) to further improve performance. According to the official results, our approach ranks 1st on the NER track of this task.

Chunping Ma, Zijun Xu, Minwei Feng, Jingcheng Yin, Liang Ruan, Hejian Su
Multi-task Hierarchical Cross-Attention Network for Multi-label Text Classification

As the quantity of scientific publications grows significantly, manual indexing of literature becomes increasingly complex, and researchers have attempted to apply Hierarchical Multi-label Text Classification (HMTC) techniques to classify scientific literature. Despite many advances, some problems in HMTC remain unsolved, such as the difficulty of capturing the dependencies among hierarchical labels and the correlation between labels and text, and the limited adaptability of models to specialized text. In this paper, we propose a novel framework called Multi-task Hierarchical Cross-Attention Network (MHCAN) for multi-label text classification. Specifically, we introduce a cross-attention mechanism to fully incorporate text representations and hierarchical labels with a directed acyclic graph (DAG) structure, and design an iterative hierarchical-attention module to capture the dependencies between layers. Our framework then jointly optimizes a weighted loss over all levels. To improve the adaptability of the model to domain data, we also further pre-train SciBERT on unlabeled data and introduce adversarial training. Our framework ranks 2nd in NLPCC 2022 Shared Task 5 Track 1 (Multi-label Classification Model for English Scientific Literature). The experimental results show the effectiveness of the modules applied in this framework.

Junyu Lu, Hao Zhang, Zhexu Shen, Kaiyuan Shi, Liang Yang, Bo Xu, Shaowu Zhang, Hongfei Lin
An Interactive Fusion Model for Hierarchical Multi-label Text Classification

Scientific research literature usually has multi-level labels, and there are often dependencies between them. It is crucial for a model to learn and integrate the information between multi-level labels for hierarchical multi-label text classification (HMTC) of scientific literature. Therefore, for the HMTC task on scientific literature, we use the pre-trained language model SciBERT, trained on scientific texts, and introduce a shared TextCNN layer in our multi-task learning architecture to learn the dependency information between labels at each level. The hierarchical feature information is then fused and propagated from top to bottom according to the task level. We conduct ablation experiments on the dependency-information interaction module and the hierarchical-information fusion and propagation module. Experimental results on the NLPCC 2022 Shared Task 5 Track 1 dataset demonstrate the effectiveness of our model, which ranked 4th in the task.

Xiuhao Zhao, Zhao Li, Xianming Zhang, Jibin Wang, Tong Chen, Zhengyu Ju, Canjun Wang, Chao Zhang, Yiming Zhan
Scene-Aware Prompt for Multi-modal Dialogue Understanding and Generation

This paper introduces Team LingJing's experiments in the NLPCC-2022 Shared Task 4, Multi-modal Dialogue Understanding and Generation (MDUG). The MDUG task can be divided into two phases: multi-modal context understanding and response generation. To fully leverage visual information for both scene understanding and dialogue generation, we propose a scene-aware prompt for the MDUG task. Specifically, we utilize a multi-task strategy to jointly model scene- and session-level multi-modal understanding. Visual captions are adopted to capture the scene information, while a fixed-type templated prompt based on the scene- and session-aware labels is used to further improve dialogue generation performance. Extensive experimental results show that the proposed method achieves state-of-the-art (SOTA) performance compared with other competitive methods; we ranked 1st in all three subtasks of the MDUG competition.

Bin Li, Yixuan Weng, Ziyu Ma, Bin Sun, Shutao Li
BIT-WOW at NLPCC-2022 Task5 Track1: Hierarchical Multi-label Classification via Label-Aware Graph Convolutional Network

This paper describes the system proposed by the BIT-WOW team for the NLPCC 2022 shared task, Task 5 Track 1. The track concerns multi-label classification of abstracts of academic papers in the scientific domain, with hierarchical dependencies among 1,530 labels. In order to distinguish semantic information among hierarchical label structures, we propose the Label-aware Graph Convolutional Network (LaGCN), which uses a Graph Convolutional Network to capture label associations through context-based label embeddings. Besides, curriculum learning is applied for domain adaptation and to mitigate the impact of the large number of categories. The experiments show that: 1) LaGCN effectively models the category information and brings a considerable improvement in dealing with a large number of categories; 2) curriculum learning is beneficial for a single model on this complex task. Our best results were obtained by an ensemble model. According to the official results, our approach performed best in this track.

Bo Wang, Yi-Fan Lu, Xiaochi Wei, Xiao Liu, Ge Shi, Changsen Yuan, Heyan Huang, Chong Feng, Xianling Mao
CDAIL-BIAS MEASURER: A Model Ensemble Approach for Dialogue Social Bias Measurement

Dialogue systems based on neural networks trained on large-scale corpora have a variety of practical applications today. However, using uncensored training corpora carries risks, such as potential social bias issues, and manually reviewing these corpora for social bias content is costly. It is therefore necessary to design a recognition model that automatically detects social bias in dialogue systems. NLPCC 2022 Shared Task 7, Fine-Grain Dialogue Social Bias Measurement, aims to measure social bias in dialogue systems and provides a well-annotated Chinese social bias dialogue dataset, CDAIL-BIAS DATASET. Based on this dataset, this paper proposes a powerful classifier, CDAIL-BIAS MEASURER. Specifically, we adopt a model ensemble approach that combines five different pre-trained language models and uses adversarial training and a regularization strategy to enhance the robustness of the model. Finally, labels are obtained using a novel label-based weighted voting method. The resulting classifier achieves a macro F1 score of 0.580 for social bias measurement in dialogue systems. Our result ranks third, demonstrating the effectiveness and superiority of our model.

Jishun Zhao, Shucheng Zhu, Ying Liu, Pengyuan Liu
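
Label-based weighted voting can be sketched in a few lines: each ensemble member's vote is weighted by a per-label weight (for instance, its validation score on that label), so a model that is strong on a class counts more when it predicts that class. The weights and labels below are made up for illustration; the paper's exact weighting scheme may differ.

```python
# Sketch of label-based weighted voting over an ensemble of classifiers.
from collections import defaultdict
from typing import Dict, List

def weighted_vote(
    predictions: List[int],                 # one predicted label per model
    label_weights: List[Dict[int, float]],  # per-model, per-label weights
) -> int:
    scores: Dict[int, float] = defaultdict(float)
    for pred, weights in zip(predictions, label_weights):
        scores[pred] += weights.get(pred, 1.0)
    return max(scores, key=scores.get)

# Three models vote 1, 1, 2; the third model is very reliable on label 2,
# so its single vote outweighs the other two.
print(weighted_vote([1, 1, 2], [{1: 0.5}, {1: 0.6}, {2: 1.5}]))  # -> 2
```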
A Pre-trained Language Model for Medical Question Answering Based on Domain Adaption

With the successful application of question answering (QA) in human-computer interaction scenarios such as chatbots and search engines, medical QA systems have gradually attracted widespread attention, because they can not only help professionals make decisions efficiently but also supply non-professionals with advice when they seek useful information. However, due to the professionalism of domain knowledge, it is still hard for existing medical QA systems to understand professional medical knowledge, which leaves them unable to generate fluent and accurate answers. The goal of this paper is to further train the language model on the basis of pre-training: with better language models, we can obtain better medical QA models. Through the combination of DAP and TAP, the model comes to understand the knowledge of the medical domain and the task, which helps QA models generate fluent and accurate answers and achieve good results.

Lang Liu, Junxiang Ren, Yuejiao Wu, Ruilin Song, Zhen Cheng, Sibo Wang
Enhancing Entity Linking with Contextualized Entity Embeddings

Entity linking (EL) in written-language domains has been extensively studied, but EL of spoken language is still underexplored. We propose a conceptually simple and highly effective two-stage approach to tackle this issue. The first stage retrieves candidates with a dual encoder, which independently encodes the mention context and entity descriptions. Each candidate is then reranked by a LUKE-based cross-encoder, which concatenates the mention and entity description. Unlike previous cross-encoders, which take only words as input, our model adds entities to the input. Experiments demonstrate that our model does not need large-scale training on a Wikipedia corpus and outperforms all previous models with or without Wikipedia training. Our approach ranks 1st in the NLPCC 2022 Shared Task on Speech EL, Track 2 (Entity Disambiguation-Only).

Zhenran Xu, Yulin Chen, Senbao Shi, Baotian Hu
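
The retrieve-then-rerank pipeline described above can be summarized in a short sketch. Here `encode` and `cross_score` are placeholders standing in for the trained dual encoder and the LUKE-based cross-encoder; the dot product over independently computed embeddings is what makes stage 1 fast, while stage 2 scores each pair jointly.

```python
# Sketch of dual-encoder retrieval followed by cross-encoder reranking.
import numpy as np
from typing import Callable, List

def link_entity(
    mention_ctx: str,
    entities: List[str],                       # entity descriptions
    encode: Callable[[str], np.ndarray],       # dual-encoder text encoder
    cross_score: Callable[[str, str], float],  # cross-encoder pair scorer
    k: int = 16,
) -> str:
    # Stage 1: embed mention context and entities independently; retrieve
    # the top-k candidates by dot-product similarity.
    q = encode(mention_ctx)
    sims = np.array([float(encode(e) @ q) for e in entities])
    shortlist = [entities[i] for i in np.argsort(sims)[-k:]]
    # Stage 2: rerank the shortlist with the slower, stronger cross-encoder.
    return max(shortlist, key=lambda e: cross_score(mention_ctx, e))
```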
A Fine-Grained Social Bias Measurement Framework for Open-Domain Dialogue Systems

Pre-trained models based on large-scale corpora can effectively improve the performance of open-domain dialogue systems. However, recent studies have shown various ethical issues in pre-trained models that seriously affect the application of dialogue systems. Social bias is particularly complex among these ethical issues, because its negative impact on marginalized populations is often implicit and therefore requires normative reasoning and rigorous analysis. In this paper, we report the solution of team BERT 4EVER for NLPCC 2022 Shared Task 7, Fine-Grain Dialogue Social Bias Measurement, which aims to measure social bias in dialogue scenarios. Specifically, we study fine-grained social bias measurement in open-domain dialogue systems and construct a framework based on prompt learning and contrastive learning. We propose a two-stage prompt learning method that first identifies whether a text involves fairness topics and then identifies the bias of texts that do. To enable the model to better learn the complete label information (i.e., irrelevant, anti-bias, neutral, and biased) in the first-stage prompt learning, we employ a contrastive learning module to further regularize the text representations of samples with the same label into a uniform semantic space. On the NLPCC 2022 Task 7 final test, our proposed framework achieved second place with a macro F1 score of 59.02%.

Aimin Yang, Qifeng Bai, Jigang Wang, Nankai Lin, Xiaotian Lin, Guanqiu Qin, Junheng He
Dialogue Topic Extraction as Sentence Sequence Labeling

The topic information of dialogue text is important for a model to understand the intentions of the dialogue participants and to abstractively summarize the content of the dialogue. The dialogue topic extraction task aims to extract the evolving topic information in long dialogue texts. In this work, we focus on topic extraction from dialogue texts in customer service scenarios. Based on the rich sequence features in the topic tags, we define this task as a sequence labeling task with sentences as the basic elements. For this task, we build a dialogue topic extraction system using a Chinese pre-trained language model and a CRF model. In addition, we use sliding windows to avoid excessive loss of contextual information, and use adversarial training and model ensembling to improve the performance and robustness of our model. Our system ranks first on track 1 of the NLPCC-2022 shared task on Dialogue Text Analysis, Topic Extraction and Dialogue Summary.

Dinghao Pan, Zhihao Yang, Haixin Tan, Jiangming Wu, Hongfei Lin
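
The sliding-window trick amounts to splitting a long dialogue into overlapping sentence chunks so every sentence is tagged with context on both sides, after which per-sentence predictions from overlapping windows can be merged. A minimal sketch, with illustrative window and stride sizes:

```python
# Sketch: overlapping sentence windows for sentence-level sequence labeling.
from typing import List

def sliding_windows(sentences: List[str], size: int = 32, stride: int = 16):
    """Yield (start_index, window) pairs covering the whole dialogue."""
    for start in range(0, max(1, len(sentences) - stride), stride):
        yield start, sentences[start:start + size]

dialogue = [f"sent{i}" for i in range(70)]
for start, window in sliding_windows(dialogue):
    # Each overlapping chunk would be fed to the PLM+CRF tagger; overlaps
    # give interior sentences context from both directions.
    print(start, len(window))
```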
Knowledge Enhanced Pre-trained Language Model for Product Summarization

Automatic summarization has been successfully applied to many scenarios such as news and information services and assisted recommendation. E-commerce product summarization is also a scenario with great economic value and attention, as it can help generate text that matches the product information and inspires users to buy. However, existing algorithms still face a challenge: the generated summaries can contain incorrect attributes that are inconsistent with the original products and mislead users, reducing the credibility of e-commerce platforms. The goal of this paper is to enhance product data with attributes, based on pre-trained models that are trained to understand the domain knowledge of products and to generate smooth, relevant, and faithful text that attracts users to buy.

Wenbo Yin, Junxiang Ren, Yuejiao Wu, Ruilin Song, Lang Liu, Zhen Cheng, Sibo Wang
Augmented Topic-Specific Summarization for Domain Dialogue Text

This paper describes HW-TSC's submission to the NLPCC 2022 dialogue text summarization task. We convert the task into sub-summary generation and topic detection. A sequence-to-sequence Transformer is adopted as the foundational structure of our generation model, and an ensemble topic detection model is used to filter uninformative summaries. We also utilize multiple data processing and data augmentation methods to improve the effectiveness of the system: a constrained search method constructs the generation model's training pairs between sub-dialogues and sub-summaries, and multiple role-centric training data augmentation strategies enhance both the generation model and the topic detection model. Our experiments demonstrate the effectiveness of these methods. Finally, we rank first with the highest ROUGE score of 51.764 in the test evaluation.

Zhiqiang Rao, Daimeng Wei, Zongyao Li, Hengchao Shang, Jinlong Yang, Zhengzhe Yu, Shaojun Li, Zhanglin Wu, Lizhi Lei, Hao Yang, Ying Qin
DAMO-NLP at NLPCC-2022 Task 2: Knowledge Enhanced Robust NER for Speech Entity Linking

Speech Entity Linking aims to recognize and disambiguate named entities in spoken language. Conventional methods suffer gravely from unfettered speech styles and the noisy transcripts generated by ASR systems. In this paper, we propose a novel approach called Knowledge Enhanced Named Entity Recognition (KENER), which focuses on improving robustness by painlessly incorporating proper knowledge in the entity recognition stage, thereby improving the overall performance of entity linking. KENER first retrieves candidate entities for a sentence without mentions, and then utilizes the entity descriptions as extra information to help recognize mentions. The candidate entities retrieved by a dense retrieval module are especially useful when the input is short or noisy. Moreover, we investigate various data sampling strategies and design effective loss functions, in order to improve the quality of retrieved entities in both the recognition and disambiguation stages. Lastly, a linking-with-filtering module is applied as a final safeguard, making it possible to filter out wrongly recognized mentions. Our system achieves 1st place in Track 1 and 2nd place in Track 2 of NLPCC-2022 Shared Task 2.

Shen Huang, Yuchen Zhai, Xinwei Long, Yong Jiang, Xiaobin Wang, Yin Zhang, Pengjun Xie
Overview of the NLPCC2022 Shared Task on Speech Entity Linking

In this paper, we present an overview of the NLPCC 2022 Shared Task on Speech Entity Linking, which aims to study entity linking methods for spoken language. The task includes two tracks: Entity Recognition and Disambiguation (track 1) and Entity Disambiguation-Only (track 2). 20 teams registered for this challenging task, and the top system achieved 0.7460 F1 in track 1 and 0.8884 in track 2. We present the task description, the dataset, the team submission rankings, and an analysis of the results.

Ruoyu Song, Sijia Zhang, Xiaoyu Tian, Yuhang Guo
Overview of the NLPCC 2022 Shared Task on Multimodal Product Summarization

We introduce the NLPCC 2022 shared task on multimodal product summarization. This task aims at generating a condensed textual summary for a given product, where the input contains a detailed product description, a product knowledge base, and a product image. 29 teams registered for the task, among which 5 teams submitted results. In this paper, we present the task definition, the dataset, and the evaluation results for this shared task.

Haoran Li, Peng Yuan, Haoning Zhang, Weikang Li, Song Xu, Youzheng Wu, Xiaodong He
A Multi-task Learning Model for Fine-Grain Dialogue Social Bias Measurement

In recent years, the use of NLP models to predict people's attitudes toward social bias has attracted the attention of many researchers. Most existing work operates at the sentence level, i.e., judging whether a whole sentence is biased. In this work, we leverage pre-trained models' powerful semantic modeling capabilities to model the dialogue context. Furthermore, to exploit more features and improve the model's ability to identify bias, we propose two auxiliary tasks based on the dialogue's topic and type features. To achieve better classification results, we train two multi-task models with adversarial training and combine them by voting. We participated in the NLPCC-2022 shared task on Fine-Grain Dialogue Social Bias Measurement and ranked fourth with a Macro-F1 score of 0.5765. The code for our model is available on GitHub (https://github.com/33Da/nlpcc2022-task7).

Hanjie Mai, Xiaobing Zhou, Liqing Wang
Overview of NLPCC2022 Shared Task 5 Track 1: Multi-label Classification for Scientific Literature

Given the increasing volume of scientific literature in conferences, journals, and open-access websites, it is important to index these data hierarchically for intelligent retrieval. We organized Track 1 of NLPCC 2022 Shared Task 5 on multi-label classification for scientific literature. This paper summarizes the task information, the dataset, the models submitted by the participants, and the final results. Furthermore, we discuss key findings and challenges for hierarchical multi-label classification in the scientific domain.

Ming Liu, He Zhang, Yangjie Tian, Tianrui Zong, Borui Cai, Ruohua Xu, Yunfeng Li
Overview of the NLPCC 2022 Shared Task: Multi-modal Dialogue Understanding and Generation

In this paper, we give an overview of the NLPCC 2022 shared task on multi-modal dialogue understanding and generation, which includes three sub-tasks: dialogue scene identification, dialogue session identification, and dialogue response generation. A bilingual multi-modal dialogue dataset consisting of 100M utterances was made public for the shared task; it contains 119K dialogue scene boundaries and 62K dialogue session boundaries, both annotated manually. Details of the shared task, dataset, evaluation metric, and evaluation results are presented in order.

Yuxuan Wang, Xueliang Zhao, Dongyan Zhao
Overview of NLPCC2022 Shared Task 5 Track 2: Named Entity Recognition

This paper presents an overview of NLPCC 2022 shared task 5 track 2, Named Entity Recognition (NER), which aims at extracting entities of interest from domain-specific texts (materials science). The task provides 5,600 labeled sentences (in the BIO tagging format) collected from ACS materials science publications. Participants are required to train a NER model on these labeled sentences to automatically extract entities of materials science. 47 teams registered and 19 of them submitted results, which are summarized in the evaluation section. The best submitted model shows an improvement of around 0.07 in F1 score over the baseline BiLSTM-CRF model.

Borui Cai, He Zhang, Fenghong Liu, Ming Liu, Tianrui Zong, Zhe Chen, Yunfeng Li
Overview of NLPCC 2022 Shared Task 7: Fine-Grained Dialogue Social Bias Measurement

This paper presents an overview of shared task 7, Fine-Grained Dialogue Social Bias Measurement, at NLPCC 2022. We introduce the task, explain the construction of the provided dataset, analyze the evaluation results, and summarize the submitted approaches. This shared task aims to measure social bias in dialogue scenarios with a fine-grained categorization, which is challenging due to complex and implicit bias expressions; the context-sensitive bias responses in dialogue scenarios make the task even more complicated. We provide 25k instances for training and 3k for evaluation, collected from the Chinese question-answering forum Zhihu (www.zhihu.com). In addition to the above-mentioned bias attitude label, the dataset is finely annotated with multiple auxiliary labels. There were 11 participating teams and 35 submissions in total. We adopt the macro F1 score to evaluate the submitted results; the highest score is 0.5903. The submitted approaches focus on different aspects of this problem and use diverse techniques to boost performance. All relevant information can also be found at https://para-zhou.github.io/NLPCC-Task7-BiasEval/ .

Jingyan Zhou, Fei Mi, Helen Meng, Jiawen Deng
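
For reference, the macro F1 used to rank submissions averages per-class F1 with equal weight per class, so rare classes count as much as frequent ones. A quick check with scikit-learn over the task's four bias attitude labels (the example labels below are made up):

```python
# Macro F1 averages per-class F1 scores with equal class weight.
from sklearn.metrics import f1_score

# 0=irrelevant, 1=anti-bias, 2=neutral, 3=biased (illustrative data)
y_true = [0, 1, 2, 3, 1, 2, 0, 3]
y_pred = [0, 1, 2, 2, 1, 2, 0, 3]
print(f1_score(y_true, y_pred, average="macro"))
```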
Overview of the NLPCC 2022 Shared Task: Dialogue Text Analysis (DTA)

In this paper, we present an overview of the NLPCC 2022 shared task on Dialogue Text Analysis (DTA). The evaluation consists of two sub-tasks: (1) Dialogue Topic Extraction (DTE) and (2) Dialogue Summary Generation (DSG). We manually annotated a large-scale corpus for DTA, in which each dialogue is a conversation between a customer and a service agent. A total of 50+ teams participated in the DTA evaluation task. We believe that DTA will push forward research in the field of dialogue text analysis.

Qingliang Miao, Tao Guan, Yifan Yang, Yifan Zhang, Hua Xu, Fujiang Ge
Backmatter
Metadata
Title
Natural Language Processing and Chinese Computing
Edited by
Wei Lu
Shujian Huang
Yu Hong
Xiabing Zhou
Copyright Year
2022
Electronic ISBN
978-3-031-17189-5
Print ISBN
978-3-031-17188-8
DOI
https://doi.org/10.1007/978-3-031-17189-5