
Open Access 12.03.2024 | Original Article

Enhanced Chinese named entity recognition with multi-granularity BERT adapter and efficient global pointer

Authors: Lei Zhang, Pengfei Xia, Xiaoxuan Ma, Chengwei Yang, Xin Ding

Published in: Complex & Intelligent Systems


Abstract

Named Entity Recognition (NER) plays a crucial role in Natural Language Processing, with significant value in applications such as information extraction, knowledge graphs, and question–answering systems. However, Chinese NER faces challenges such as semantic complexity, uncertain entity boundaries, and nested structures. To address these issues, this study proposes an innovative approach, the Multi-Granularity BERT Adapter and Efficient Global Pointer (MGBERT-Pointer). The semantic encoding layer adopts a Multi-Granularity Adapter (MGA), while the decoding layer employs an Efficient Global Pointer (EGP) network, and the two operate collaboratively. The MGA, which combines a Character Adapter, an Entity Adapter, and a Lexicon Adapter through interactive mechanisms, is deeply integrated into the BERT base, significantly enhancing the model's ability to handle complex contexts and ambiguities. The EGP, utilizing Rotary Position Embedding, resolves the lack of boundary information in traditional attention mechanisms, thereby improving the model's understanding and recognition of nested entity structures. Experimental results on four public datasets demonstrate that the MGBERT-Pointer model significantly improves Chinese NER performance.

Introduction

Named Entity Recognition (NER), as a fundamental task in Natural Language Processing (NLP), aims to accurately identify and classify named entities in text, such as personal names, locations, organizations, etc. [1]. NER plays a crucial role in various NLP applications, including information extraction [2, 3], knowledge graphs [4], question–answering systems [5, 6], and machine translation [7]. In the English domain, NER has made significant progress, particularly by adopting deep learning network structures such as LSTM-CRF [8–10], achieving mature results in terms of performance. However, Chinese NER must contend with complex semantics, uncertain entity boundaries, and the definition of nested entities [11], making research in Chinese NER more challenging.
Chinese NER exhibits unique complexity when facing contextual challenges. First, owing to the diversity of language expressions, the same entity may have different meanings in different contexts, leading to the problem of complex context. For example, in the Chinese sentence illustrated in Fig. 1, “苹果” (apple) could refer to a fruit or a technology company, which increases the difficulty for the model to understand the context. Second, unclear entity boundaries and the existence of polysemous words exacerbate ambiguity, posing a greater challenge for the model to correctly identify entities. Taking Fig. 1 as an example, “苹果” (apple) may be an independent entity depending on the context, while “苹果公司” (Apple Corporation) could also be an independent entity, which complicates accurate entity classification. Furthermore, Chinese NER is more challenging than English NER because Chinese words are rarely separated by spaces, reducing the certainty of entity boundaries and increasing the complexity of capturing them. Chinese NER models often rely directly on character information and may ignore valuable word information. Word information is crucial in Chinese NER [12, 13] because word boundaries usually align with named entity boundaries; as shown in Fig. 1, the boundaries of the named entity “美国” (United States) and the word “美国” are the same. Currently, most research addresses these challenges by incorporating external dictionaries, as in Lattice-LSTM [14] and LEBERT [15]. For instance, LEBERT integrates dictionary features into the model's underlying structure using a single BERT adapter. However, these methods rely solely on dictionary information and overlook other semantic information.
On the other hand, Chinese NER also faces the challenge of nested entities. In Chinese text, entities often have complex hierarchical relationships, which means that one entity may be nested within another. This introduces additional semantic complexity, as NER systems need to accurately capture the hierarchical relationships between entities to ensure proper boundary labeling. As shown in Fig. 1, the entity “苹果” is nested within another entity “苹果公司”, and “美国乔布斯” (Steve Jobs from the United States) might be mistakenly treated as a single entity because the entities “美国” and “乔布斯” (Steve Jobs) are adjacent. Rapidly and accurately obtaining nested entity information for semantic understanding is challenging due to the complexity of nested entity structures and the irregular granularity and depth of nesting. Current methods for nested NER are primarily span-based, learning representations of the heads and tails of spans [16–18]. However, these methods often ignore boundary information and are relatively cumbersome to integrate with other models.
To address the above challenges, this paper proposes an innovative approach, the Multi-Granularity BERT Adapter and Efficient Global Pointer (MGBERT-Pointer), to significantly enhance the performance of Chinese NER. Specifically, at the semantic encoding layer, a novel Multi-Granularity Adapter (MGA) fusion method is designed to improve the model's understanding and expression of semantic information at different levels. MGA includes a Character Adapter (CA), an Entity Adapter (EA), and a Lexicon Adapter (LA), which are fully integrated through information exchange. CA combines the local features and global context information of characters, helping the model understand the meaning of each character in different contexts more accurately and addressing complex contextual issues. EA incorporates entity-related information into the model, enabling better understanding and identification of ambiguous entities. LA introduces crucial word boundary information by integrating information from external knowledge bases, helping to resolve the issue of unclear entity boundaries. MGA is a lightweight parameterized module that is deeply integrated into the BERT model and provides a powerful semantic modeling tool to enhance model performance.
Simultaneously, this paper introduces an Efficient Global Pointer (EGP) at the decoding layer to address the lack of boundary position information in traditional attention mechanisms by incorporating Rotary Position Embedding. Leveraging a globally normalized strategy, it recognizes nested and non-nested entities efficiently and without distinction. While the lightweight nature of MGA brings significant advantages in semantic modeling, it may encounter boundary ambiguity when dealing with nested entity structures. The introduction of the Global Pointer (GP) addresses this limitation by enhancing the model's understanding and recognition of entity nesting structures through global normalization, allowing a more precise capture of hierarchical relationships between entities. This multi-level collaboration enables the MGBERT-Pointer framework to achieve a more significant performance improvement in Chinese NER tasks.
In summary, this paper makes the following main contributions:
  • The paper designs CA, EA, and LA, and innovatively integrates them into MGA. Through deep integration into the BERT model, the design of this MGA significantly enhances the model's ability to address challenges such as complex contexts, ambiguity, and entity nesting, resulting in a notable improvement in performance on Chinese NER tasks.
  • To address the issue of entity nesting in Chinese NER, the EGP is introduced. Using a novel approach of global normalization, this network efficiently and indifferently distinguishes between nested and non-nested entities, greatly enhancing the model's understanding and recognition capabilities of entity nesting structures. This makes the NER system more accurate and reliable when dealing with complex hierarchical relationships.
  • By integrating MGA and EGP, the paper proposes a comprehensive and effective framework. This framework demonstrates state-of-the-art performance on public datasets, bringing innovations and breakthroughs to the field of Chinese NER research.
The work presented in this paper relates to existing neural network methods that improve Chinese NER models using BERT adapters and pointer networks.

Enhanced Chinese NER

The traditional approaches for Chinese NER include rule-based systems [19], Conditional Random Fields (CRF) [20, 21], and sequence-labeling models such as the Hidden Markov Model (HMM) [22, 23] and the Maximum Entropy Model (MEM) [24]. In recent years, significant progress has been made in Chinese NER through the application of deep learning methods. Convolutional Neural Networks (CNN) [12, 25], Recurrent Neural Networks (RNN) [8, 14], and Graph Neural Networks (GNN) [26, 27] were once the primary deep learning architectures used in Chinese NER tasks. However, with the rise of pre-trained models, a revolutionary transformation has taken place in NER. Pre-trained models, especially those based on the Transformer architecture such as BERT, have significantly improved NER performance by incorporating bidirectional context modeling to better capture contextual information [28]. Consequently, the approach presented in this paper is based on the BERT model.
In previous research, Chinese NER has made a series of progress, with two notable approaches being the integration of dictionary information and the recognition technology of nested named entities. To address the unique challenges posed by Chinese characters, recent methods have involved integrating dictionary information at the character level into sentences. The main strategies can be categorized into two types: dynamic improvements in the sequence modeling layer and enhancements in the embedding representation layer. At the sequence modeling layer, dynamic approaches, such as the Lattice-LSTM model proposed by Zhang [14], introduce word-level LSTM units between non-adjacent words in the character-level model to effectively utilize information between words. Zhao et al.’s AT-Lattice-LSTM-CRF method [29], based on adversarial training, balances the model’s consideration of word and character information through LSTM encoding, thereby fully utilizing clinical entity information in electronic health records. Other models, such as LR-CNN [12], dictionary-based GNN model with global semantics [26], and Flat-Lattice-Transformer [30], employ different mechanisms to integrate lexical information, enhancing performance on large datasets. At the embedding representation layer, methods like WC-LSTM [31] and Soft-Lexicon [13] integrate lexical information into the model by modifying the embedding layer. These models effectively leverage word boundary information and alleviate errors caused by Chinese word segmentation.
In response to the widespread presence of nested entities in real text, researchers have introduced innovative nested NER methods. Ju et al. [32] introduced a dynamic hierarchical model that dynamically stacks flat NER layers, passing the information obtained from each layer to the next to recognize nested named entities at multiple levels. Experimental results demonstrate that using internal entities significantly enhances the detection of external entities. Straková et al. [33] proposed a neural architecture for nested NER that concatenates the multiple labels of nested entities into multi-labels. In addition, Sohrab and Miwa [34] presented a simple neural network model that first enumerates all possible subsequences as potential entities and then classifies them with a deep neural network. Zheng et al. [35] introduced a boundary-aware neural model for nested NER, using a sequence-labeling model to achieve accurate entity localization. Fisher and Vlachos [36] proposed a new nested NER neural network structure that predicts internal relationships between entities without enumeration or boundary prediction, thereby reducing the number of candidate subsequences.
However, approaches based on dictionary information integration face challenges in model transferability, slow execution at the dynamic sequence-modeling and embedding layers, and poor recognition of new words. On the other hand, nested NER methods based on dynamic hierarchical models suffer from a high probability of error propagation, long training times, and explosive growth in the number of labels. To address these problems, this paper proposes an innovative Chinese NER method, MGBERT-Pointer, which aims to comprehensively remedy the drawbacks of both dictionary information integration and nested NER methods, providing a more efficient and accurate solution for Chinese NER tasks.

BERT adapter

The BERT Adapter [37] is a lightweight parameterized module that can be embedded into a pre-trained BERT model to fine-tune it for specific tasks. The introduction of the adapter allows BERT to be applied to various NLP tasks, including NER, without retraining the entire BERT model. The BERT Adapter offers an effective way to personalize the BERT model, adapting it to the specific requirements of a given task without introducing a large number of additional parameters [38].
BERT Adapter, as an effective method, introduces a highly portable and parameter-efficient information transfer mechanism by adding adapters between layers of a pre-trained model and fine-tuning only the parameters relevant to a specific task for downstream applications. In previous work, Bapna and Firat [39] injected task-specific adapter layers into pre-trained models of neural machine translation. MAD-X [40] built an adapter-based framework to achieve parameter-efficient transferability for arbitrary tasks. Wang et al.’s K-ADAPTER further incorporated external knowledge into pre-trained models [41]. LEBERT [15] directly integrated external lexical knowledge into BERT layers for Chinese sequence-labeling tasks. Similar to LEBERT, Guo [42] fused a LA into BERT as an encoder for Chinese NER tasks.
Building upon the foundation of a single BERT Adapter used in prior approaches, this paper introduces an innovative MGA to more comprehensively integrate information. This MGA contains CA, EA, and LA. Our goal is to deeply fuse these adapters into the lower layers of the BERT model, enabling a more holistic and accurate capture of Chinese context. Furthermore, to ensure that the direct integration of multigranular information does not impact the performance of BERT and considering the information differences between the two, the decision is made to fine-tune the original parameters of BERT rather than fix them.

Pointer network

Pointer network is a neural network structure commonly used for sequence-to-sequence tasks. In NER, Pointer network can be used to mark entity boundaries in a sequence, treating it as a sequence-to-sequence task. The advantage of this approach lies in its ability to automatically capture entity boundaries without the need for predefined tagging schemes.
Pointer network has been successful in various NLP tasks such as text summarization [43], machine translation [44], and shows potential applications in NER. Yan et al. [45] employed a unified sequence generation framework based on Pointer network for three different NER subtasks, including flat NER, nested NER, and disjoint NER, achieving state-of-the-art performance in these tasks. Zhai et al. [46] took the entity blocks as the basic annotation units and used Pointer network to solve sequence-labeling problems within these blocks. In contrast, Guo [42] incorporated character sequences into their method’s encoder and generated target index sequences through a decoder, extracting semantic features of simple Chinese NER tasks through joint encoding and decoding. However, these approaches have limitations in extracting global semantic features. To address this, Su et al. [47] proposed a span-based NER framework, GP, which uses a global normalization strategy to predict entities, allowing for more prominent extraction of global semantic features. Li et al. [48] introduced a model based on GP and adversarial training, enhanced the model’s robustness and generalization through adversarial training, and decoded the entity information through GP. Simultaneously, Zhang et al. [49] proposed the RGP-with-FGM model based on GP and adversarial learning, constructing Uighur Name Dataset (UHND) to train a Chinese NER model. Zhang and Liang [50] introduced external Chinese medical knowledge into their model, utilizing the GP method to identify named entities in Chinese biomedical literature.
To further optimize the performance of GP and improve parameter utilization, this paper devises the EGP, an innovative enhancement to the GP. It addresses the shortcomings of the original GP in terms of parameter utilization, and significantly reduces the number of parameters while achieving superior performance. It is noteworthy that despite the reduction in the number of parameters, the EGP demonstrates excellent results in multiple-task experiments, providing a more efficient solution for Chinese NER, balancing the improvement in model performance and speed.

Method

The proposed architecture of the enhanced BERT is illustrated in Fig. 2. Compared to BERT and LEBERT, MGBERT-Pointer has two main distinctions. First, MGA is appended between the Transformer layers, allowing effective integration of character information, lexicon information, and entity information into BERT. Second, the design incorporates the EGP, utilizing a global normalization approach for NER, indiscriminately identifying nested and non-nested entities.

Multi-Granularity adapter

MGA is the core component of the proposed method, which enables the model to perform feature extraction and self-adaptation at different granularities. This section describes three key components: CA, LA, and EA, and how they work together.

Character adapter

CA is a novel component proposed in this paper, designed to enhance the performance of the BERT model by combining the local features of characters with global context information. As depicted in Fig. 3, local character features are obtained through convolutional layers with different kernel sizes, global features are acquired through global max-pooling, and an attention mechanism is introduced to integrate these features with the BERT representation.
In CA, the first step involves using an embedding layer to convert the character input \(c\) into character embeddings \({h}^{c}\). To capture local character-level features, the CA utilizes a series of convolutional layers with different kernel sizes. The convolutional layers are defined as follows:
$$\begin{array}{c}conv=\{{{\text{conv}}}_{1},{{\text{conv}}}_{2},{{\text{conv}}}_{3}\},{{\text{conv}}}_{i}\in {\mathbb{R}}^{{{\text{out}}}_{{\text{channels}}}}\end{array}$$
(1)
Here, \(conv\) represents the set of convolutional layers, and each convolutional layer \(con{v}_{i}\) has a different number of output channels (\({{\text{out}}}_{{\text{channels}}}\)).
\({h}_{{\text{conv}}}^{c}\) represents the hidden states obtained from the convolutional layers, where each convolutional layer is applied to \({h}^{c}\) and the output is obtained through max-pooling:
$$\begin{array}{c}{h}_{{\text{conv}}}^{c}=\left[{{\text{conv}}}_{1}\left({h}^{c}\right),{{\text{conv}}}_{2}\left({h}^{c}\right),{{\text{conv}}}_{3}\left({h}^{c}\right)\right]\end{array}$$
(2)
In addition to local features, CA also aims to capture global character-level information. Therefore, the global max-pooling operation is applied to the character embeddings to obtain global features. \({g}^{c}\) represents the global character-level representation:
$$\begin{array}{c}{g}^{c}={\text{GlobalMaxPooling}}\left({h}^{c}\right)\end{array}$$
(3)
The outputs of the convolutional layers and global max-pooling are concatenated together:
$$\begin{array}{c}{h}_{{\text{b}}}^{c}={\text{Concat}}\left({h}_{{\text{conv}}}^{c},{g}^{c}\right)\end{array}$$
(4)
To align with the BERT hidden representation, \({h}_{{\text{b}}}^{c}\) is passed through a transformation layer:
$$\begin{array}{c}{h}_{c}={W}_{2}^{c}\left({\text{tan}}h\left({W}_{1}^{c}\cdot {h}_{{\text{b}}}^{c}+{b}_{1}^{c}\right)\right)+{b}_{2}^{c}\end{array}$$
(5)
where \({W}_{1}^{c},{W}_{2}^{c}\) are learned parameter matrices, and \({b}_{1}^{c}, {b}_{2}^{c}\) are scalar biases.
To integrate character-level information into BERT model, CA adopts an attention mechanism. The attention scores \(\mathrm{\alpha }\) between the BERT hidden layer output \({h}_{B}\) and the character-level representation \({h}_{c}\) are computed, and the attention mechanism is defined as follows:
$$\begin{array}{c}\alpha ={\text{Softmax}}\left({h}_{B}\cdot {W}_{\text{att}}^{c}\cdot {h}_{c}^{T}\right)\end{array}$$
(6)
where \({W}_{\text{att}}^{c}\) is a learned parameter matrix.
Then, the weighted sum of character embeddings is calculated using attention scores \(\alpha \), and this information is fused with the BERT output:
$$\begin{array}{c}{H}_{c}={h}_{B}+ \sum \limits _{i}{\alpha }_{i}\cdot {h}_{c,i}\end{array}$$
(7)
To maintain a stable representation, a dropout layer (DO) and layer normalization (LN) are applied to the fused representation, yielding the final hidden output \({{\varvec{H}}}_{{\varvec{c}}}\) of CA:
$$\begin{array}{c}{{\varvec{H}}}_{{\varvec{c}}}={\text{LN}}\left({\text{DO}}\left({H}_{c}\right)\right)\end{array}$$
(8)
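For concreteness, the CA computation in Eqs. (1)–(8) can be sketched in PyTorch as follows; the kernel sizes, dimensions, and the per-position treatment of the pooled features are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class CharacterAdapter(nn.Module):
    """A sketch of the Character Adapter (Eqs. 1-8); sizes are illustrative assumptions."""

    def __init__(self, char_vocab_size, char_dim=128, bert_dim=768, out_channels=64, p_drop=0.5):
        super().__init__()
        self.embed = nn.Embedding(char_vocab_size, char_dim)
        # Convolutions with different kernel sizes extract local character features (Eqs. 1-2).
        self.convs = nn.ModuleList(
            [nn.Conv1d(char_dim, out_channels, kernel_size=k, padding=k // 2) for k in (1, 3, 5)]
        )
        fused_dim = 3 * out_channels + char_dim         # conv features + global max-pooled vector
        self.w1 = nn.Linear(fused_dim, bert_dim)        # two-layer tanh transformation (Eq. 5)
        self.w2 = nn.Linear(bert_dim, bert_dim)
        self.w_att = nn.Linear(bert_dim, bert_dim, bias=False)  # attention weights (Eq. 6)
        self.dropout = nn.Dropout(p_drop)
        self.norm = nn.LayerNorm(bert_dim)

    def forward(self, char_ids, h_bert):
        # char_ids: (B, L); h_bert: (B, L, bert_dim), hidden states from a BERT layer
        h_c = self.embed(char_ids)                                   # (B, L, char_dim)
        x = h_c.transpose(1, 2)                                      # (B, char_dim, L) for Conv1d
        conv_feats = torch.cat([conv(x) for conv in self.convs], dim=1).transpose(1, 2)
        g_c = h_c.max(dim=1).values                                  # global max-pooling (Eq. 3)
        g_c = g_c.unsqueeze(1).expand(-1, h_c.size(1), -1)           # broadcast to every position
        h_b_c = torch.cat([conv_feats, g_c], dim=-1)                 # concatenation (Eq. 4)
        h_char = self.w2(torch.tanh(self.w1(h_b_c)))                 # align with BERT dim (Eq. 5)
        # Attention between BERT states and character representations (Eqs. 6-7).
        alpha = torch.softmax(torch.matmul(self.w_att(h_bert), h_char.transpose(1, 2)), dim=-1)
        fused = h_bert + torch.matmul(alpha, h_char)
        return self.norm(self.dropout(fused))                        # dropout + LayerNorm (Eq. 8)
```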

Entity adapter

EA is a model component designed to handle entity-level information and integrate entity-related details into the hidden representation of the BERT model. It helps improve NER performance, especially when dealing with text containing different entity labels. As shown in Fig. 4, EA adapts the hidden representation of BERT to entity embeddings through a transformation layer. It then applies the entity information to the entity embeddings and finally fuses the entity information with the BERT hidden representation.
EA acquires entity information from the model input, where individual entity information is represented as a triplet, including the entity type, start position, and end position. The entity information is structured as a matrix, denoted as \({h}^{e}\).
First, a multi-layer perceptron is applied to transform the BERT hidden representation \({h}_{{\text{B}}}\). The purpose of this transformation is to adapt the hidden representation to the entity information, resulting in the output representation \({e}_{{\text{B}}}\):
$$\begin{array}{c}{e}_{{\text{B}}}={W}_{2}^{e}\left(ReLU\left({W}_{1}^{e}\cdot {h}_{{\text{B}}}+{b}_{1}^{e}\right)\right)+{b}_{2}^{e}\end{array}$$
(9)
Here, \({W}_{1}^{e},{W}_{2}^{e}\) represent weight matrices, and \({b}_{1}^{e},{b}_{2}^{e}\) are scalar biases.
Applying the entity information \({h}^{e}\) to the entity embedding \({e}_{{\text{B}}}\) emphasizes the information related to a specific entity, as follows:
$$\begin{array}{c}{h}^{e}={h}^{e}\cdot {e}_{{\text{B}}}\end{array}$$
(10)
The aggregated entity information representation is obtained by summing the entity embeddings:
$$\begin{array}{c}{h}_{e}= \sum \limits _{i}{h}_{i}^{e}\end{array}$$
(11)
Finally, the entity information is fused with the BERT hidden representation to generate the fused representation. Dropout and layer normalization are then applied to the fused representation, resulting in the final hidden output \({{\varvec{H}}}_{{\varvec{e}}}\) of EA:
$$\begin{array}{c}{{\varvec{H}}}_{{\varvec{e}}}={\text{L}}{\text{N}}\left({\text{DO}}\left({h}_{{\text{B}}}+{h}_{e}\right)\right)\end{array}$$
(12)
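A minimal sketch of the EA computation in Eqs. (9)–(12) is given below; encoding the entity triplets as a binary span mask is an assumption made for illustration.

```python
import torch
import torch.nn as nn

class EntityAdapter(nn.Module):
    """A sketch of the Entity Adapter (Eqs. 9-12); the span-mask encoding is an assumption."""

    def __init__(self, bert_dim=768, hidden_dim=256, p_drop=0.5):
        super().__init__()
        self.w1 = nn.Linear(bert_dim, hidden_dim)
        self.w2 = nn.Linear(hidden_dim, bert_dim)
        self.dropout = nn.Dropout(p_drop)
        self.norm = nn.LayerNorm(bert_dim)

    def forward(self, h_bert, entity_mask):
        # h_bert: (B, L, D); entity_mask: (B, E, L), 1.0 where token i lies inside entity e
        e_b = self.w2(torch.relu(self.w1(h_bert)))             # adapt BERT states (Eq. 9)
        # Apply the entity information to the transformed representation (Eq. 10):
        h_e = entity_mask.unsqueeze(-1) * e_b.unsqueeze(1)     # (B, E, L, D)
        h_e = h_e.sum(dim=1)                                   # aggregate over entities (Eq. 11)
        return self.norm(self.dropout(h_bert + h_e))           # fuse, dropout, LayerNorm (Eq. 12)
```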

Lexicon adapter

LA enhances the representation capability of the BERT model by integrating rich information from an external knowledge base. It effectively incorporates key word boundary information, enabling the model to capture the boundaries of words and entities in the text more accurately, thus improving the precision and efficiency of the Chinese NER task. Inspired by recent work on the BERT Adapter in LEBERT, we propose a novel LA. Unlike the lexicon adapter in LEBERT, our LA not only injects lexicon information directly into BERT but also incorporates position encoding for the lexicon within the adapter. In addition, it adaptively learns the importance of different lexicon information.
As shown in Fig. 5, each character in the sentence is matched with an external lexicon to retrieve all potential words. Position encoding is added to each word to assist the model in capturing the word sequence. After the transformation layer, an adaptive vocabulary information fusion mechanism is introduced to adjust the importance of the vocabulary. Then, an attention mechanism in the Transformer layer is employed to select words most relevant to the entities and integrate character-word pairs. Finally, external lexicon knowledge is integrated with the BERT hidden representations.
When LA receives input, it treats characters and their corresponding words as pairs. For the \(i\)-th position in the character-word pair sequence, the input is represented as \(\left({h}_{i}^{c},{h}_{i}^{w}\right)\), where \({h}_{i}^{c}\) denotes the character vector, which is an output from a certain Transformer layer in BERT, and the set of \({h}_{i}^{c}\) constitutes the hidden vectors \({h}_{{\text{B}}}\) of BERT. \({h}_{i}^{w}\) includes a set of word embeddings used to represent the corresponding word. The \(j\)th word in \({h}_{i}^{w}\) is represented as:
$$\begin{array}{c}{h}_{ij}^{w}={e}^{w}\left({w}_{ij}\right)\end{array}$$
(13)
Here, \({e}^{w}\) is the word lookup table, and \({w}_{ij}\) is the \(j\)th word matched to the \(i\)th character.
Because the words matched to each character are sorted by character length, the granularity of the words differs, and larger granularity corresponds to larger positional values; positional encoding is therefore introduced to help the model capture the order and importance of lexical information. A position encoding matrix is defined to provide this positional information, with each row holding the position encoding vector for one position. Each vector is computed with sine and cosine functions of different frequencies, so that vectors at different positions remain distinguishable in vector space. The representation is as follows, where \(d\) denotes the dimension of the word embeddings:
$$\begin{array}{c}{p}_{ij}=\left\{\begin{array}{c}{\text{sin}}\left(\frac{i}{{\mathrm{10,000}}^{2j/d}}\right), \; {\text{if}} \; j \; {\text{is even}}\\ {\text{cos}}\left(\frac{i}{{\mathrm{10,000}}^{2j/d}}\right), \; {\text{if}} \; j \;{\text{is odd}}\end{array}\right.\end{array}$$
(14)
Next, the positional encoding matrix is added to the word embedding matrix to incorporate positional information into the word embeddings. This allows the model to simultaneously consider both lexical and positional information when processing input sequences. In order to align different representations, a non-linear transformation is applied to the word vectors that include positional information:
$$\begin{array}{c}{v}_{ij}^{w}={W}_{2}^{w}\left({\text{tan}}h\left({W}_{1}^{w}\left({h}_{ij}^{w}+{p}_{ij}\right)+{b}_{1}^{w}\right)\right)+{b}_{2}^{w}\end{array}$$
(15)
where \({W}_{1}^{w}\), \({W}_{2}^{w}\) are weight matrices, and \({b}_{1}^{w},{b}_{2}^{w}\) are scalar biases.
In order to learn the importance of lexical information, an adaptive weight parameter \({\mathrm{\alpha }}^{w}\) is introduced to adjust the significance of lexical information. Each lexical information \({v}_{ij}^{w}\) is multiplied by the adaptive weight \({\mathrm{\alpha }}_{ij}^{w}\) and then integrated into the model. The computation of the adaptive weight is as follows:
$$\begin{array}{c}{\mathrm{\alpha }}_{ij}^{w}=\sigma \left({\upbeta }_{ij}\right)\end{array}$$
(16)
where \(\sigma \) represents the Sigmoid activation function, and \({\beta }_{ij}\) represents the adaptive weight parameter. Finally, we obtain the fused lexical information representation \({x}_{ij}^{w}\):
$$\begin{array}{c}{x}_{ij}^{w}={v}_{ij}^{w}\cdot {\mathrm{\alpha }}_{ij}^{w}\end{array}$$
(17)
To select the most relevant word from all matched words, a character-to-word attention mechanism is applied in LA. Specifically, all \({x}_{ij}^{w}\) assigned to the \(i\)-th character are represented as \({X}_{i}\), and the relevance of each word \({a}_{i}\) can be calculated as:
$$\begin{array}{c}{a}_{i}=softmax\left({h}_{i}^{c}\cdot {W}^{w}\cdot {X}_{i}^{T}\right)\end{array}$$
(18)
where \({W}^{w}\) is the weight matrix for attention. Therefore, the weighted sum of all words can be obtained:
$$\begin{array}{c}{z}_{i}= \sum \limits _{j=1}^{d}{a}_{ij}\cdot {x}_{ij}^{w}\end{array}$$
(19)
Finally, the weighted lexical information is injected into the character vectors, followed by a dropout layer and a normalization layer:
$$\begin{array}{c}{{\varvec{H}}}_{{\varvec{w}}}={\text{L}}{\text{N}}\left({\text{DO}}\left({h}_{{\text{B}}}+z\right)\right)\end{array}$$
(20)
where \({h}_{{\text{B}}}\) is the set of \({h}_{i}^{c}\), \(z\) is the set of \({z}_{i}\), and \({{\varvec{H}}}_{{\varvec{w}}}\) is the final hidden output of LA.
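The LA computation in Eqs. (13)–(20) can be sketched as follows; padding to a fixed number of matched words per character and parameterizing \(\beta\) per word slot are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class LexiconAdapter(nn.Module):
    """A sketch of the Lexicon Adapter (Eqs. 13-20); dimensions and padding are assumptions."""

    def __init__(self, word_vocab_size, word_dim=200, bert_dim=768, max_words=3, p_drop=0.5):
        super().__init__()
        self.word_embed = nn.Embedding(word_vocab_size, word_dim, padding_idx=0)   # Eq. 13
        self.w1 = nn.Linear(word_dim, bert_dim)                 # non-linear alignment (Eq. 15)
        self.w2 = nn.Linear(bert_dim, bert_dim)
        self.beta = nn.Parameter(torch.zeros(max_words))        # adaptive importance (Eq. 16)
        self.w_att = nn.Linear(bert_dim, bert_dim, bias=False)  # char-to-word attention (Eq. 18)
        self.dropout = nn.Dropout(p_drop)
        self.norm = nn.LayerNorm(bert_dim)

    @staticmethod
    def positional_encoding(num_words, dim, device):
        # Sinusoidal encoding over the matched-word slots (Eq. 14); words are ordered by
        # length, so the slot index reflects word granularity.
        i = torch.arange(num_words, device=device, dtype=torch.float).unsqueeze(-1)
        j = torch.arange(dim, device=device, dtype=torch.float)
        angle = i / torch.pow(10000.0, 2.0 * torch.floor(j / 2) / dim)              # (W, dim)
        return torch.where(j.long() % 2 == 0, torch.sin(angle), torch.cos(angle))

    def forward(self, h_bert, word_ids):
        # h_bert: (B, L, D); word_ids: (B, L, W) words matched to each character (0 = padding)
        B, L, W = word_ids.shape
        h_w = self.word_embed(word_ids)                                   # (B, L, W, word_dim)
        pe = self.positional_encoding(W, h_w.size(-1), h_w.device)
        v = self.w2(torch.tanh(self.w1(h_w + pe)))                        # Eq. 15
        x = v * torch.sigmoid(self.beta).view(1, 1, W, 1)                 # Eqs. 16-17
        # Character-to-word attention selects the most relevant matched words (Eqs. 18-19).
        a = torch.softmax(torch.einsum('bld,blwd->blw', self.w_att(h_bert), x), dim=-1)
        z = torch.einsum('blw,blwd->bld', a, x)
        return self.norm(self.dropout(h_bert + z))                        # Eq. 20
```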

Adapter fusion

CA, EA, and LA process information at the character, entity, and lexical levels, respectively, providing the model with multi-granular language representations. To better integrate these different levels of semantic information, this paper proposes an adapter fusion method called MGA. MGA enhances the model's understanding and expression of semantic information at various levels. MGA integrates the interactions between character-, lexical-, and entity-level information, enabling the BERT model to understand text more comprehensively and improving the accuracy and robustness of Chinese NER. MGA is based on a deep neural network model, which includes a self-attention layer in addition to the three adapter components described above.
As shown in Fig. 6, to dynamically adjust the weight distribution of each adapter, a self-attention layer is designed. This self-attention layer learns three weight values, used to adjust the outputs of CA, EA, and LA. This allows the model to dynamically allocate adapter weights under different input conditions, adapting more effectively to specific scenarios.
First, the adapter fusion process is defined, where \({{\varvec{H}}}_{{\varvec{c}}}\), \({{\varvec{H}}}_{{\varvec{e}}}\), and \({{\varvec{H}}}_{{\varvec{w}}}\) denote the outputs of CA, EA, and LA, respectively. These outputs are concatenated into \({H}_{b}\) and then weighted using a self-attention mechanism:
$$\begin{array}{c}{H}_{b}=Concat\left({{\varvec{H}}}_{{\varvec{c}}},{{\varvec{H}}}_{{\varvec{e}}},{{\varvec{H}}}_{{\varvec{w}}}\right)\end{array}$$
(21)
$$\begin{array}{c}A={\text{Softmax}}\left({W}_{1}\cdot {H}_{b}\right)\end{array}$$
(22)
where \({W}_{1}\) is the weight matrix for self-attention, and \(A\) contains the weight allocations for CA, EA, and LA.
Next, the weight matrix \(A\) is used to perform a weighted fusion of the outputs from the three adapters, resulting in the fused output \({H}_{f}\):
$$\begin{array}{c}{H}_{f}={A}_{c}\cdot {H}_{c}+{A}_{e}\cdot {H}_{e}+{A}_{w}\cdot {H}_{w}\end{array}$$
(23)
Finally, the final hidden output \(\widetilde{{{\varvec{H}}}_{{\varvec{f}}}}\) of MGA is obtained through a fusion layer:
$$\begin{array}{c}\widetilde{{{\varvec{H}}}_{{\varvec{f}}}}=LN\left({\text{DO}}\left(ReLU\left({W}_{2}\cdot {H}_{f}+{b}_{1}\right)\right)\right)\end{array}$$
(24)
where \({W}_{2}\) is the weight matrix, and \({b}_{1}\) is the bias.
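The fusion step in Eqs. (21)–(24) can be sketched as follows; producing one scalar weight per adapter and per token through a single linear layer is an assumption about how the self-attention weighting is realized.

```python
import torch
import torch.nn as nn

class AdapterFusion(nn.Module):
    """A sketch of the MGA fusion step (Eqs. 21-24); the weighting layer is an assumption."""

    def __init__(self, bert_dim=768, p_drop=0.5):
        super().__init__()
        self.w1 = nn.Linear(3 * bert_dim, 3)        # one weight per adapter (Eq. 22)
        self.w2 = nn.Linear(bert_dim, bert_dim)     # fusion layer (Eq. 24)
        self.dropout = nn.Dropout(p_drop)
        self.norm = nn.LayerNorm(bert_dim)

    def forward(self, h_char, h_ent, h_lex):
        # h_char, h_ent, h_lex: (B, L, D) outputs of CA, EA, and LA
        h_b = torch.cat([h_char, h_ent, h_lex], dim=-1)                           # Eq. 21
        a = torch.softmax(self.w1(h_b), dim=-1)                                   # Eq. 22
        h_f = a[..., 0:1] * h_char + a[..., 1:2] * h_ent + a[..., 2:3] * h_lex    # Eq. 23
        return self.norm(self.dropout(torch.relu(self.w2(h_f))))                  # Eq. 24
```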
MGA is inserted between certain Transformer layers within BERT, injecting multi-granular information into the model. Denoting the hidden vector produced by the preceding layer of a given Transformer layer in BERT as \({H}^{l-1}\) and the intermediate variable as \(M\), the computation for each Transformer layer is as follows:
$$\begin{array}{c}M={\text{L}}{\text{N}}\left({H}^{l-1}+MHAttn\left({H}^{l-1}\right)\right)\end{array}$$
(25)
$$\begin{array}{c}{H}^{l}={\text{L}}{\text{N}}\left(M+FFN\left(M\right)\right)\end{array}$$
(26)
where \({H}^{l}\) represents the output of the \(l\)-th layer, \({H}^{l}=\{{h}_{1}^{l},{h}_{2}^{l},\dots ,{h}_{n}^{l}\}\), \(MHAttn\) is the multi-head attention mechanism, and \(FFN\) is a two-layer feedforward network with \(ReLU\) as the hidden activation function. A schematic diagram is illustrated in Fig. 7.
To achieve the deep integration of multigranular information with BERT, multigranular information is incorporated between the \(k\)th and \((k+1)\)th Transformers. MGA takes three inputs: character, entity, and lexical information. Character input obtains character-level features through CA, entity input acquires entity-level features through EA, and character input and lexical input obtain word-level features through LA. The features obtained from these three sources are then fused through MGA to generate multigranular semantic feature outputs, which can be expressed as:
$$\begin{array}{c}\widetilde{{{\varvec{H}}}^{{\varvec{k}}}}=MGA\left(CA\left({h}^{c}\right)+EA\left({h}^{e}\right)+LA\left({h}^{c},{h}^{w}\right)\right)\end{array}$$
(27)
where \({h}^{c}\), \({h}^{e}\) and \({h}^{w}\) represent the character, entity, and matched lexical inputs, respectively.
Since BERT has \(L=12\) Transformer layers, \(\widetilde{{{\varvec{H}}}^{{\varvec{k}}}}\) is input into the remaining \((L-k)\) Transformers. Eventually, the output of the \(L\)th Transformer for the Chinese NER task, denoted as \(\widetilde{{{\varvec{H}}}^{{\varvec{L}}}}\), serves as the input for the EGP, used in the decoding computation of the sequence modeling layer.
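To make the layer placement concrete, the following sketch injects the MGA after the \(k\)-th Transformer layer of a Hugging Face BERT model (Eqs. 25–27), reusing the adapter sketches above; the manual layer loop and the simplified attention mask are illustrative, not the released code.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class MGBertEncoder(nn.Module):
    """A sketch of inserting the MGA between the k-th and (k+1)-th BERT layers (Eqs. 25-27)."""

    def __init__(self, char_adapter, entity_adapter, lexicon_adapter, fusion,
                 k=1, bert_name="bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        self.ca, self.ea, self.la, self.fusion, self.k = (
            char_adapter, entity_adapter, lexicon_adapter, fusion, k)

    def forward(self, input_ids, attention_mask, char_ids, entity_mask, word_ids):
        h = self.bert.embeddings(input_ids=input_ids)
        # Additive attention mask in the shape the BERT layers expect: (B, 1, 1, L).
        ext_mask = (1.0 - attention_mask[:, None, None, :].float()) * -1e9
        for idx, layer in enumerate(self.bert.encoder.layer):
            h = layer(h, attention_mask=ext_mask)[0]                    # Eqs. 25-26
            if idx + 1 == self.k:                                       # inject MGA after layer k
                h = self.fusion(self.ca(char_ids, h), self.ea(h, entity_mask),
                                self.la(h, word_ids))                   # Eq. 27
        return h                                                        # fed to the EGP decoder
```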

Efficient global pointer

EGP is another key component of our model, contributing to the model's inference and classification at the sequence level. EGP is a span-based entity recognition method that identifies named entities by classifying subsequences of sentences. This approach is superior to sequence-labeling methods in preventing error propagation and can easily detect nested named entities since they belong to different subsequences. In addition, compared to general pointer networks, EGP incorporates global information about entities, avoiding inconsistencies between training and prediction.

Entity prediction

Due to the continuity of entities, for a sentence sequence of length \(n\), the maximum number of continuous subsequences in the sentence is \(n(n+1)/2\). These subsequences encompass all possible entities, resulting in a total of \(n(n+1)/2\) candidate entities for the sequence. The task of EGP is to select the actual entities from these candidates. For a sentence containing \(k\) entities and \(m\) entity types, the NER task can be modeled as selecting \(k\) entities from the \(n(n+1)/2\) candidate entities and performing an \(m\)-classification task for each entity.
As shown in Fig. 8, the Chinese sentence “张三来自北京” (Zhang San is from Beijing) can be transformed into a matrix with dimensions \(d=[{n}_{\text{head}}, n, n]\), where \({n}_{\text{head}}\) represents the number of entity type categories and \(n\) represents the length of the sentence. The entities “张三” (Zhang San) and “北京” (Beijing) are then extracted from the sentence and classified as a person name and a location name, respectively. For the EGP model, the entity information of the sentence can be represented by the two-dimensional matrix in Fig. 8, where the blue and yellow sections mark the entities in the sentence. The sentence contains two entity categories, corresponding to two heads, with each head representing one entity category. As the upper triangular part fully encodes the entity information, the lower triangular part of the matrix can be omitted. Therefore, for entities in the sentence, \(i\) and \(j\) represent the starting and ending positions, respectively. A coordinate \((i, j)\) with a value of 1 indicates the presence of an entity, while a value of 0 indicates its absence.
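As a concrete illustration of this span-matrix representation (a toy example; the assignment of entity types to heads is an assumption), the sentence in Fig. 8 can be encoded as:

```python
import torch

# "张三来自北京" (Zhang San is from Beijing): n = 6 characters, n_head = 2 entity types.
# Assume head 0 = person name and head 1 = location name.
n_head, n = 2, 6
labels = torch.zeros(n_head, n, n)
labels[0, 0, 1] = 1   # "张三" occupies the span (start=0, end=1) as a person name
labels[1, 4, 5] = 1   # "北京" occupies the span (start=4, end=5) as a location name
# Only the upper triangle (end >= start) is meaningful; the lower triangle is ignored.
```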
EGP can address character-level labeling errors and more accurately identify entity boundaries. For entity extraction tasks using EGP, the input of length \(n\) is encoded to obtain a vector sequence \(H=\left[{h}_{1},{h}_{2},\dots ,{h}_{n}\right]\), where \({h}_{i}\) and \({h}_{j}\) are the vectors corresponding to the \(i\)-th and \(j\)-th positions in \(H\). Then, \(H\) is transformed into two vectors \({q}_{i}^{t}\) and \({k}_{i}^{t}\) through a fully connected layer:
$$\begin{array}{c}{q}_{i}^{t}={W}_{q}^{t}\cdot {h}_{i}+{b}_{q}^{t}\end{array}$$
(28)
$$\begin{array}{c}{k}_{j}^{t}={W}_{k}^{t}\cdot {h}_{j}+{b}_{k}^{t}\end{array}$$
(29)
where \({W}_{q}^{t}\) and \({W}_{k}^{t}\) are weight matrices, and \({b}_{q}^{t}\) and \({b}_{k}^{t}\) are biases. \({q}_{i}^{t}\) and \({k}_{j}^{t}\) are vector representations used to identify the tags of entities belonging to type \(t\). Based on the above equations, the scoring formula for entities of type \(t\) can be defined as:
$$\begin{array}{c}{s}_{t}\left(i,j\right)={\left({q}_{i}^{t}\right)}^{T}\cdot {k}_{j}^{t}\end{array}$$
(30)
However, the scoring matrices \({s}_{t}\left(i,j\right)\) for different types \(t\) share much of their computation. To exploit this shared scoring computation across entity types, the NER task is split into two subtasks: extraction and classification. The former extracts subsequences as entities, while the latter identifies the type of each entity. In this way, the extraction step is equivalent to an NER task with only one entity type, and its scoring matrix can be expressed as \({q}_{i}^{T}{k}_{j}\) using the formula above. The classification step uses \({w}_{t}^{T}\left[{q}_{i};{k}_{i};{q}_{j};{k}_{j}\right]\), where \({w}_{t}\) represents the identifier for entity type \(t\), and \(\left[{q}_{i};{k}_{i};{q}_{j};{k}_{j}\right]\) is a span representation that further reduces parameters compared to \(\left[{h}_{i};{h}_{j}\right]\). Therefore, the new scoring function is as follows:
$$\begin{array}{c}{s}_{t}\left(i,j\right)={q}_{i}^{T}\cdot {k}_{j}+{w}_{t}^{T}\left[{q}_{i};{k}_{i};{q}_{j};{k}_{j}\right]\end{array}$$
(31)
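A minimal sketch of the EGP scoring head defined by Eqs. (28)–(31) is shown below (the rotary embedding of Eq. (32) is added in the next subsection); head dimensions and the explicit construction of the span representation are illustrative choices.

```python
import torch
import torch.nn as nn

class EfficientGlobalPointer(nn.Module):
    """A sketch of the EGP scoring head (Eqs. 28-31); sizes are illustrative assumptions."""

    def __init__(self, bert_dim=768, head_dim=64, num_types=4):
        super().__init__()
        self.qk = nn.Linear(bert_dim, 2 * head_dim)            # shared q/k projection (Eqs. 28-29)
        self.type_proj = nn.Linear(4 * head_dim, num_types)    # w_t^T [q_i; k_i; q_j; k_j]

    def forward(self, h):
        # h: (B, L, D) output of the encoder
        q, k = self.qk(h).chunk(2, dim=-1)                              # (B, L, d) each
        extract = torch.einsum('bid,bjd->bij', q, k)                    # type-agnostic extraction term
        qk = torch.cat([q, k], dim=-1)                                  # (B, L, 2d)
        L = h.size(1)
        pair = torch.cat([qk.unsqueeze(2).expand(-1, -1, L, -1),        # [q_i; k_i]
                          qk.unsqueeze(1).expand(-1, L, -1, -1)], -1)   # [q_j; k_j]
        classify = self.type_proj(pair).permute(0, 3, 1, 2)             # (B, T, L, L)
        return extract.unsqueeze(1) + classify                          # s_t(i, j), Eq. 31
```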

Relative position

Using only EGP is not sufficient for accurately identifying entities of different types because it does not consider the length and span of entities. For example, in the sentence “北京上海的气温下降” (The temperature in Beijing and Shanghai is dropping), it is desirable to recognize the two location entities “北京” (Beijing) and “上海” (Shanghai). However, the model may mistakenly consider “北京上海” as a single entity because “北” is the starting position of the entity “北京,” and "海" is the ending position of the entity “上海.” Without relative position information as input, EGP is not particularly sensitive to the length and span of entities, making it prone to predicting arbitrary combinations of the beginnings and ends of any two entities as targets.
To improve the accuracy of entity boundary recognition, incorporating relative position information into EGP is crucial, as this approach is more sensitive to the length and span of entities. Specifically, EGP uses Rotary Position Embedding (RoPE) [51] to encode relative position information. RoPE is a transformation matrix \({R}_{i}\) that satisfies the relationship \({R}_{i}^{T}{R}_{j}={R}_{j-i}\). \({R}_{i}\) can be applied to \(q\) and \(k\), explicitly injecting relative position information into the scoring function, resulting in more accurate entity representations. Therefore, the final scoring function for EGP is:
$$\begin{aligned}{s}_{t}\left(i,j\right)&={\left({R}_{i}{q}_{i}\right)}^{T}\left({R}_{j}{k}_{j}\right)+{w}_{t}^{T}\left[{q}_{i};{k}_{i};{q}_{j};{k}_{j}\right] \\ & ={q}_{i}^{T}{R}_{i}^{T}{R}_{j}{k}_{j}+{w}_{t}^{T}\left[{q}_{i};{k}_{i};{q}_{j};{k}_{j}\right] \\ & ={q}_{i}^{T}{R}_{j-i}{k}_{j}+{w}_{t}^{T}\left[{q}_{i};{k}_{i};{q}_{j};{k}_{j}\right]\end{aligned}$$
(32)
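A sketch of how RoPE can be applied to the query and key vectors before the extraction term of Eq. (32) follows; the pairwise rotation below is the standard RoPE construction and is assumed rather than taken from the released code.

```python
import torch

def apply_rope(x):
    """Rotate each (even, odd) feature pair of x (B, L, d) by a position-dependent angle,
    so that (R_i q_i)^T (R_j k_j) depends only on the relative position j - i (Eq. 32)."""
    B, L, d = x.shape
    pos = torch.arange(L, dtype=torch.float, device=x.device).unsqueeze(-1)           # (L, 1)
    freq = torch.pow(10000.0, -torch.arange(0, d, 2, dtype=torch.float, device=x.device) / d)
    angle = pos * freq                                                                 # (L, d/2)
    sin, cos = angle.sin(), angle.cos()
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

# In the scoring head, q and k are rotated before computing the extraction term:
# extract = torch.einsum('bid,bjd->bij', apply_rope(q), apply_rope(k))
```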

Training

NER is essentially a multi-label classification problem. After obtaining the scores for all candidate entities of each label category, for a sentence of length \(n\), multi-label classification is performed over the \(n(n+1)/2\) candidate spans for each label category. Since the number of entities present in each input text is usually far smaller than \(n(n+1)/2\), directly performing binary classification would lead to a severe class imbalance. Therefore, the problem is treated as pairwise comparisons between target and non-target category scores, and a cross-entropy formulation with self-balancing weights is used to mitigate label category imbalance. The loss function for the model training step can be designed as:
$$\begin{array}{c}Loss={\text{log}}\left(1+ \sum \limits_{\left(i,j\right)\in {P}_{t}}{e}^{-{s}_{t}\left(i,j\right)}\right)+{\text{log}}\left(1+ \sum \limits_{\left(i,j\right)\in {Q}_{t}}{e}^{{s}_{t}\left(i,j\right)}\right)\end{array}$$
(33)
where \(i\) and \(j\) represent the starting and ending indices of a span, \({P}_{t}\) is the positive sample set of starting and ending positions for all entities with type \(t\), \({Q}_{t}\) represents the negative sample set of starting and ending positions for all entities that are not of type \(t\) or are not entities at all, and \({s}_{t}\left(i,j\right)\) is the score for an entity of type \(t\) with span \(s[i:j]\). During the decoding phase, candidate entities that satisfy the condition \({s}_{t}\left(i,j\right)>0\) are considered as output entities of type \(t\).
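For clarity, Eq. (33) corresponds to the following loss computation over the score tensor; masking of padded positions and of the lower triangle is omitted for brevity.

```python
import torch

def global_pointer_loss(scores, labels):
    """A sketch of the loss in Eq. (33).
    scores, labels: (B, T, L, L); labels are 1 for gold spans of type t and 0 otherwise."""
    B, T, L, _ = scores.shape
    scores = scores.reshape(B * T, L * L)
    labels = labels.reshape(B * T, L * L).float()
    neg_inf = torch.full_like(scores, float('-inf'))
    pos = torch.where(labels > 0, -scores, neg_inf)      # positive spans contribute exp(-s)
    neg = torch.where(labels > 0, neg_inf, scores)       # negative spans contribute exp(+s)
    zero = torch.zeros_like(scores[:, :1])               # the "1 +" term inside each log
    loss = (torch.logsumexp(torch.cat([pos, zero], dim=-1), dim=-1)
            + torch.logsumexp(torch.cat([neg, zero], dim=-1), dim=-1))
    return loss.mean()

# At inference time, spans with s_t(i, j) > 0 are output as entities of type t.
```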

Experiments

In this section, a series of extensive experiments are conducted to investigate the performance of the proposed model. The experimental results are comprehensively compared and analyzed to gain insights into the model's performance in different scenarios. Furthermore, to highlight the key components of the proposed method, ablation experiments are performed.

Evaluation metrics

In the experiments, the standard precision (P), recall (R), and F1 score (F1) are employed as evaluation metrics, calculated as follows:
$$\begin{array}{c}P=\frac{{\text{TP}}}{{\text{TP}}+{\text{FP}}}\end{array}$$
(34)
$$\begin{array}{c}R=\frac{{\text{TP}}}{{\text{TP}}+{\text{FN}}}\end{array}$$
(35)
$$\begin{array}{c}{F}_{1}=\frac{2\times P\times R}{P+R}\times 100\%\end{array}$$
(36)
where TP, FP, and FN represent the numbers of true positives, false positives, and false negatives, respectively.
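Equations (34)–(36) correspond to the usual span-level evaluation, sketched below for completeness; spans are assumed to be (type, start, end) tuples.

```python
def span_prf(pred_spans, gold_spans):
    """Precision, recall, and F1 over sets of (type, start, end) spans (Eqs. 34-36)."""
    tp = len(pred_spans & gold_spans)          # true positives
    fp = len(pred_spans - gold_spans)          # false positives
    fn = len(gold_spans - pred_spans)          # false negatives
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```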

Datasets

The proposed model is extensively evaluated in this paper, covering datasets from multiple domains to comprehensively examine its performance. Specifically, experiments were conducted on Weibo [52, 53], OntoNotes [54], Resume [14], and MSRA [55].
  • Weibo NER: Belonging to the social domain, this dataset follows the same training, development, and testing splits as Peng and Dredze [52]. This dataset, characterized by social media text, poses challenges for NER models in handling informal language, abbreviations, and internet slang.
  • OntoNotes and MSRA: Belonging to the news domain. For OntoNotes, we adopted the same data split as Che et al. [56]. Since the MSRA dataset lacks a dedicated development set, we randomly selected 10% of samples from the training set as the development set. This design allows us to thoroughly assess the model's NER performance in news domain text.
  • Resume NER: This dataset consists of resumes of senior executives and belongs to a specific domain. In this task, we focus on evaluating the model's performance in handling professional terms and domain-specific information.
Table 1 provides a detailed overview of each dataset, including the sizes of the training, development, and test sets, as well as the domains to which the datasets belong. This dataset configuration is designed to comprehensively validate the robustness and applicability of the proposed method across different domains and contexts.
Table 1
Details of the datasets
Dataset   | Domain       | Type     | Train   | Dev    | Test
Weibo     | Social media | Sentence | 1.4k    | 0.27k  | 0.27k
          |              | Char     | 73.8k   | 14.5k  | 14.8k
Resume    | Resume       | Sentence | 3.8k    | 0.46k  | 0.48k
          |              | Char     | 124.1k  | 13.9k  | 15.1k
Ontonotes | News         | Sentence | 15.7k   | 4.3k   | 4.3k
          |              | Char     | 491.9k  | 200.5k | 208.1k
MSRA      | News         | Sentence | 46.4k   | –      | 4.4k
          |              | Char     | 2169.9k | –      | 172.6k
The datasets adopt the BIO (Begin, Inside, Outside) labeling scheme to annotate named entities. Each word is marked as the beginning (B), inside (I), or outside (O) of an entity. Weibo and Ontonotes datasets include four entity categories: Personal Name (PER), Organization (ORG), Location (LOC), and Geo-Political Entity (GPE). The Resume dataset encompasses eight entity types: Personal Name (NAME), Organization (ORG), Location (LOC), Title (TITLE), Education Organization (EDU), Country (CONT), Profession (PRO), and Race (RACE). The MSRA dataset comprises three entity types: Personal Name (PER), Organization (ORG), and Location (LOC).

Experimental settings

All experiments were conducted on an NVIDIA RTX 2080 Ti GPU with 11 GB of VRAM. The model utilizes Chinese BERT-base as the Encoder, leveraging its pre-trained weights with a weight decay set to 0.01. The Transformer model consists of 12 layers, and the hidden layer dimension is set to 768. During the training phase, AdamW was employed as the optimization function with a decay rate of 1e−8 and an initial learning rate of 1.8e−5, following the original parameters of BERT. The batch size parameter was set to 32. For the OntoNotes and MSRA datasets, training was conducted for 10 epochs, while for the Resume and Weibo datasets, training continued for 30 epochs. A 50% dropout rate was applied to prevent overfitting. Considering the varying sentence lengths in different datasets, the maximum sentence length was set to 150. In addition, LA used the dictionary from Song et al. [57], with a maximum of 3 words merged for each Chinese character. The specific parameter settings are detailed in Table 2.
Table 2
Experimental parameters of the model
Parameter              | Value
Transformer layer      | 12
Hidden layer dimension | 768
Weight decay           | 0.01
Optimizer              | AdamW
Optimizer decay        | 1e−8
Learning rate          | 1.8e−5
Training batch size    | 32
Epoch number           | 10, 30
Dropout rate           | 0.5
Max sentence length    | 150
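For reference, the settings above can be gathered into one configuration object; the dictionary below simply restates the values from Table 2 and the surrounding text (the encoder name "bert-base-chinese" is an assumption about the checkpoint used).

```python
# Restatement of the experimental settings (Table 2); values only, not released training code.
CONFIG = {
    "encoder": "bert-base-chinese",        # assumed Chinese BERT-base checkpoint
    "transformer_layers": 12,
    "hidden_dim": 768,
    "weight_decay": 0.01,
    "optimizer": "AdamW",
    "optimizer_decay": 1e-8,
    "learning_rate": 1.8e-5,
    "batch_size": 32,
    "epochs": {"Ontonotes": 10, "MSRA": 10, "Resume": 30, "Weibo": 30},
    "dropout": 0.5,
    "max_sentence_length": 150,
    "max_matched_words": 3,                # lexicon words merged per character
}
```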
Between the first and second Transformers of the BERT model, MGA was introduced, and EGP was incorporated at the output layer of BERT. Fine-tuning was applied to both BERT and the pointer network during training. To comprehensively evaluate the effectiveness of the proposed MGBERT-Pointer model, models incorporating vocabulary information and employing different deep learning network structures were chosen as baselines. In addition, results were compared with other state-of-the-art (SOTA) models.
  • Lattice LSTM [14]: Lattice LSTM is a model that utilizes a RNN structure, particularly suitable for processing sequential data. It introduces a lattice structure in NER tasks to better capture the relationships between characters and words, especially advantageous in handling Chinese NER.
  • LR-CNN [12]: LR-CNN adopts a CNN structure, capturing local features through convolutional operations. It achieves good results in certain contexts for NER tasks. LR-CNN extracts features through a local perception field, making it suitable for tasks where there is strong local correlation in annotated information.
  • LGN [26]: LGN utilizes a GNN structure, modeling the NER task as a graph node classification problem. By considering the relationships between entities, GNN can better capture contextual information between entities, particularly advantageous in handling complex relationships such as nested entities.
  • FLAT [30]: FLAT adopts the Transformer structure, which is a model based on self-attention mechanisms. It effectively captures long-distance dependency relationships and is suitable for handling NER tasks with uncertain entity boundaries and complex contexts.
  • LEBERT [15]: LEBERT is an improvement based on the BERT model, introducing external lexical information through BERT Adapter to enhance the model's learning ability for entity boundaries.

Results and analysis

Table 3 provides a detailed presentation of the results from a series of experiments conducted on Chinese NER datasets. Compared to the traditional BiLSTM model, other models have incorporated external information, especially lexical information. Due to this integration, the performance of other models on all datasets surpasses that of BiLSTM model, highlighting the enhanced performance of Chinese NER through the incorporation of lexical information.
Table 3
Experimental results on Chinese NER
Model                      | Weibo | Resume | Ontonotes | MSRA
BiLSTM [10]                | 56.75 | 94.41  | 71.84     | 91.87
Lattice-LSTM               | 58.79 | 94.46  | 73.88     | 93.18
WC-LSTM [31]               | 49.86 | 94.96  | 74.43     | 93.36
LR-CNN                     | 59.92 | 95.11  | 74.45     | 93.71
CGN [58]                   | 63.09 | 94.12  | 74.79     | 93.47
LGN                        | 60.15 | 95.41  | 74.85     | 93.63
FLAT                       | 60.32 | 95.45  | 76.45     | 94.12
BERT                       | 67.27 | 95.33  | 79.93     | 94.71
ERNIE [59]                 | 67.96 | 94.82  | 77.65     | 95.08
ZEN [60]                   | 66.71 | 95.40  | 79.03     | 95.20
LEBERT                     | 70.75 | 96.08  | 82.08     | 95.70
Our model (MGBERT-Pointer) | 73.22 | 96.23  | 83.25     | 96.10
For Chinese NER enhanced by lexicon, one can observe the structural evolution at the network encoding layer. Starting from the initial use of LSTM (including Lattice-LSTM and WC-LSTM) to CNN (such as LR-CNN), GNN (including CGN and LGN), and ultimately to the Transformer-based FLAT structure, it is evident that the model structure continuously improves, leading to advancements in Chinese NER performance. Particularly noteworthy is the significant performance improvement achieved by Transformer-based BERT models in their variant versions (ERNIE, ZEN, and LEBERT).
Different from the models mentioned above, which only shallowly incorporate lexicon information, and distinct from ERNIE and ZEN, two models that utilize lexicon guidance, LEBERT model achieves leading performance across various domains in Chinese NER by deeply integrating lexicon information at the model’s lower layers. Drawing inspiration from the usage of BERT Adapter in LEBERT, our model incorporates MGA at the semantic encoding layer to integrate information at different granularities. Furthermore, at the decoding layer, we innovatively introduce an EGP network. The performance of our model on datasets from different domains surpasses that of previous models.
Table 3 shows that MGBERT-Pointer demonstrates varying degrees of advantage across different domains in the Chinese NER task. Notably, the performance improvement on the small-sample dataset (Weibo) is significantly higher than on the other datasets, whereas on the Resume dataset, despite also being a small sample, the improvement is not as pronounced. This could be because the model performs exceptionally well on small-sample datasets, but when NER performance on a dataset approaches saturation, the room for improvement becomes relatively limited. This viewpoint is supported by the model's performance on the Ontonotes dataset; both Ontonotes and MSRA are relatively large datasets, yet the performance improvement of MGBERT-Pointer is more significant on Ontonotes. It is worth noting that the Weibo dataset belongs to the social domain and exhibits higher semantic complexity than datasets from other domains. Therefore, our model excels in handling such small-sample and semantically complex contexts.
However, the reasons behind the performance improvement of MGBERT-Pointer are still unclear—whether it is the effect provided separately by the MGA, the EGP, or the combined effect of both. Therefore, in the ablation experiments, we will carefully investigate the individual and joint effects of MGA and EGP on performance.

Ablation experiments

The results of the ablation experiments illustrate the impact of different components on the performance of MGBERT-Pointer model in Chinese NER tasks. Table 4 presents the F1 scores from the ablation experiments on four public datasets. It can be observed that removing CA leads to a slight performance decrease, especially on the Weibo and Ontonotes datasets. This suggests that CA plays a role in understanding local features of characters and global contextual information, particularly when dealing with social media data. Removing EA has a relatively minor impact on performance, with a slight decrease across datasets. The presence of EA may contribute positively to a better understanding and recognition of ambiguous entities. Removing both CA and EA results in a slight decrease in performance, particularly on the Weibo dataset. This indicates that their combination enhances the model’s comprehensive understanding of character and entity information to some extent. Removing LA significantly affects performance, especially on the Weibo and Ontonotes datasets. This implies that LA plays a crucial role in introducing external knowledge base information, particularly in scenarios dealing with high semantic complexity. Removing MGA has the most significant impact on performance, leading to a substantial decrease in F1 scores. This highlights the critical role of MGA in the model, especially when handling different levels of semantic information, providing a powerful semantic modeling tool.
Table 4
F1 scores (%) of ablation studies on four datasets
Setting         | Weibo          | Resume         | Ontonotes      | MSRA
MGBERT-Pointer  | 73.22          | 96.23          | 83.25          | 96.10
w/o CA          | 72.65 (− 0.57) | 96.21 (− 0.02) | 83.15 (− 0.10) | 96.09 (− 0.01)
w/o EA          | 72.85 (− 0.37) | 96.19 (− 0.04) | 82.91 (− 0.34) | 96.01 (− 0.09)
w/o CA & EA     | 72.15 (− 1.07) | 96.17 (− 0.06) | 82.80 (− 0.45) | 96.00 (− 0.10)
w/o LA          | 71.72 (− 1.50) | 96.15 (− 0.08) | 82.45 (− 0.80) | 95.97 (− 0.13)
w/o MGA         | 71.01 (− 2.21) | 96.10 (− 0.13) | 82.11 (− 1.14) | 95.75 (− 0.35)
w/o EGP         | 71.80 (− 1.42) | 96.09 (− 0.14) | 82.51 (− 0.74) | 95.85 (− 0.25)
w/o RoPE        | 72.25 (− 0.97) | 96.11 (− 0.12) | 82.91 (− 0.34) | 95.89 (− 0.21)
w/o MGA and EGP | 67.27 (− 5.95) | 95.33 (− 0.90) | 79.93 (− 3.32) | 94.71 (− 1.39)

w/o: absence of the corresponding component; CA: character adapter; EA: entity adapter; LA: lexicon adapter; MGA: multi-granularity adapter; EGP: efficient global pointer; RoPE: rotary position embedding
The removal of EGP has a notable impact on performance, resulting in a decrease in F1 scores. This suggests that EGP plays an important role in addressing the lack of boundary position information in traditional attention mechanisms, crucial for improving the model's understanding and identification capabilities of nested entity structures. Removing RoPE has a certain negative impact on performance, indicating that RoPE plays a positive role in handling hierarchical relationships between entities, contributing to the enhancement of model performance.
Furthermore, the simultaneous removal of MGA and EGP leads to a substantial decrease in NER performance across all datasets, with a decline exceeding the combinations where other components are removed. This indicates that the combination of MGA and EGP can enhance the performance of Chinese NER. MGA and EGP are the main contributing factors to the performance improvement of the MGBERT-Pointer model. In addition, from the table, it can be observed that our model exhibits a certain advantage in handling small-sample, semantically complex texts like Weibo.
To gain a deeper understanding of the practical impact of MGA and EGP, a case study was conducted, and the results are presented in Table 5. In the absence of MGA, the model made errors in recognizing named entities, incorrectly identifying "阿拉善盟" as an organizational entity and, as a result, treating “内蒙古” and “阿拉善盟” as separate entities. Without EGP, the model’s capability to handle the nested entity “阿拉善盟” declined, misidentifying it as an independent entity and causing boundary errors. When both MGA and EGP were absent, the model struggled significantly in recognizing the entity “内蒙古阿拉善盟,” almost entirely failing to annotate it correctly. MGBERT-Pointer, by integrating MGA and EGP, successfully maintains efficient processing of complex contexts and nested entities compared to other configurations.
Table 5
A case study
Sentence             内蒙古阿拉善盟 (Alxa League, Inner Mongolia)
Characters           内      蒙      古      阿      拉      善      盟
Gold labels          B-GPE   I-GPE   I-GPE   I-GPE   I-GPE   I-GPE   I-GPE
w/o MGA              B-GPE   I-GPE   I-GPE   B-ORG   I-ORG   I-ORG   I-ORG
w/o EGP              B-GPE   I-GPE   I-GPE   B-GPE   I-GPE   I-GPE   I-GPE
w/o MGA and EGP      B-GPE   I-GPE   I-GPE   O       O       O       O
MGBERT-Pointer       B-GPE   I-GPE   I-GPE   I-GPE   I-GPE   I-GPE   I-GPE
B, beginning of an entity; I, inside (continuation) of an entity; O, outside any entity; GPE, geopolitical entity; ORG, organization entity
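To connect the span-level output of a pointer-style decoder with the character-level tags shown in Table 5, the sketch below converts a predicted (start, end, type) span into BIO labels. The inclusive-index convention and the helper name spans_to_bio are illustrative assumptions; nested entities would require one tag sequence per entity or per type rather than a single flat sequence.

```python
# Illustrative sketch: turning span predictions from a pointer-style decoder
# into the character-level BIO tags used in Table 5.
def spans_to_bio(text, spans):
    """spans: list of (start, end, entity_type) with inclusive character indices."""
    tags = ["O"] * len(text)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end + 1):
            tags[i] = f"I-{etype}"
    return tags

sentence = "内蒙古阿拉善盟"
print(spans_to_bio(sentence, [(0, 6, "GPE")]))
# ['B-GPE', 'I-GPE', 'I-GPE', 'I-GPE', 'I-GPE', 'I-GPE', 'I-GPE']
```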

Discussion

As shown in Table 6, MGA was introduced at different layers of the Transformer on the Weibo dataset to study the impact of adapter placement on semantic representation. Better performance was obtained when MGA was introduced at the shallow layers, possibly because shallow layers allow a closer hierarchical interaction between the multi-granularity information and the BERT base, which helps fuse semantic information at different levels. However, when adapters were applied at multiple layers simultaneously, the F1 score consistently decreased, indicating that their joint action brought no performance gain and may instead have introduced information redundancy or conflicts, possibly leading to overfitting. Therefore, placing the MGA at a single shallow Transformer layer achieves better performance while avoiding overfitting (a minimal sketch of such single-layer injection is given after Table 6).
Table 6
F1 scores (%) of MGAs at different layers of transformer
Single layer       1        3        5        7        9        12
F1 (%)             73.22    73.15    72.90    72.75    72.60    72.40
Multiple layers    1, 3     1, 3, 5  1, 3, 5, 7  All
F1 (%)             72.80    72.05    71.40       69.21
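The following is a minimal sketch, assuming a PyTorch setup and a Hugging Face-style BERT whose encoder exposes per-layer modules, of how a single adapter can be attached after one shallow Transformer layer (the "layer 1" setting in Table 6). The class names, bottleneck size, and the extra-feature hook are illustrative and greatly simplified relative to the full MGA.

```python
# Sketch only: attach one adapter after a shallow encoder layer.
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Generic down-project / up-project adapter with a residual connection."""
    def __init__(self, hidden=768, bottleneck=128):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)
        self.up = nn.Linear(bottleneck, hidden)
        self.act = nn.GELU()

    def forward(self, h, extra=None):
        # `extra` stands in for fused character/entity/lexicon features;
        # here it is simply added before the bottleneck, for illustration.
        x = h if extra is None else h + extra
        return h + self.up(self.act(self.down(x)))

class AdapterAfterLayer(nn.Module):
    """Wrap one encoder layer so the adapter runs on its output tuple."""
    def __init__(self, layer, adapter):
        super().__init__()
        self.layer, self.adapter = layer, adapter

    def forward(self, hidden_states, *args, **kwargs):
        outputs = self.layer(hidden_states, *args, **kwargs)
        return (self.adapter(outputs[0]),) + outputs[1:]

# Usage sketch (assumes the Hugging Face `transformers` library):
# from transformers import BertModel
# bert = BertModel.from_pretrained("bert-base-chinese")
# bert.encoder.layer[0] = AdapterAfterLayer(bert.encoder.layer[0], BottleneckAdapter())
```

Injecting at a single shallow layer, as above, keeps the adapter close to the embedding-level representations while leaving the deeper layers of the pretrained encoder untouched.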
In recent NER research, the Global Pointer (GP) is commonly used in experiments. To validate the effectiveness of EGP, the two were compared on the Weibo dataset in terms of precision, recall, F1 score, and training time, as shown in Table 7. EGP has slightly lower recall than GP but outperforms it in precision, F1 score, and training speed. The large number of parameters in GP may lead to relatively sparse updates for each parameter, increasing the risk of overfitting. In contrast, EGP improves both performance and computational speed by reducing the number of parameters, making it a more effective choice for NER tasks (a back-of-the-envelope comparison of the parameter counts is sketched after Table 7).
Table 7
Experimental results comparing efficient global pointer with global pointer on different metrics
Model                       P (%)    R (%)    F1 (%)   Time (s)
Global pointer              70.74    75.53    72.95    641
Efficient global pointer    73.00    73.81    73.22    599
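As a rough, hedged illustration of why reducing parameters helps, the arithmetic below contrasts the parameter count of a per-type global pointer with that of an efficient variant that shares the start/end projections across entity types. The hidden size, head dimension, and number of entity types are illustrative values, and the decomposition follows the commonly used efficient global pointer formulation rather than a claim about the authors' exact configuration.

```python
# Back-of-the-envelope sketch: per-type projections vs. shared projections.
hidden, head_dim, num_types = 768, 64, 4  # illustrative values, not the paper's

# Global pointer: separate start/end projections for every entity type.
gp_params = num_types * 2 * hidden * head_dim

# Efficient global pointer: one shared start/end projection plus a small
# per-type scoring vector over the concatenated boundary features.
egp_params = 2 * hidden * head_dim + num_types * (2 * head_dim + 1)

print(gp_params, egp_params)  # 393216 vs 98820 with the values above
```

Under these assumptions the per-type head dominates the GP parameter budget, which is consistent with the observation that the efficient variant trains faster and is less prone to sparse, overfitting-prone parameter updates.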
The lexicon size and the maximum number of words matched by the adapter significantly affect model performance. As shown in Table 8, a larger lexicon effectively improves Chinese NER performance on the Weibo dataset, mainly because of its higher coverage: when the lexicon contains many entities relevant to the task, performance tends to improve accordingly. For the maximum number of matched words, more is not necessarily better. Moderately raising this limit enhances performance, but raising it too far can hurt performance, possibly because too many newly added words cause overfitting, or because high similarity between newly matched words and existing vocabulary makes the learned features redundant. Therefore, a larger lexicon and an appropriate maximum matching word count both play a positive role in improving Chinese NER systems, but lexicon size and matching strategy must be chosen carefully to balance performance against computational cost (a toy matching sketch follows Table 8).
Table 8
The Impact of lexicon size and maximum matching word on model performance
Lexicon size    Maximum matching word    F1 (%)
2,000,000       1                        71.80
2,000,000       3                        72.41
2,000,000       5                        72.04
12,287,936      1                        71.83
12,287,936      3                        73.22
12,287,936      5                        72.95
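To illustrate what the "maximum matching word" hyperparameter in Table 8 controls, the toy sketch below collects, for each character position, at most max_words lexicon words that cover it. The tiny lexicon, the cap of three, and the maximum word length are illustrative assumptions, not the actual resource used in the experiments.

```python
# Illustrative sketch: for each character position, keep at most `max_words`
# lexicon words covering it (the "maximum matching word" setting in Table 8).
def match_lexicon(sentence, lexicon, max_words=3, max_len=8):
    matches = [[] for _ in sentence]
    for i in range(len(sentence)):
        for j in range(i + 1, min(len(sentence), i + max_len) + 1):
            word = sentence[i:j]
            if word in lexicon:
                for pos in range(i, j):
                    if len(matches[pos]) < max_words:
                        matches[pos].append(word)
    return matches

toy_lexicon = {"内蒙古", "阿拉善", "阿拉善盟", "内蒙古阿拉善盟"}
for ch, words in zip("内蒙古阿拉善盟", match_lexicon("内蒙古阿拉善盟", toy_lexicon)):
    print(ch, words)
```

A larger lexicon raises the chance that task-relevant words are found at all, while the cap decides how many of those matches are fed to the adapter per character; the sweet spot observed in Table 8 (cap of 3) is where added coverage has not yet turned into redundant or noisy features.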
The MGBERT-Pointer model is evidently more complex than conventional models, mainly because of its adapter network, which must provide multi-granularity information, a pointer network based on RoPE, and entity positional information. As shown in Table 9, the model's parameter size grows significantly with dataset size, especially in comparison with a model that includes only the LA. This is because MGBERT-Pointer must process more intricate semantic information, including character, lexicon, and entity details, in order to learn richer features. The training time likewise reflects the model's complexity: MGBERT-Pointer needs more time to process its larger number of parameters. It is noteworthy that some zero vectors and zero matrices were observed during training, which may indicate redundant information in the model or features that were not fully learned. Overall, more complex models generally achieve better performance, but performance gains must be carefully balanced against computational cost.
Table 9
Parameter size and training time of model training on each dataset
Dataset      Dataset size (MB)    Model              Parameter size (MB)    Time (min)
Weibo        1.86                 BERT + LA + CRF    181                    8
                                  MGBERT-Pointer     19,520                 10
Resume       3.6                  BERT + LA + CRF    589                    25
                                  MGBERT-Pointer     32,768                 32
Ontonotes    17                   BERT + LA + CRF    971                    61
                                  MGBERT-Pointer     82,944                 99
MSRA         43                   BERT + LA + CRF    1269                   112
                                  MGBERT-Pointer     131,072                265

Conclusion

This study successfully constructed the MGBERT-Pointer model, aiming to address the challenges in the Chinese NER field, including complex contextual variations, ambiguity, and nested entities. The introduction of MGA and EGP resulted in significant performance improvements.
The design of MGA, encompassing CA, EA, and LA, stands out as a key innovation. These adapters respectively incorporate character, entity, and lexical information, synergistically forming multigranular information that effectively enhances the model’s understanding across different contextual layers. The EGP network, utilizing a unique strategy of global normalization and employing Rotary Position Embedding, addresses the issue of insufficient boundary information in traditional attention mechanisms. This enhancement ensures the model’s precision and reliability in handling nested entity structures.
Experimental results demonstrate the outstanding performance of the MGBERT-Pointer model across multiple public datasets. However, the model has shortcomings: it is relatively complex and carries redundant information, and it is tailored to the current scenarios and requirements, so it may no longer be applicable once these change. In subsequent work, we therefore plan to employ optimization techniques to eliminate task-irrelevant redundant features and to introduce more discriminative features. We also plan to develop a monitoring system that tracks the model's performance in real time and triggers further adjustments as needed.
Furthermore, to capture information more comprehensively and enhance the capability to handle complex tasks, we have chosen multimodality as a future research direction. We plan to extend the multi-granularity information to multimodal information by introducing multimodal adapters such as Pinyin adapters and glyph adapters, and to explore leveraging large-scale models (such as GPT) to further improve the performance of Chinese NER.

Acknowledgements

This work is supported by the National Key Research and Development Program “Industrial Software” Key Special Project (2022YFB3305602), Social Science Planning Foundation of Beijing (20GLC059), Humanities and Social Sciences Planning Fund of the Ministry of Education (22YJA630111, 22YJAZH110), Shandong Provincial Key Research and Development Program (Major Science and Technology Demonstration Project) (2021SFGC0102, 2020CXGC010110).

Declarations

Conflict of interest

The authors declare that they have no conflicts of interest or competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
