
Open Access 02.06.2022 | Original Article

AttenSy-SNER: software knowledge entity extraction with syntactic features and semantic augmentation information

Authors: Mingjing Tang, Tong Li, Wei Gao, Yu Xia

Published in: Complex & Intelligent Systems | Issue 1/2023


Abstract

Software knowledge communities contain large-scale software knowledge entity information with complex structure and rich semantic correlations. Recognizing and extracting software knowledge entities from these communities is significant, as it underpins entity-centric tasks such as software knowledge graph construction, software document generation and expert recommendation. Since the texts of the software knowledge community are unstructured, user-generated content, traditional entity extraction methods are difficult to apply in this domain due to entity variation, entity sparsity, entity ambiguity, out-of-vocabulary (OOV) words and the lack of annotated data sets. This paper proposes a novel software knowledge entity extraction model, named AttenSy-SNER, which integrates syntactic features and semantic augmentation information to extract fine-grained software knowledge entities from unstructured user-generated content. The input representation layer uses the Bidirectional Encoder Representations from Transformers (BERT) model to extract the feature representation of the input sequence. The contextual encoding layer leverages a Bidirectional Long Short-Term Memory (BiLSTM) network and a Graph Convolutional Network (GCN) to capture contextual information and syntactic dependency information, and introduces a semantic augmentation strategy based on an attention mechanism to enrich the semantic feature representation of sequences. The tag decoding layer uses Conditional Random Fields (CRF) to model the dependencies between output tags and obtain the globally optimal label sequence. Model comparison experiments show that the proposed model outperforms the benchmark models in the software engineering domain.

Introduction

As a mainstream software knowledge community, StackOverflow serves as a platform for exchanging and sharing software knowledge related to software programming, configuration management and project organization, and has gradually developed into an important knowledge base of the software domain for software developers [1]. Software developers can search the platform for information on specific software knowledge entities (such as libraries, APIs and Bug exceptions) and engage in question, answer and comment threads according to their individual requirements, which helps them understand software-specific entities and solve problems encountered during development.
So far, about 21 million software-related questions have been posted on StackOverflow, and these Q&A texts contain numerous specific software knowledge entities, complex structure and rich semantic associations. Traditional text processing technologies based on keywords and topic models treat software knowledge entities as plain text, neglect the domain features of software knowledge community texts, and cannot satisfy software developers' requirements for acquiring intensive software knowledge. Acquiring software knowledge accurately and efficiently from the software knowledge community is therefore an imperative challenge in software knowledge management.
As a method of knowledge representation, a knowledge graph is often used to model entities, concepts and semantic relationships; it enhances the expression of knowledge organization structure and enables users to process information quickly, accurately and intelligently [2]. Extracting software knowledge entities and the semantic relationships between them from user-generated texts in the software knowledge community, and then constructing a software knowledge graph, will promote entity-centered applications such as intelligent question answering, software document generation, expert recommendation and software reuse.
Entity extraction aims to recognize mentions of rigid designators in text belonging to pre-defined semantic types such as person, location and organization, and plays an essential role in knowledge graph construction and natural language understanding [3]. In this paper, software knowledge entity extraction refers to recognizing and extracting software-specific entities (such as programming languages, software development libraries and software projects) from massive unstructured texts in the knowledge community and classifying them into pre-defined categories. Because of error propagation, the accuracy and efficiency of software knowledge entity extraction are crucial for subsequent software entity relationship extraction, software knowledge graph construction and application.
The user-generated content in the software knowledge community StackOverflow is unstructured short text with the following common characteristics:
1.
Lack of unified programming language specifications and strict spelling rules produces many spelling mistakes and abbreviations, resulting in the problem of name variation. For example, the software knowledge entity “JavaScript” has multiple variants generated by abbreviation and casing: “JS”, “javascript”.
 
2.
Many software-specific entity names are common words, resulting in the entity sparsity problem.
 
3.
The same software knowledge entity can belong to different entity types in different contexts, resulting in entity ambiguity. For example, the software knowledge entity “Mac” can be labeled either “PlatCOS” (operating system) or “SLMDL” (mobile development library).
 
4.
Some rare, distinctive software knowledge entities are unrecognizable OOV words.
 
Compared with finance, law, biomedicine and other domains, entity extraction research in the software engineering domain lacks corresponding resources and technologies, and faces challenges such as entity variation, entity sparsity, entity ambiguity, numerous OOV words and the lack of annotated data sets.
In view of the above challenges, and considering word, syntactic, entity context and semantic features together, we propose a novel software knowledge entity extraction model that integrates syntactic features and semantic enhancement information. Unlike current entity extraction methods in the general domain, this model improves the input feature representation and context encoding based on the BERT model, BiLSTM and GCN. The main contributions of this paper are:
1.
According to the domain features of software knowledge community text, a novel software knowledge entity extraction model is proposed that integrates syntactic features and semantic enhancement information. The model applies a GCN to encode the syntactic dependencies of sequences, and a semantic enhancement strategy based on the attention mechanism is proposed to enhance the semantic representation of words in the sequence. The model outperforms the mainstream NER models used in the software engineering domain.
 
2.
To enrich the word feature representation in the software engineering domain, the BERT model is used for unsupervised training on 2,656,719 questions, 5,526,559 answers and 43,594,128 sentence sequences from StackOverflow, yielding pre-trained word vector representations for the software engineering domain.
 
3.
An annotated data set for the software engineering domain covering 6371 sentences, 129,135 tokens and 40 entity categories is constructed from StackOverflow posts to overcome the lack of labeled corpora for training NER models in the software engineering domain.
 
Related work

Entity extraction is a classic sequence labeling task, which can be formally defined as follows: given an input sequence \(X = \left\{ {x_{1} ,x_{2} , \ldots ,x_{n} } \right\}\), the entity extraction model outputs a list of triples \(\left\{ {Y_{{\text{s}}} ,Y_{{\text{e}}} ,t} \right\}\). Each triple in the list represents a named entity of the input sequence X, with Ys the starting index of the entity, Ye the ending index, and t the pre-defined type of the entity. For example, for the input “My PHP Version is 7.1”, an extractor should output a single triple marking the span of “PHP” with a programming language type.
At present, entity extraction models and methods fall into three categories: rule-based and dictionary-based methods, machine learning-based methods and deep learning-based methods [4]. Rule-based and dictionary-based methods rely on manual feature selection and domain dictionaries [5, 6]; they usually achieve high Precision (P) but low Recall (R) and migrate poorly across domains. With advances in machine learning, Hidden Markov Models (HMM) [7], Maximum Entropy Models (MEM) [8], Support Vector Machines (SVM) [9] and Conditional Random Fields (CRF) [10] have been widely used in entity extraction tasks. Compared with rule-based and dictionary-based methods, these methods are more adaptable and do not require extra linguistic knowledge, but they depend on feature engineering and large amounts of annotated data: the quality of feature extraction and the size of the annotated data determine the generalization ability of the algorithm, and both are difficult to obtain in practice. Recently, deep learning models such as Convolutional Neural Networks (CNN), Long Short-Term Memory networks (LSTM) and GCNs have been widely used in entity extraction and have achieved promising performance [11, 12]. Deep learning-based methods learn features from text automatically in an end-to-end fashion without additional feature engineering, so they have been applied to entity extraction tasks in various domains.
For entity extraction in the software knowledge community domain, Ye et al. [13] divided software knowledge entities into five categories: programming language, platform, API, tool-library-framework and software standard, and proposed a software-specific named entity recognition method for software engineering social content based on semi-supervised learning; the model achieved good performance through orthographic features, lexical and contextual features, word bitstring features and gazetteer features. Zhao et al. [14] proposed HDSKG, a relational triple extraction framework for the software engineering domain that incorporates a dependency parser with a rule-based method; an SVM classifier evaluates the domain relevance of candidate relational triples. Combining text, corpus, concept and source features, they constructed a software engineering knowledge graph with 35,279 relational triples, 44,800 concepts and 9660 unique verb phrases. Addressing the fact that HDSKG does not fully consider the features of entity concepts and term phrases, Guo et al. [15] proposed an extraction strategy for software engineering Wiki pages: a domain dictionary is first built from page titles, rules are then designed according to the characteristics of concepts in the software engineering domain, and finally the domain dictionary is used to improve the precision of entity recognition. These earlier studies of information extraction in the software domain belong to rule-based and dictionary-based or machine learning-based methods, and their performance relies on manual feature extraction and large amounts of annotated data. Among deep learning-based methods, Reddy et al. [16] proposed a named entity recognition model based on BiLSTM + CRF that classifies entities in the software engineering domain into 22 entity types. For multi-source data such as unstructured data, semi-structured data and code, Lv et al. [17] extracted multi-source software knowledge entities using BiLSTM + CRF, template matching and abstract syntax trees, and used TF-IDF, TextRank and K-means to address the lack of annotated data. Tabassum et al. [18] combined StackOverflow question-and-answer text to construct a named entity recognition corpus for the computer programming domain, consisting of 15,372 sentences annotated with 20 fine-grained entity types, and proposed an attention-based entity recognition model that improved code entity recognition.
In summary, the extraction of software knowledge entities and the construction of software knowledge graphs have attracted the attention of many scholars and achieved substantial progress. However, due to entity variation, entity sparsity, entity ambiguity, numerous OOV words and the lack of annotated data sets in the software knowledge community, the task of software knowledge entity extraction has not yet been well solved.

Proposed method

To address the above problems and challenges, we propose a software knowledge entity extraction model, named AttenSy-SNER, which integrates syntactic features and semantic augmentation information to extract fine-grained software knowledge entities from unstructured user-generated content. The AttenSy-SNER model is divided into an input representation layer, a context encoding layer and a tag decoding layer. The overall architecture is shown in Fig. 1.
First, the AttenSy-SNER model uses the pre-trained BERT model for unsupervised learning on massive software knowledge community texts to generate word vector representations for the software engineering domain and enrich the input feature representation for software knowledge entity extraction. Second, since the context information and syntactic dependencies of sentence sequences play a key role in entity extraction, the model uses BiLSTM and GCN to encode the context information and the syntactic dependencies of sequences and obtain a multi-feature context representation, which effectively alleviates entity ambiguity. Meanwhile, to address entity variation, entity sparsity and out-of-vocabulary words in software knowledge community text, a semantic enhancement strategy based on the attention mechanism is proposed to enhance the semantic representation of words in the sequences, and the final feature vector representation of entities is obtained by feature fusion. Finally, a CRF model is used as the tag decoder to model the dependencies between output tags and obtain the globally optimal tag sequence.

Input representation layer based on BERT model

The input representation layer of the model transforms the sentence sequences of software knowledge community text into low-dimensional, dense distributed vector representations, which are fed to the next layer of the model to obtain context features. Related literature [19–21] shows that rich domain features obtained in the input representation layer help improve the performance of domain-specific entity extraction models.
As a pre-trained language model, BERT learns a large amount of prior information about lexical, syntactic and domain features for downstream tasks through unsupervised training on large corpora, helping to generate dynamic word vectors conditioned on the current context and thus improving the semantic disambiguation ability of the model. Architecturally, BERT uses a bidirectional Transformer encoder combined with the attention mechanism to learn the context of the current word and obtain a better distributed representation. The input of the BERT model consists of Token Embeddings, Segment Embeddings and Position Embeddings: Token Embeddings convert words to 768-dimensional word vector representations, starting with [CLS] and ending with [SEP]; Segment Embeddings provide a sentence-level feature vector for downstream sentence-level classification tasks; and Position Embeddings encode word position information into the vector representation, adding the key sequential characteristic to the sequence data [22]. For pre-training, BERT obtains character-level, word-level, sentence-level and inter-sentence relationship feature representations through joint training of the Masked Language Model (MLM) and Next Sentence Prediction (NSP) tasks. The MLM task randomly masks 15% of the words and predicts them from context; of these, 80% are replaced with [MASK], 10% are randomly replaced with other words, and 10% remain unchanged. The NSP task trains the model to understand the relationship between sentences by predicting whether one sentence follows another, strengthening the model on sentence-level tasks.
To obtain high-quality word vector representations of input sequences and improve the performance of entity extraction from software knowledge community text, we use BERT to pre-train on the massive question-and-answer text of StackOverflow and take the pre-trained word vectors as the input of the model. During the training of the AttenSy-SNER model, each word in the input sequence is represented by the corresponding word vector obtained by querying the BERT pre-trained word vectors, and then fed into the model.
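As an illustration, the snippet below shows one way such token vectors could be produced. It is a minimal sketch assuming the domain pre-trained checkpoint has been exported in the HuggingFace transformers format; the checkpoint path is hypothetical, as the paper does not name one.

```python
import torch
from transformers import BertTokenizerFast, BertModel

# Hypothetical path to the BERT checkpoint pre-trained on StackOverflow text;
# any checkpoint in HuggingFace format could be substituted.
CHECKPOINT = "./bert-stackoverflow"

tokenizer = BertTokenizerFast.from_pretrained(CHECKPOINT)
model = BertModel.from_pretrained(CHECKPOINT)
model.eval()

sentence = "How to convert a TIFF to a JPG with ImageMagick?"
enc = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    out = model(**enc)

# Shape (1, seq_len, 768): one context-dependent 768-dimensional vector
# per token, which the context encoding layer then consumes.
token_vectors = out.last_hidden_state
print(token_vectors.shape)
```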

Context encoder layer based on multi-feature fusion

For the fine-grained entity classification task of software knowledge, acquiring context information and syntactic dependencies can alleviate the entity ambiguity caused by different contexts. Therefore, BiLSTM and GCN are used to encode the context information and the syntactic dependencies of sentence sequences, respectively.

Context feature encoder based on BiLSTM model

In the task of software knowledge entity extraction, the data are sentence sequences from the software knowledge community, which are suitable for modeling with a Recurrent Neural Network (RNN). The RNN model has the characteristics of parameter sharing and memory, combining the output of the previous time step with the input of the current time step to determine the current output. Nevertheless, when processing long sequences the RNN suffers from vanishing and exploding gradients and cannot capture long-distance dependencies. The LSTM model proposed by Hochreiter and Schmidhuber [23] alleviates these problems by introducing a memory cell and a gating mechanism to process long-distance information. The recurrent unit structure is shown in Fig. 2.
The formal representation of the LSTM at time step t is as follows [23]:
$$ i_{t} = \sigma \left( W_{xi} x_{t} + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_{i} \right), $$
(1)
$$ f_{t} = \sigma \left( W_{xf} x_{t} + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_{f} \right), $$
(2)
$$ c_{t} = f_{t} c_{t-1} + i_{t} \tanh \left( W_{xc} x_{t} + W_{hc} h_{t-1} + b_{c} \right), $$
(3)
$$ o_{t} = \sigma \left( W_{xo} x_{t} + W_{ho} h_{t-1} + W_{co} c_{t} + b_{o} \right), $$
(4)
$$ h_{t} = o_{t} \tanh \left( c_{t} \right). $$
(5)
Here σ and tanh are the nonlinear activation functions of the neurons, \(i_t\) is the input gate state, \(f_t\) the forget gate state, \(c_t\) the cell state, \(o_t\) the output gate state, \(x_t\) the input vector and \(h_t\) the hidden state vector; W denotes the interlayer weight matrices and b the bias vectors of the neurons.
According to the structure of the LSTM model, at time step t the LSTM captures the preceding context of the current sentence sequence through the input gate, forget gate, cell and output gate states. However, it lacks the following context, which also plays an important role in software knowledge entity extraction. Therefore, two LSTMs running in opposite directions are combined into a bidirectional LSTM, which captures the context on both sides of position t simultaneously; the final output combines the outputs of the forward and backward LSTMs. The calculation proceeds as follows.
At time t, the output of the hidden state of the forward LSTM is:
$$ \overrightarrow{h}_{t} = {\text{LSTM}}^{\rightarrow} \left( X_{t}, h_{t-1} \right). $$
(6)
Here \(\overrightarrow{h}_{t}\) and \(X_{t}\) are the output and input at time t, and \(h_{t-1}\) is the hidden state at time t − 1. The output of the hidden state of the backward LSTM is:
$$ \overleftarrow{h}_{t} = {\text{LSTM}}^{\leftarrow} \left( X_{t}, h_{t+1} \right). $$
(7)
Here \(\overleftarrow{h}_{t}\) and \(X_{t}\) are the output and input at time t, and \(h_{t+1}\) is the hidden state at time t + 1. The total output of the bidirectional LSTM at time t is therefore:
$$ h_{t} = \left[ {\overrightarrow {h}_{t} ;\overleftarrow {h}_{t} } \right]. $$
(8)
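A minimal PyTorch sketch of this context encoder follows. The 768-dimensional input and the hidden size of 200 follow the settings reported later in Table 3, and nn.LSTM with bidirectional=True realizes Eqs. (6)–(8) in one call.

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Bidirectional LSTM over BERT token vectors, Eqs. (6)-(8)."""

    def __init__(self, input_dim: int = 768, hidden_dim: int = 200):
        super().__init__()
        # bidirectional=True runs the forward and backward LSTMs of
        # Eqs. (6) and (7); their outputs are concatenated as in Eq. (8).
        self.bilstm = nn.LSTM(input_dim, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, input_dim) -> h: (batch, seq_len, 2*hidden_dim)
        h, _ = self.bilstm(x)
        return h

encoder = ContextEncoder()
h = encoder(torch.randn(1, 10, 768))  # e.g. the 10-token example sentence
print(h.shape)                         # torch.Size([1, 10, 400])
```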

Syntactic feature encoder based on GCN model

Syntactic features are obtained by extracting the sentence structure and the dependencies between words through syntactic analysis techniques. Syntactic dependency features describe the local constituents of sentence sequences and the long-distance dependencies between words, so obtaining the dependency features between words captures both the dependency relations and the long-distance dependencies between entities, which helps recognize potential entities. Compared with the general domain, the data for software knowledge entity extraction in this paper come from the software knowledge community, which contains a large number of phrase entities and long-distance dependencies. Extracting the dependency relations between words in sentence sequences is therefore of great significance for software knowledge entity extraction.
GCN is a convolutional neural network for graph-structured data. By modeling graph nodes and edges it captures the dependencies between nodes, and it has gradually been applied in natural language processing, for example in text classification [24], semantic role labeling [25], relation extraction [26] and machine translation [27]. Since a standard LSTM network cannot model the dependency structure of sentence sequences in software knowledge community text, this paper uses a GCN to encode the dependency graph of the sentence sequence, treating each word as a node; each node generates its feature vector representation by collecting information from its neighboring nodes.
The input of the GCN-based syntactic feature encoder consists of two parts: the output of the BiLSTM-based context feature encoder and the adjacency matrix constructed by syntactic dependency analysis. The adjacency matrix of the sentence sequence is constructed from a syntactic dependency graph obtained in advance; in this paper we use the StanfordCoreNLP tool to analyze the syntactic dependencies of sentence sequences in software knowledge community texts. For example, for the sentence sequence “How to convert a TIFF to a JPG with ImageMagick?”, the dependency graph obtained through syntactic dependency analysis is shown in Fig. 3.
In this sentence sequence, the entities are “TIFF”, “JPG” and “ImageMagick”, and the head word is “convert”; the dependency relation between “TIFF” and “convert” is “obj”, indicating that the entity “TIFF” is the object of the predicate “convert”.
After the dependency graph of the sentence sequence is constructed, it can be transformed into an adjacency matrix, formally described as follows: for the sentence sequence \(X = \left( x_{1}, x_{2}, \ldots, x_{t} \right)\), each word is a node. If there is a dependency relation between node \(x_i\) and node \(x_j\), there is an edge between them and \(A_{ij} = 1\); self-loops are added so that \(A_{ii} = 1\). The adjacency matrix A is thus obtained. For example, the adjacency matrix of the above sentence sequence is:
$$ A = \left[ {\begin{array}{*{20}c} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 1 & 1 & 1 & 0 & 1 & 0 & 0 & 1 & 0 & 1 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 \\ \end{array} } \right]{,} $$
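The snippet below reconstructs this adjacency matrix in PyTorch from a list of (head, dependent) index pairs. The edge list is read off the dependency parse above; in practice it would come from a parser such as StanfordCoreNLP.

```python
import torch

# Tokens of the example sentence (Fig. 3).
tokens = ["How", "to", "convert", "a", "TIFF", "to", "a", "JPG",
          "with", "ImageMagick"]

# (head, dependent) index pairs from the dependency parse.
edges = [(2, 0), (2, 1), (2, 4), (2, 7), (2, 9),  # convert -> How/to/TIFF/JPG/ImageMagick
         (4, 3),                                   # TIFF -> a
         (7, 5), (7, 6),                           # JPG  -> to/a
         (9, 8)]                                   # ImageMagick -> with

n = len(tokens)
A = torch.eye(n)           # self-loops: every node keeps its own features
for head, dep in edges:
    A[head, dep] = 1.0     # one entry per dependency relation

print(A)                   # reproduces the adjacency matrix A above
```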
In the l-th layer of the GCN syntactic feature encoder, node i aggregates the features of its neighboring nodes through the graph convolution operation to produce its output feature vector:
$$ h_{i}^{l} = {\text{ReLU}}\left( \mathop \sum \limits_{j = 1}^{n} A_{ij} W^{l} h_{j}^{l - 1} + b^{l} \right), $$
(9)
where ReLU is the nonlinear activation function, \(A_{ij}\) is the adjacency matrix entry for nodes i and j, \(h_{j}^{l-1}\) is the input representation of node j at layer l, \(h_{i}^{l}\) is the output of node i at layer l, \(W^{l}\) is the weight matrix, and \(b^{l}\) is the bias vector.
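A minimal sketch of this graph convolution in PyTorch follows. It implements Eq. (9) directly, without the degree normalization some GCN variants add; the input dimension of 400 assumes the concatenated BiLSTM output of Eq. (8), and A is the adjacency matrix built in the snippet above.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph-convolution layer over the dependency graph, Eq. (9)."""

    def __init__(self, dim: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(dim, dim))  # W^l
        self.bias = nn.Parameter(torch.zeros(dim))         # b^l
        nn.init.xavier_uniform_(self.weight)

    def forward(self, h: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
        # h: (seq_len, dim) node features, A: (seq_len, seq_len) adjacency.
        # A @ (h W^T) sums each node's neighbour features before the bias.
        return torch.relu(A @ (h @ self.weight.T) + self.bias)

# Stacking two layers, the best-performing depth reported in Table 6.
gcn1, gcn2 = GCNLayer(400), GCNLayer(400)
h = torch.randn(10, 400)      # BiLSTM output for the 10-token example
h = gcn2(gcn1(h, A), A)       # A from the adjacency snippet above
print(h.shape)                # torch.Size([10, 400])
```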

Semantic enhancement strategy based on attention mechanism

Entity variation, entity sparsity and out-of-vocabulary words in software knowledge community texts affect the quality of software knowledge entity extraction. Especially with fine-grained entity types, traditional methods based on domain dictionaries or external features introduce noise and error propagation into the extraction results, causing labeling errors for software knowledge entities.
Related literature [28] shows that enhancing the semantic feature representation of entities in a specific domain can effectively alleviate entity variation, entity sparsity and out-of-vocabulary problems and improve the accuracy of software knowledge entity extraction. Therefore, a semantic enhancement strategy based on the attention mechanism is introduced in the context encoding layer of the model to enhance the semantic representation of entities in the sentence sequence. It comprises three steps: similar word extraction based on semantic consistency, semantic contribution weight calculation based on the attention mechanism, and feature vector fusion. The strategy is described as Algorithm 1.
As shown in Algorithm 1, the process is as follows: first, pre-trained word vectors and a software engineering thesaurus are introduced as auxiliary resources to extract semantically consistent domain words through semantic similarity calculation; then, the attention mechanism is used to calculate the semantic contribution weight of each similar domain word to the target word; finally, the final vector representation of the word is obtained by fusing in the semantic enhancement vector.
Similar word extraction based on semantic consistency
External auxiliary resources play an important role in similar word extraction and affect the quality of the semantic enhancement vector representation. On the one hand, in the input representation layer of the AttenSy-SNER model, we use BERT to pre-train on the 43,594,128 sentence sequences of software knowledge community texts and obtain pre-trained word vectors for the software engineering domain; these vectors contain rich semantic information about software knowledge entities and can serve as an auxiliary corpus for similar word extraction. On the other hand, some research works [29, 30] construct thesauri for the software engineering domain automatically through unsupervised learning from StackOverflow and Wikipedia, providing functions such as abbreviation recognition and similar word recognition for the software engineering domain.
To improve the quality of similar word extraction and ensure semantic consistency, this paper jointly considers similar words extracted from the BERT pre-trained word vectors and from the software engineering thesaurus SEthesaurus [29] as auxiliary resources for semantic enhancement. For example, for the word “PHP” in the input sentence sequence “My PHP Version is 7.1”, similar words such as “JavaScript”, “Ruby”, “Groovy”, “Cython” and “Sinatra” are extracted by semantic similarity calculation. These similar words are relevant to the software domain and serve as auxiliary resources to enhance the semantic representation of the word “PHP”.
Semantic contribution weight calculation based on attention mechanism
Since similar words extracted in different contexts contribute differently to the semantics of the target word, this paper uses the attention mechanism to weight similar words according to their semantic contribution while taking their semantic consistency into account.
The essence of the attention mechanism is to selectively focus on important information; it is a selection mechanism for allocating information processing capacity. It can be described as mapping a Query against a series of Key-Value pairs: the similarity between the Query and each Key determines the weight assigned to the corresponding Value [31, 32]:
$$ {\text{Attention}}\left( {Q,K,V} \right) = \mathop \sum \limits_{i = 1}^{l} {\text{Sim}}\left( {Q_{i} ,K_{i} } \right) \cdot V_{i } . $$
(10)
Based on the attention mechanism, given a sentence of software knowledge community text \(X = \left( x_{1}, x_{2}, \ldots, x_{t} \right)\), for each word \(x_{i} \in X\) we extract n similar words \(S_{i} = \left( s_{1}, s_{2}, \ldots, s_{n} \right)\) and obtain their word vector representations \(C_{i} = \left( c_{1}, c_{2}, \ldots, c_{n} \right)\). The semantic contribution weight of each similar word \(s_{j}\) to \(x_{i}\) is then:
$$ w_{ij} = \frac{\exp \left( h_{i} \cdot c_{j} \right)}{\mathop \sum \nolimits_{k = 1}^{n} \exp \left( h_{i} \cdot c_{k} \right)}, $$
(11)
where \(h_{i}\) is the context hidden vector corresponding to the word \(x_{i}\) in the sentence sequence, and the dot product is used as the similarity function. Once the contribution weight of each similar word \(s_{j}\) to the word \(x_{i}\) is obtained, the semantic enhancement embedding vector of the current word is computed as the weighted sum:
$$ A_{i} = \mathop \sum \limits_{j = 1}^{n} w_{ij} c_{j} . $$
(12)
Feature vector fusion
After the above two steps, the semantic enhancement embedding vector \(A_i\) of the current word \(x_i\) in the sentence sequence of the software knowledge community text is obtained; it is then fused with the context-encoded hidden vector \(h_i\) of \(x_i\) to serve as the input vector of the linear CRF layer:
$$ h_{i} = h_{i} \oplus A_{i} . $$
(13)
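The following sketch implements Eqs. (11)–(13) for a single word. It assumes the similar-word vectors have already been projected to the dimension of the hidden vector, and treats ⊕ as vector concatenation, consistent with the fusion description above.

```python
import torch
import torch.nn.functional as F

def semantic_enhance(h_i: torch.Tensor, C_i: torch.Tensor) -> torch.Tensor:
    """Eqs. (11)-(13): attention-weighted fusion of similar-word vectors.

    h_i: (dim,)   context hidden vector of word x_i
    C_i: (n, dim) vectors of the n similar words extracted for x_i
    """
    # Eq. (11): dot-product similarity, normalised with softmax.
    w = F.softmax(C_i @ h_i, dim=0)          # (n,)
    # Eq. (12): weighted sum gives the semantic-enhancement vector A_i.
    a_i = w @ C_i                            # (dim,)
    # Eq. (13): fuse with the context vector (concatenation assumed).
    return torch.cat([h_i, a_i], dim=-1)

# Toy usage: a 400-dim hidden vector and 5 similar-word vectors.
h_i = torch.randn(400)
C_i = torch.randn(5, 400)
print(semantic_enhance(h_i, C_i).shape)      # torch.Size([800])
```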

Tag decoding layer

The context encoding layer captures the context information of the sentence sequence in the software knowledge community text and produces a tag score for each word. However, directly selecting the highest-scoring label as the predicted output ignores the dependencies between labels and can yield erroneous or invalid label sequences. For example, in this paper software knowledge entities are annotated with the BIO scheme of the sequence labeling specification: the tag sequence “B-APIWA I-APIWA” is valid, while “B-APIWA I-APIPM” and “O I-APIWA” are invalid. Therefore, a linear CRF layer is added to obtain a globally optimal tag sequence by considering the constraints between adjacent tags at the sentence level, avoiding such errors.
The linear CRF layer is formally represented as follows [33]: the hidden state sequence \(h = \left( h_{1}, h_{2}, \ldots, h_{t} \right)\) output by the context encoding layer is transformed into the optimal tag sequence \(y = \left( y_{1}, y_{2}, \ldots, y_{t} \right)\). The calculation proceeds as follows: first, for the given sentence sequence \(X = \left( x_{1}, x_{2}, \ldots, x_{t} \right)\) of the software knowledge community text, the total score of a tag sequence is calculated by formula (14); then, the probability of tag sequence y is normalized through the Softmax function, as shown in formula (15); finally, the Viterbi dynamic programming algorithm is used to find the tag sequence with the highest score, formula (16).
$$ {\text{score}}\left( h, y \right) = \mathop \sum \limits_{t = 1}^{T} Z_{y_{t-1}, y_{t}} + \mathop \sum \limits_{t = 1}^{T} W_{y_{t}}^{t} h_{t} , $$
(14)
$$ P\left( y \mid h \right) = \frac{e^{{\text{score}}\left( h, y \right)}}{\mathop \sum \nolimits_{\omega \in Y\left( h \right)} e^{{\text{score}}\left( h, \omega \right)}}, $$
(15)
$$ y^{*} = \mathop {\arg \max }\limits_{\omega \in Y\left( h \right)} {\text{score}}\left( h, \omega \right), $$
(16)
where Z is the transition matrix between tags, \(Z_{y_{t-1}, y_{t}}\) is the score of transitioning from tag \(y_{t-1}\) to tag \(y_{t}\), \(W_{y_{t}}^{t}\) is the parameter vector, and \(Y(h)\) is the set of all possible tag sequences.
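A compact sketch of the Viterbi decoding step of Eq. (16) is given below. The emission scores \(W_{y_t}^{t} h_t\) and the transition matrix Z are passed in as tensors, and the function returns the highest-scoring tag path.

```python
import torch

def viterbi_decode(emissions: torch.Tensor, transitions: torch.Tensor):
    """Find the highest-scoring tag sequence y* of Eqs. (14)-(16).

    emissions:   (T, K) per-position tag scores from the encoder
    transitions: (K, K) tag-transition scores Z[i, j] for tag i -> tag j
    """
    T, K = emissions.shape
    score = emissions[0]                # best score ending in each tag
    back = []
    for t in range(1, T):
        # score[i] + Z[i, j] + emissions[t, j] for every tag pair (i, j)
        total = score.unsqueeze(1) + transitions + emissions[t]
        score, idx = total.max(dim=0)   # best previous tag for each tag j
        back.append(idx)
    # Follow back-pointers from the best final tag.
    best = [int(score.argmax())]
    for idx in reversed(back):
        best.append(int(idx[best[-1]]))
    return best[::-1]

# Toy example: 5 tokens, 3 tags (e.g. O, B-APIWA, I-APIWA).
print(viterbi_decode(torch.randn(5, 3), torch.randn(3, 3)))
```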

Experimental setup

To evaluate the performance of the AttenSy-SNER model proposed in this paper, comparative experiments with benchmark entity extraction models were carried out. The AttenSy-SNER model was implemented in Python using the deep learning framework PyTorch. All experiments were conducted on a machine with an Intel Xeon Gold 5117 processor (2.0 GHz) and an NVIDIA Tesla T4 GPU with 16 GiB of memory.

Dataset

Construction of pre-trained word vectors based on BERT model

First, we downloaded the official StackOverflow data dump, which uses SQL Server 2008 to store all StackOverflow question-and-answer text from 2008 to 2018.
Second, in the software knowledge community StackOverflow, tags represent the knowledge domain of a question, and users attach 1–5 tags according to the question topic, so the huge volume of software engineering text is preliminarily classified with the help of the tag system. The more questions a tag covers, the more attention is paid to that domain and the greater the probability that its questions will be answered or attract high-quality answers. To obtain high-quality corpus texts, the strategy for selecting Q&A texts in this paper is as follows: first, all StackOverflow tags are sorted by the number of posts they cover, and the tags with higher attention are selected; second, posts are selected based on whether they have accepted answers, their question scores, answer scores and view counts. The text of a selected post consists of the question title, the question description, the accepted answer, and randomly selected comments. Finally, the software engineering text corpus is obtained after data pre-processing.
The BERT-Base pre-training model is used for unsupervised learning on the 43,594,128 sentences (covering 2,656,719 questions and 5,526,559 answers and comments) in the software engineering text corpus. Given the corpus and the experimental hardware, the batch size, learning rate and maximum sentence length are set to 32, 0.0001 and 128, respectively, which minimizes the sum of the Masked Language Model loss and the Next Sentence Prediction loss and gives the best training results. Finally, a 768-dimensional word vector representation for the software engineering domain is obtained for the software knowledge entity extraction task.
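A minimal sketch of one such pre-training step with the HuggingFace transformers library is shown below. The library choice, the default WordPiece vocabulary and the batch field names are assumptions; the batch size, learning rate and sequence length follow the reported settings.

```python
import torch
from transformers import BertConfig, BertForPreTraining

# BERT-Base configuration; the WordPiece vocabulary size is the library
# default and an assumption here.
config = BertConfig(hidden_size=768, num_hidden_layers=12,
                    num_attention_heads=12, intermediate_size=3072)
model = BertForPreTraining(config)        # joint MLM + NSP objectives
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def pretrain_step(batch):
    """One step on a batch of 32 masked sequences of length <= 128."""
    outputs = model(
        input_ids=batch["input_ids"],            # (32, 128) token ids
        attention_mask=batch["attention_mask"],
        token_type_ids=batch["token_type_ids"],  # sentence A/B segments
        labels=batch["mlm_labels"],              # MLM targets, -100 = unmasked
        next_sentence_label=batch["nsp_labels"], # 0 = is next, 1 = random
    )
    outputs.loss.backward()                      # sum of MLM and NSP losses
    optimizer.step()
    optimizer.zero_grad()
    return float(outputs.loss)
```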

Construction of annotated data sets in software engineering domain

Due to the lack of an open annotated data set for the software engineering domain, we build one from StackOverflow question-and-answer text, adopting the same text selection strategy described above to ensure high-quality corpus text.
Pre-defining entity types is a key step of entity extraction, reflecting the granularity of entities and the goal of knowledge graph construction. Compared with entity types in the general domain, a software knowledge graph is a domain knowledge graph and requires a more detailed entity type definition based on the knowledge requirements of the software engineering field and the goal of the software knowledge graph. In related research there is no unified standard for software knowledge entity types; most are pre-defined according to the application goal of the software knowledge graph. The aim of this paper is to extract knowledge entities related to software programming from community question-and-answer texts and to provide entity-centric software knowledge retrieval and recommendation services for software developers. Therefore, combining the Wikipedia category system related to software programming with the knowledge requirements of software developers, this paper expands the entity types of the literature [13] and predefines 40 entity types across eight aspects: programming language, system platform, software API, software tool, software development library, software framework, software standard, and software development process. Details are shown in Table 1.
Table 1
Software knowledge entity types

Entity type | Tag | Examples
Object oriented | PLOO | Java, python, c#
Scientific computing | ToolSCS | Matlab, ATLAB
Procedural oriented | PLPD | Ada, C, Fortran
Integrated development | ToolIDE | Xcode, Pycharm
Scripting language | PLSL | JS, Groovy, Jscript
Software plugin | ToolSDP | JProfiler
Markup language | PLML | HTML, XML
Database | ToolDB | Oracle, MySQL
Query language | PLSQL | SQL, GraphQL, linq
Application software | ToolAS | Word, Excel
CPU instruction sets | PlatCIS | IA-32, x86-64
Vision library | SLVDL | OpenCV
Command parser | PlatCP | bash, sh, ksh
Game library | SLGDL | Cocos2D
Cloud computing | PlatCCP | Hadoop, amazon-s3
Mobile library | SLMDL | Retrofit
Web server | PlatCWSW | apache, nginx
Software libraries | SLSL | jQuery
Operating systems | PlatCOS | Android, Ubuntu
Web application framework | SFWAF | jsf, Django
Packages | APIP | Java.Lang
Server-side framework | SFSSF | Spring 4.2
Public methods | APIPM | charAt(), toString()
Data formats | StanDF | .jpg, .png
Database query | APIDQ | LIKE, select
Standard protocols | StanSP | SMTP, ssl
Web API | APIWA | POST, GET
Coding standards | StanCS | utf-8, gbk
System API | APISYS | GetMessagePos
Design patterns | StanSDP | mvc, REST
Event API | APIOE | onClickListener
Software operation | StanSO | download
Software modeling | ToolSEM | umlet, green-uml
Project deployment | SDPSPD | Maven, gradle
Software test | SDPSFT | LoadRunner
Version control | SDPVC | VSS, CVS
Bug exception | SDPBUG | SyntaxError
Algorithm | SDPALG | Bubble sort
For data annotation we adopt the BIO scheme, in which “B-” marks the starting position of a software knowledge entity, “I-” marks the interior of the entity, and “O” marks non-entity tokens. The annotation team consists of 10 teachers, software developers, graduate students and undergraduates with a software engineering background. After 5 rounds of cross-validation, the annotated data set for the software engineering domain is obtained. To ensure sound experimental results, the data set is split into training, validation and test sets at a ratio of 7:1:2 for the software knowledge entity extraction experiments. Details of the data set are shown in Table 2, and a small illustration of the BIO format follows the table.
Table 2
Dataset details

 | Training set | Validation set | Test set | Total
Questions | 294 | 42 | 83 | 419
Answers | 625 | 89 | 179 | 893
Sentences | 4460 | 637 | 1274 | 6371
Tokens | 90,395 | 12,913 | 25,827 | 129,135
Entities | 5458 | 780 | 1559 | 7797
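To illustrate the BIO scheme with the tags of Table 1, a short hypothetical annotation fragment is shown below; the labels are plausible under the scheme but are not taken from the gold data.

```
Use      O
Spring   B-SFSSF    (server-side framework, begins entity "Spring 4.2")
4.2      I-SFSSF    (interior of the same entity)
with     O
Java     B-PLOO     (object-oriented programming language)
```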

Parameter setting

During training of the AttenSy-SNER model, the dimension of the pre-trained word vectors in the input representation layer is 768, the hidden state size of the BiLSTM in the context encoding layer is 200, and the GCN is set to 1–3 layers. Categorical cross-entropy is used as the loss function and Adam as the optimizer, with an initial learning rate of 0.001. L2 regularization and the Dropout mechanism are adopted to prevent over-fitting during training. The hyperparameter settings of the model are shown in Table 3, and a sketch of the corresponding training setup follows it.
Table 3
Hyperparameters of the proposed model

Name | Value
Word embedding dimension | 768
BiLSTM state size | 200
GCN layers | 1–3
Batch size | 10
Epochs | 1000
Optimizer | Adam
Dropout | 0.5
Learning rate | 0.001
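The sketch below wires these hyperparameters into a PyTorch training setup. The network shown is only a stand-in for the full AttenSy-SNER architecture, and the weight_decay value realizing L2 regularization is a placeholder, as the paper does not report it.

```python
import torch
import torch.nn as nn

# Stand-in network: the real model is BERT -> BiLSTM -> GCN -> CRF.
num_tags = 81                    # assumption: 40 types x {B-, I-} plus O
model = nn.Sequential(
    nn.Linear(768, 400),         # 768-dim word vectors, 2 x 200 hidden units
    nn.ReLU(),
    nn.Dropout(p=0.5),           # dropout rate from Table 3
    nn.Linear(400, num_tags),
)
loss_fn = nn.CrossEntropyLoss()  # categorical cross-entropy
# weight_decay implements L2 regularization; its value is a placeholder.
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)
```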

Evaluation metrics

In this paper, the standard evaluation metrics for information extraction tasks are used to evaluate the model: precision (P), recall (R) and F1 score (F1). Precision is the fraction of recognized entities that are correct; recall is the fraction of true entities that are recognized; the F1 score is the harmonic mean of precision and recall and serves as the overall performance measure. The metrics are defined as follows:
$$ P = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FP}}}}, $$
(17)
$$ R = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FN}}}} , $$
(18)
$$ F1 = \frac{2 \times P \times R}{{P + R}}, $$
(19)
where TP (True Positive) is the number of entities the model recognizes correctly, FP (False Positive) is the number of spurious entities the model recognizes, and FN (False Negative) is the number of true entities the model fails to recognize.
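A minimal entity-level implementation of these metrics might look as follows, with entities represented as (start, end, type) triples as in the task definition and counted as correct only on an exact match.

```python
def prf1(gold: set, pred: set):
    """Entity-level precision, recall and F1, Eqs. (17)-(19)."""
    tp = len(gold & pred)            # exact (start, end, type) matches
    fp = len(pred - gold)            # spurious predictions
    fn = len(gold - pred)            # missed gold entities
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = {(4, 4, "StanDF"), (7, 7, "StanDF"), (9, 9, "ToolAS")}
pred = {(4, 4, "StanDF"), (9, 9, "SLVDL")}
print(prf1(gold, pred))              # (0.5, 0.333..., 0.4)
```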

Experimental results and discussion

Compared with the benchmark sequence labeling model BiLSTM-CRF, the AttenSy-SNER model proposed in this paper improves the input representation layer and the context encoding layer. The following experiments therefore compare and analyze the performance of the model before and after each improvement.

Contribution of pre-trained word vectors to model performance

In the input representation layer of the AttenSy-SNER model, the pre-trained word vectors are obtained by unsupervised training of the BERT model on the massive question-and-answer text corpus of StackOverflow. To evaluate the contribution of the pre-trained word vectors to the software knowledge entity extraction task, a comparative experiment was conducted against the benchmark BiLSTM-CRF model. The experimental results are shown in Table 4 and Fig. 4.
Table 4
Contribution of pre-trained word vectors to model performance

Model | P% | R% | F1%
BiLSTM-CRF | 67.74 | 60.15 | 63.94
BERT-BiLSTM-CRF | 71.59 | 67.05 | 69.25
According to the comparison results, the BERT-BiLSTM-CRF model outperforms the benchmark BiLSTM-CRF model: after embedding the BERT pre-trained word vectors, precision increases by roughly 4 percentage points, recall by 7 and F1 by 5. This shows that the Transformer-based BERT model, with its strong text feature extraction ability, enriches the feature representation of the input layer and thereby improves the performance of the software knowledge entity extraction task.

Contribution of context features based on BiLSTM to model performance

To explore the effect of BiLSTM-based context information on software knowledge entity extraction, this paper compares LSTM and BiLSTM with different numbers of layers, adopting BERT-CRF as the benchmark model. The experimental results are shown in Table 5 and Fig. 5.
Table 5
Contribution of context features to model performance

Model | P% | R% | F1%
BERT-CRF | 68.17 | 63.32 | 65.66
BERT-LSTM-CRF | 69.04 | 63.54 | 66.18
BERT-BiLSTM-CRF (L = 1) | 71.59 | 67.05 | 69.25
BERT-BiLSTM-CRF (L = 2) | 70.23 | 65.38 | 67.72
Compared with the benchmark BERT-CRF model, precision and F1 improve once context encoding is added, indicating that context feature extraction helps retain the semantic information of the text. Meanwhile, the F1 score of the BiLSTM model is about 3 percentage points higher than that of the LSTM, indicating that the bidirectional LSTM captures the context of the sentence sequence more effectively.
As the number of BiLSTM layers increases, the F1 score decreases: with 2 BiLSTM layers, precision and F1 drop by roughly 1 and 2 percentage points, respectively, suggesting that the deeper network may fall into a local optimum or overfit.

Contribution of syntactic features based on GCN to model performance

To explore the influence of GCN-based syntactic dependencies on the software knowledge entity extraction task, this paper compares extraction results with different numbers of GCN layers, adopting BERT-BiLSTM-CRF as the reference model. The experimental results are shown in Table 6 and Fig. 6.
Table 6
Contribution of syntactic features to model performance

Model | P% | R% | F1%
BERT-BiLSTM-CRF | 71.59 | 67.05 | 69.25
BERT-BiLSTM-GCN-CRF (L = 1) | 73.14 | 67.49 | 70.21
BERT-BiLSTM-GCN-CRF (L = 2) | 73.77 | 69.23 | 71.43
BERT-BiLSTM-GCN-CRF (L = 3) | 71.32 | 68.42 | 69.84
According to the comparative results, the F1 score improves after syntactic features are integrated at every GCN depth, indicating that syntactic dependency features contribute to the performance of the software knowledge entity extraction task.
The results also show that with 2 GCN layers the precision and F1 of the model are highest, improving on the 1-layer GCN; with 3 layers, precision and F1 decrease, indicating that adding further GCN layers leads to over-fitting.

Contribution of semantic enhancement based on attention mechanism to model performance

The attention-based semantic enhancement strategy extracts similar words from the BERT pre-trained word vectors and the software engineering thesaurus SEthesaurus, weights the semantic contributions of the similar words with the attention mechanism, and obtains the semantically enhanced representation of a word by weighted summation, thereby alleviating entity variation, entity sparsity and out-of-vocabulary problems. For example, during model training, the similar words of the software knowledge entity “CentOS” and their semantic contributions obtained through the strategy are shown in Fig. 7.
According to Fig. 7, the entity “Rehel” has the greatest semantic contribution, and entity variants such as “Centos6” and “Rehel7” also contribute to the semantic representation of the target entity as auxiliary resources. Attention-based semantic enhancement can therefore alleviate the entity variation problem in software knowledge community text and improve the domain adaptability of the model.
To evaluate the contribution of the attention-based semantic enhancement strategy to the software knowledge entity extraction task, the AttenSy-SNER model is compared with BiLSTM-CRF, BERT-BiLSTM-CRF and BERT-BiLSTM-GCN-CRF. The experimental results are shown in Table 7 and Fig. 8.
Table 7
Contribution of semantic enhancement to model performance

Model | Word feature | Syntactic feature | Semantic enhancement | P% | R% | F1%
BiLSTM-CRF | ✕ | ✕ | ✕ | 67.74 | 60.15 | 63.94
BERT-BiLSTM-CRF | ✓ | ✕ | ✕ | 71.59 | 67.05 | 69.25
BERT-BiLSTM-GCN-CRF | ✓ | ✓ | ✕ | 73.77 | 69.23 | 71.43
AttenSy-SNER | ✓ | ✓ | ✓ | 76.87 | 71.58 | 74.17
In Table 7, the symbol “✓” indicates that the corresponding feature representation is used by the model and “✕” that it is not. The comparison shows that the AttenSy-SNER model improves the F1 score by roughly 10, 5 and 3 percentage points over the other three models, respectively. This suggests that providing domain-related similar words for the sentence sequence and weighting them with the attention mechanism effectively enhances the semantic representation of words and thus improves the extraction of software knowledge entities.
To explore the effect of the attention-based semantic enhancement strategy on entity sparsity and out-of-vocabulary words, the extraction results of the AttenSy-SNER model were analyzed. The training set contains 5458 software knowledge entities and the test set contains 1559, of which 451 are out-of-vocabulary entities. The recall on out-of-vocabulary entities in the test set is shown in Table 8.
Table 8
Results of out-of-vocabulary entity recognition

Model | R%
BiLSTM-CRF | 35.41
BERT-BiLSTM-CRF | 42.39
BERT-BiLSTM-GCN-CRF | 45.87
AttenSy-SNER | 59.48
The results show that the AttenSy-SNER model with the semantic enhancement strategy achieves a higher recall (R) than the three models without it and recognizes out-of-vocabulary entities better. The attention-based semantic enhancement strategy thus strengthens the semantic representation of entities by integrating the vector representations of similar words, which helps solve the out-of-vocabulary problem in the software knowledge community and thereby alleviates the entity sparsity problem.

Contrastive analysis of model training

The training process of a deep learning model continually updates its parameters. To further understand the training behavior of the model, this paper compares the process data of the first 100 training epochs. The relationship between the F1 score of each model and the epoch is shown in Fig. 9.
The F1 score of the BiLSTM-CRF model, which lacks BERT pre-trained word vectors, rises continuously from a low initial value, whereas the three models with BERT pre-trained word vectors start from a higher initial value and remain at a higher level throughout. This verifies that embedding BERT pre-trained word vectors in the input layer effectively extracts the text features of the software knowledge community, enriches the feature representation of the input layer, and contributes substantially to the performance of the software knowledge entity extraction task.
Compared with the other models, the AttenSy-SNER model proposed in this paper attains the highest F1 score from the initial stage by integrating syntactic features and semantic enhancement information; around the 40th epoch its loss function begins to converge and it maintains the best F1 score thereafter. The results show that the BERT pre-trained word vectors, the GCN-based syntactic dependencies and the attention-based semantic enhancement all play important roles in improving the performance of the software knowledge entity extraction task.

Conclusion

In view of the entity variation, entity sparsity, entity ambiguity, out-of-vocabulary words and the lack of annotated data sets in software knowledge community text, we consider word, syntactic, entity context and semantic features together and propose the software knowledge entity extraction model AttenSy-SNER. It combines pre-trained word vector representations, syntactic dependency features and entity semantic enhancement information to extract fine-grained software knowledge entities from unstructured user-generated content. To address the lack of open data sets in the software engineering field, BERT-based pre-trained word vectors and an annotated data set covering 8 aspects and 40 fine-grained entity types are constructed from StackOverflow question-and-answer texts. The comparative experimental analysis shows that the AttenSy-SNER model proposed in this paper is superior to the current benchmark models in the software knowledge entity extraction task and paves the way for constructing a software knowledge graph in future work.

Declarations

Conflict of interest

The authors declare no competing interests.
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Wang T, Yin G, Wang HM, Yang C, Zou P (2015) Automatic knowledge sharing across communities: a case study on android issue tracker and StackOverflow. In: 2015 IEEE symposium on service-oriented system engineering, San Francisco, CA, USA, pp 107–116. https://doi.org/10.1109/SOSE.2015.34
13. Ye DH, Xing ZC, Foo CY, Ang ZQ, Li J, Kapre N (2016) Software-specific named entity recognition in software engineering social content. In: IEEE 23rd international conference on software analysis, evolution, and reengineering, pp 90–101. https://doi.org/10.1109/SANER.2016.10
20. Strubell E, Verga P, Belanger D, McCallum A (2017) Fast and accurate entity recognition with iterated dilated convolutions. In: Proceedings of the 2017 conference on empirical methods in natural language processing, Copenhagen, Denmark, pp 2670–2680. https://doi.org/10.18653/v1/D17-1283
21. Xu MB, Jiang H, Watcharawittayakul S (2017) A local detection approach for named entity recognition and mention detection. In: Proceedings of the 55th annual meeting of the association for computational linguistics, Vancouver, Canada, pp 1237–1247. https://doi.org/10.18653/v1/P17-1114
22. Devlin J, Chang MW, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics, Minneapolis, Minnesota, pp 4171–4186. https://doi.org/10.18653/v1/N19-1423
24. Yao L, Mao CS, Luo Y (2019) Graph convolutional networks for text classification. In: The 33rd AAAI conference on artificial intelligence, pp 7370–7377
25. Marcheggiani D, Titov I (2017) Encoding sentences with graph convolutional networks for semantic role labeling. In: Proceedings of the 2017 conference on empirical methods in natural language processing, Copenhagen, Denmark, pp 1506–1515. https://doi.org/10.18653/v1/D17-1159
27. Bastings J, Titov I, Aziz W, Marcheggiani D, Sima'an K (2017) Graph convolutional encoders for syntax-aware neural machine translation. In: Proceedings of the 2017 conference on empirical methods in natural language processing, pp 1957–1967. https://doi.org/10.18653/v1/D17-1209
31. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al (2017) Attention is all you need. In: Proceedings of the 31st conference on neural information processing systems, pp 5998–6008
32. Margatina K, Baziotis C, Potamianos A (2019) Attention-based conditioning methods for external knowledge integration. In: Proceedings of the 57th annual meeting of the association for computational linguistics, Florence, Italy, pp 3944–3951. https://doi.org/10.18653/v1/P19-1385
Metadata
Title: AttenSy-SNER: software knowledge entity extraction with syntactic features and semantic augmentation information
Authors: Mingjing Tang, Tong Li, Wei Gao, Yu Xia
Publication date: 02.06.2022
Publisher: Springer International Publishing
Published in: Complex & Intelligent Systems, Issue 1/2023
Print ISSN: 2199-4536 | Electronic ISSN: 2198-6053
DOI: https://doi.org/10.1007/s40747-022-00742-5
