
Open Access 28-09-2023 | Original Article

Deeply integrating unsupervised semantics and syntax into heterogeneous graphs for inductive text classification

Authors: Yue Gao, Xiangling Fu, Xien Liu, Ji Wu

Published in: Complex & Intelligent Systems | Issue 1/2024


Abstract

Graph-based neural networks and unsupervised pre-trained models are both cutting-edge text representation methods, given their outstanding ability to capture global information and contextualized information, respectively. However, both representation methods meet obstacles to further performance improvements. On one hand, graph-based neural networks lack knowledge orientation to guide textual interpretation during global information interaction. On the other hand, unsupervised pre-trained models contain rich semantic and syntactic knowledge that lacks sufficient induction and expression. Therefore, how to effectively integrate graph-based global information with unsupervised contextualized semantic and syntactic information to achieve better text representation remains an open problem. In this paper, we propose a representation method that deeply integrates Unsupervised Semantics and Syntax into heterogeneous Graphs (USS-Graph) for inductive text classification. By constructing a heterogeneous graph whose edges and nodes are entirely generated from the knowledge of unsupervised pre-trained models, USS-Graph harmonizes the two perspectives of information under a bidirectionally weighted graph structure and thereby realizes the intra-fusion of graph-based global information and unsupervised contextualized semantic and syntactic information. Based on USS-Graph, we also propose a series of optimization measures to further improve knowledge integration and representation performance. Extensive experiments conducted on benchmark datasets show that USS-Graph consistently achieves state-of-the-art performance on inductive text classification tasks. Additionally, extended experiments are conducted to analyze the characteristics of USS-Graph and the effectiveness of the proposed optimization measures for further knowledge integration and information complementation.

Introduction

Text classification is one of the most fundamental tasks in the field of natural language processing (NLP), which can be extended to numerous branch tasks in different application scenarios, such as sentiment analysis, spam filtering [1, 2], intent detection [3, 4], and EHR-based disease diagnoses [5, 6].
Text representation learning is the kernel step for text classification, as it is crucial to classification performance. To date, research on text representation can be summarized into four stages. The first stage comprises traditional methods represented by Naive Bayes [7], k-Nearest Neighbor [8], and Support Vector Machine [9], whose representations depend on hand-crafted features such as bag-of-words features, sparse lexical features [10], and entity-based features [11]. Such feature engineering, however, tends to be labor-intensive and inefficient. The second stage comprises sequential deep learning methods represented by convolutional neural networks [12, 13] and recurrent neural networks [14, 15], both of which extract features from local consecutive word sequences. Such sequential methods focus on the locality of words and therefore lack long-distance and non-consecutive word interactions. The third stage comprises unsupervised pre-trained models represented by BERT [16]. With the self-attention mechanism and multi-task pre-training on large corpora, unsupervised pre-trained models deliver impressive text representation performance. However, even though all tokens are involved in encoding each token, these attention-based networks still mainly focus on local consecutive word sequences, which provide local context information. Therefore, the self-attention mechanism only alleviates the locality problem, and such models still primarily encode contextualized information.
In contrast, graph-based deep learning methods represented by graph neural networks [17, 18] can take global features into account to deal with complex text structures, and are therefore considered the fourth stage of research on text representation. Recently, applying graph neural networks to text representation tasks has attracted widespread attention [19, 20]. Depending on the data sources involved in the training process, graph neural networks can be divided into transductive and inductive learning. For transductive models represented by TextGCN [20], both training and test documents must be used to construct a single graph containing all documents and words, whereas for inductive models represented by GAT and TextING [21, 22], every document owns its specific graph structure, which means that test documents are not needed during training. Thus, due to their flexibility and adaptability, inductive graph neural networks have an inherent advantage in text classification tasks.
Graph construction is the first essential stage in realizing a graph neural network. For general text-oriented graph paradigms, edge construction is based on traditional machine learning measures such as term frequency-inverse document frequency (TF-IDF) and point-wise mutual information (PMI), which judge the relationship between node units from the statistical frequency of node co-occurrence. Information contained in such edges expresses the sequential dependency relationship between nodes. However, to guide graph neural networks to interact in a more interpretable way, it is necessary to integrate the graph with richer textual knowledge orientation. An existing method attempted to involve richer semantic and syntactic information in the edge weights by building a TensorGCN framework [23]. However, it realized the semantic representation with an LSTM embedding trained beforehand on the same training set, which is inefficient and performs poorly. Nowadays, unsupervised pre-trained language models have been shown to encode rich linguistic knowledge [24], which may offer a better way to build semantic and syntactic edges.
Pre-training neural networks to learn transferable knowledge for downstream tasks has recently been demonstrated effective for further improving representation performance [25]. In the field of NLP, diverse unsupervised pre-trained models based on the self-attention mechanism, e.g., BERT, have been proposed [16]. All these unsupervised pre-trained models encode contextualized information with rich semantic and syntactic knowledge [26]. However, such contextualized knowledge lacks sufficient induction and expression during the general representation process. Meanwhile, as a complementary kind of information, the unsupervised contextualized information offers the potential to be further fused with the global information learned by graph neural networks. Figure 1 gives an example that visualizes the necessity and complementarity of the two perspectives of information. An existing method attempted to combine BERT and graph-based models [27]. However, it only shallowly concatenates the representation results of BERT and a transductive graph-based model for text classification, which means that the two perspectives of information are not intra-integrated, and the application scenarios of the method are limited. Meanwhile, in view of the diversity of unlabeled corpora used for pre-training, for example in the biology domain, the potential of graph neural networks for domain-specific text classification tasks deserves further discussion.
Inspired by the recent progress described above, in this paper we propose a representation method that deeply integrates Unsupervised Semantics and Syntax into heterogeneous Graphs (USS-Graph) for inductive text classification. First, the potential semantics and syntax implicit in unsupervised pre-trained models are effectively induced with our designed perturbed masking strategy. Second, the unsupervised impact matrix is leveraged to introduce unsupervised semantic and syntactic knowledge into both edges and nodes, so that a fully unsupervised-knowledge-based USS-Graph is constructed. Then, as the information in the word-level and document-level nodes is globally propagated through an adapted relation-aware Gated Graph Neural Network (GGNN) [28, 29], the method realizes the intra-fusion of graph-based global information and unsupervised contextualized information inside the bidirectionally weighted heterogeneous graph structure. Finally, the representation of a distinct document can be aggregated and used for text classification. Furthermore, based on USS-Graph, a series of optimization measures are proposed to explore extended methods that can further optimize the knowledge integration and information complementation from different perspectives. The main contributions of this work are summarized as follows:
  • We propose an inductive text classification method with unsupervised semantics and syntax deeply integrated into graph representations. With nodes and edges totally constructed based on unsupervised semantics and propagated in USS-Graph, our method realizes the intra-fusion of graph-based global information and unsupervised contextualized information inside the bidirectionally weighted heterogeneous graph structure.
  • A series of optimization measures are proposed to further improve the knowledge integration of graph networks and unsupervised pre-trained models from diverse perspectives based on USS-Graph.
  • Extensive experiments are conducted on five English benchmark datasets and one Chinese dataset, and demonstrate that our proposed method achieves state-of-the-art performance in inductive text classification tasks.
  • Extended experiments are conducted to deeply analyze the characteristics of USS-Graph and the effectiveness of our proposed optimization measures for further knowledge integration and information complementation.
Related work

In recent years, diverse research on the two cutting-edge text representation methods, unsupervised pre-trained models and graph neural networks, has been conducted; both present beneficial characteristics that contribute to better representation performance through different modelling mechanisms. However, both types of existing methods meet obstacles to further performance improvements, which reveals the necessity of in-depth knowledge integration and information complementation.

Graph neural network

Graph neural networks have become an influential representation learning method and are widely applied in diverse communities for feature extraction [30, 31]. By mapping data to a topological space, graph neural networks are better able to represent non-Euclidean data [25]. A graph consists of node and edge features. By following a message passing paradigm, graph neural networks can leverage the graph connectivity and realize global information interaction during representation learning.
In the field of text classification, a series of graph-based representation methods with different processing schemes have been proposed. Regarding the richness of information types within a graph, graph neural networks can be divided into homogeneous and heterogeneous graphs [20, 32]. Specifically, homogeneous graph neural networks contain only one type of node and edge, e.g., document-level or word-level nodes, while heterogeneous graph neural networks contain two or more types of nodes and edges and thereby carry richer textual information. Regarding the data involved in the graph construction stage, graph neural networks can be divided into transductive and inductive learning [22]. Specifically, transductive graph neural networks require all the data for prior graph construction, which is relatively redundant, while inductive graph neural networks can treat different datasets more flexibly and distinctively. Regarding the granularity of edge features, the edges of coarse-grained graph neural networks are constructed without weights and directions, while those of fine-grained graph neural networks tend to be weighted and bidirectional, so that fine-grained edges express richer connectivity information [33].
Currently, one issue of graph-based representation methods that remains to be addressed is the lack of knowledge orientation during global interaction, since a coarse-grained and randomly initialized graph cannot efficiently mine the knowledge required by a particular text classification task. To guide the representation process along a more appropriate path, integrating additional views of textual knowledge [26, 34], e.g., the semantics, sequence, and syntax of text, into graph modeling is known to be an effective solution and deserves further study.

Unsupervised pre-trained model

To date, unsupervised pre-trained models following the pre-training and fine-tuning paradigm have shown superior performance and attracted considerable research interest in the NLP community. Benefiting from self-supervised learning on huge amounts of corpora, unsupervised pre-trained models have been demonstrated to contain rich contextualized knowledge. For example, BERT [16], one of the most representative unsupervised models, adopts two self-supervised learning strategies: masked language modeling (MLM) and next sentence prediction (NSP). As Rogers et al. [26] summarized, multiple views of knowledge have been encoded into BERT weights, including semantic and syntactic knowledge. Semantically, BERT representations encode tokens with knowledge about entity types and semantic roles [35]. Syntactically, BERT representations are hierarchical rather than linear, so that the tokens exhibit a syntactic tree structure [24].
However, existing works tend to utilize BERT only for embedding initialization, transforming context into low-dimensional vectors [36]. Such an operation is explicit and shallow, which means that the intrinsic knowledge is not sufficiently induced and expressed. For example, according to the experimental results of Sentence-BERT [37], a BERT-based text representation method for semantic textual similarity, unsupervised pre-trained models such as BERT and RoBERTa may perform rather poorly when used directly: utilizing the average embeddings or the [CLS] token of un-fine-tuned BERT both perform worse than utilizing average GloVe embeddings on semantic textual similarity. Therefore, the contrast between the performance limitations of existing methods and the multiple views of knowledge encoded in BERT weights indicates that the rich semantic and syntactic knowledge implicit in unsupervised pre-trained models lacks sufficient induction and expression. How to skillfully extract the intrinsic knowledge from these unsupervised pre-trained models for better text representation thus remains an open problem.

Method

Text classification based on USS-Graph representation can be decomposed into three stages: graph construction integrating unsupervised knowledge in perspective of both nodes and edges, graph-based global information interaction, and downstream readout of the representation outputs. In this section, we first illustrate the implementation of the method stage by stage, and the whole framework is shown in Figure 2. Then, we propose some optimization measures to further improve USS-Graph.

USS-Graph construction

Graph construction is the kernel stage of the USS-Graph representation method. A graph is denoted as \(G = (V, E)\), where \(V(|V|= n)\) is the set of graph nodes and E is the set of graph edges. Inspired by the study on the interpretability of BERT [24], to sufficiently induce the unsupervised semantic and syntactic knowledge and to fully incorporate unsupervised contextualized information into the graph structure, we redesign the construction strategies for the only two components of the graph: the nodes and the edges. As a result, every document owns its distinct USS-Graph for inductive learning.

Preliminary: perturbed masking for token impact

Benefiting from representation learning on large-scale corpora, unsupervised pre-trained models exhibit a strong grasp of textual semantics and syntax and can thus better express the relationships between contexts. Therefore, we adopt a perturbed masking method to induce the textual impact among tokens in the text.
Here we take BERT as a representative unsupervised pre-trained model. Given a sentence tokenized as a list \(x = [x_1,\dots , x_T ]\), BERT maps each token \(x_i\) into a contextualized representation \(H_\theta (x)_i\), where \(\theta \) denotes the network's parameters. The representation changes as the context and position information change, and can thus reflect the degree of interdependence between tokens. To capture the dependency of \(x_i\) on \(x_j\) in the context, we first replace \(x_i\) with the [MASK] token and feed the new sequence \(x\setminus \{x_i\}\) into BERT, obtaining \(H_\theta (x\setminus \{x_i\})_i\) as a representation of \(x_i\). This representation contains the impact of the whole context on \(x_i\). Then, to further reflect the impact of \(x_j\) on \(x_i\), we replace both \(x_i\) and \(x_j\) with the [MASK] token and feed the new sequence \(x\setminus \{x_i, x_j\}\) into BERT, obtaining \(H_\theta (x\setminus \{x_i, x_j \})_i\) as a new representation of \(x_i\). This representation contains the impact of the whole context except \(x_j\) on \(x_i\). As a result, the distance between the two representations of \(x_i\) represents the dependency of \(x_i\) on \(x_j\). We define the function \(f(x_i, x_j)\) denoting the impact a context token \(x_j\) has on the prediction of another token \(x_i\) as below:
$$\begin{aligned} f(x_i,x_j) = dis(H_\theta (x\setminus \{x_i\})_i, H_\theta (x\setminus \{x_i,x_j\})_i ) \end{aligned}$$
(1)
where dis(x, y) is the distance metric of the representation difference; here we use the Euclidean distance. The larger the distance, the greater the impact of \(x_j\) on \(x_i\).
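As a minimal sketch of how the token-level impact of Eq. (1) could be computed, the snippet below uses an off-the-shelf BERT via the HuggingFace transformers library; the helper names (hidden_states, impact) and the example sentence are illustrative, not taken from the paper.

```python
# Sketch: token-level perturbed masking impact f(x_i, x_j), assuming
# HuggingFace transformers; helper names are illustrative.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").eval()

def hidden_states(token_ids, mask_positions):
    """Encode the sequence with the given positions replaced by [MASK]."""
    ids = token_ids.clone()
    ids[0, mask_positions] = tokenizer.mask_token_id
    with torch.no_grad():
        return model(ids).last_hidden_state  # shape (1, T, H)

def impact(token_ids, i, j):
    """f(x_i, x_j): Euclidean distance between the representation of x_i
    when only x_i is masked and when both x_i and x_j are masked (Eq. 1)."""
    h_i = hidden_states(token_ids, [i])[0, i]
    h_ij = hidden_states(token_ids, [i, j])[0, i]
    return torch.dist(h_i, h_ij).item()

# Example usage on a short sentence
enc = tokenizer("I love strawberry milk", return_tensors="pt")
print(impact(enc["input_ids"], i=2, j=3))
```

Masking position i in both passes keeps the target token itself out of the encoding, which is the motivation for the dual masking strategy analyzed later in the experiments.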

Edge construction

Differing from traditional methods that statistically construct word-word edges through PMI and word-document edges through TF-IDF, we utilize unsupervised pre-trained models to extract inter-word information, so as to assess the semantic and syntactic relationships between words and between words and the document.
Since USS-Graph is constructed with nodes in units of words and documents, to assess word-level and document-level dependency relationships for edge construction, we similarly replace the consecutive tokens \([x_m : x_n] (m<n)\) that constitute the word \(w_i\) to be assessed with [MASK] tokens and then obtain the contextualized representation from BERT. The new representation of \(w_i\) is calculated as the average of the corresponding token representations. The impact assessment function is modified as below:
$$\begin{aligned} f(w_i,w_j) = dis(Avg(H_\theta (x\setminus \{x_{m:n}\})_{m:n}), Avg(H_\theta (x\setminus \{x_{m:n},x_{p:q}\})_{m:n})) \end{aligned}$$
(2)
where \(x_{m:n}\) and \(x_{p:q}\) correspond to tokens of \(w_i\) and \(w_j\), respectively.
In terms of assessing the word-document dependency relationship, we utilize the [CLS] token as the document representation. Therefore, the impact assessment functions are modified as below:
$$\begin{aligned} f(w_i,d) = dis(Avg(H_\theta (x\setminus x)_{m:n}), Avg(H_\theta (x\setminus \{x_{m:n}\})_{m:n})) \end{aligned}$$
(3)
$$\begin{aligned} f(d,w_i) = dis(H_\theta (x)_{[CLS]}, H_\theta (x\setminus \{x_{m:n}\})_{[CLS]}) \end{aligned}$$
(4)
where \(f(w_i,d)\) assesses the impact of the document d on word \(w_i\) and \(f(d,w_i)\) assesses the impact of word \(w_i\) on the document d, respectively.
The differences between the two formulas result from the use of the [CLS] token as the document-level node representation. For document-word dependency, we conduct a single masking strategy: we first obtain the encoding of the [CLS] token with every token involved, and then mask the tokens forming the word to obtain a new encoding of the [CLS] token. The distance between the two [CLS] representations is the impact of the word on the document. For word-document dependency, we first mask the tokens of the word and obtain the word representation by averaging its encoded tokens. Then, we mask all the tokens and again obtain the word representation by averaging its encoded tokens. The distance between the two word representations is the impact of the document on the word.
By going through all the word-word and word-document pairs in a context, an impact matrix \(M\in R^{(|W|+1)\times (|W|+1)}\) can be constructed, where \(|W|\) denotes the number of words in the context. We thereby obtain an adjacency-matrix-like matrix that contains the semantic and syntactic relationships between the words and the document. To illustrate the interpretability of the impact matrix, we visualize an example, "For those who follow social media transitions on Capitol Hill, this will be a little different". As shown in Figure 3, potential semantic and syntactic dependencies expressed in the impact matrix are marked.
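A sketch of the word-level variant of Eq. (2) is shown below, reusing the hidden_states helper and torch import from the earlier snippet; words are assumed to be given as inclusive token-index spans, and the span argument names are illustrative.

```python
# Sketch: word-level impact f(w_i, w_j) via span masking (Eq. 2),
# reusing hidden_states() from the previous sketch.
import torch

def span_impact(token_ids, span_i, span_j):
    """Impact of word w_j (tokens p..q) on word w_i (tokens m..n)."""
    (m, n), (p, q) = span_i, span_j
    mask_i = list(range(m, n + 1))
    mask_ij = mask_i + list(range(p, q + 1))
    # Average w_i's token representations under the two maskings
    h_i = hidden_states(token_ids, mask_i)[0, m:n + 1].mean(dim=0)
    h_ij = hidden_states(token_ids, mask_ij)[0, m:n + 1].mean(dim=0)
    return torch.dist(h_i, h_ij).item()

# Iterating span_impact over all word-word and word-document pairs fills the
# (|W|+1) x (|W|+1) impact matrix M described above.
```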
To make the impact matrix more applicable as an adjacency matrix, inspired by the sigmoid function, we propose a new activation function to filter the impact matrix and obtain the normalized adjacency matrix A. The activation function is presented as below:
$$\begin{aligned} a_{i,j} = \frac{1-e^{-m_{i,j}}}{1+e^{-m_{i,j}}}, i,j\in [1,|W|+1] \end{aligned}$$
(5)
where \(a_{i,j}\) and \(m_{i,j}\) denote the elements of the adjacency matrix \(A\in R^{(|W|+1)\times (|W|+1)}\) and the impact matrix \(M\in R^{(|W|+1)\times (|W|+1)}\), respectively, and \(|W|\) denotes the number of word nodes. The first \(|W|\times |W|\) block of A contains the edges between words in the context, and the \((|W|+1)\)-th row and column contain the edges between the words and the document.
Figure 4 shows the concrete edge construction process. Taking the sentence "I love strawberry milk but not strawberries." as an input example, the adjacency matrix is generated as shown above. Since the edge weights fully incorporate contextualized information, the edge weights of the same word "strawberry" differ as the context changes and thereby express more fine-grained semantic and syntactic relationships. The USS-Graph we propose is therefore a bidirectionally weighted, fully connected heterogeneous graph consisting of word-level and document-level nodes.
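As a small numerical sketch, the activation of Eq. (5) can be applied elementwise to the impact matrix; note that \((1-e^{-m})/(1+e^{-m})\) is simply \(\tanh(m/2)\). The function name below is illustrative.

```python
# Sketch: normalize the impact matrix M into the adjacency matrix A (Eq. 5).
import numpy as np

def normalize_impact(M: np.ndarray) -> np.ndarray:
    """a_ij = (1 - exp(-m_ij)) / (1 + exp(-m_ij)), i.e. tanh(M / 2)."""
    return (1.0 - np.exp(-M)) / (1.0 + np.exp(-M))

# A is asymmetric in general, so the graph is bidirectionally weighted:
# A[i, j] and A[j, i] carry different weights.
```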

Node construction

Constructing nodes for a graph involves two perspectives of work: node initialization and node selection, both of which are realized through unsupervised semantic and syntactic knowledge for USS-Graph.
In terms of node initialization, based on unsupervised pre-trained models, we take the contextualized [CLS] token representation as the initialization of the document-level node feature, and take the average of the contextualized constituent tokens as the initialization of the word-level nodes. Compared with the traditional initialization that uses pre-trained GloVe representations as word-level node features, node initialization based on an unsupervised pre-trained model is more flexible and contains richer contextualized semantic knowledge.
In terms of node selection, USS-Graph is a heterogeneous graph consisting of word-level and document-level nodes. For English text, selecting word-level nodes is straightforward. However, for other languages, for example Chinese, where the words in a sentence consist of unsegmented characters, word segmentation is essential. Similar to the idea used in edge construction, we use the impact matrix to realize word segmentation. Taking Chinese contexts as an example, by feeding the Chinese tokens into BERT, we obtain character-level representations to calculate the impact matrix \(M \in R^{|C|\times |C|}\), which reflects the semantic and syntactic dependencies between characters. Then, we only need to consider the correlation between adjacent characters. Setting the function F as the correlation evaluation metric and the hyperparameter \(\alpha \) as the threshold value, we split between two adjacent characters whose correlation is smaller than the threshold and splice two characters whose correlation is greater than or equal to the threshold, thereby realizing unsupervised word segmentation for word-level nodes.
$$\begin{aligned} F = \frac{f(c_i,c_{i+1})+f(c_{i+1},c_i)}{2} \end{aligned}$$
(6)
where \(c_i\) denotes a Chinese character-level representation in the context.
Given that unsupervised pre-trained models such as Chinese BERT-wwm [38] adopt whole-word masking in the masked language modeling (MLM) task during pre-training, word segmentation based on unsupervised semantic and syntactic knowledge can be expected to perform well. In Figure 5, we build a syntactic parsing tree of a Chinese context from the character-level impact matrix, which provides visual support for this claim.
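The following is a minimal sketch of the unsupervised segmentation used for node selection, assuming a precomputed character-level impact matrix M and a hypothetical threshold alpha; the greedy left-to-right merging and the function name are illustrative.

```python
# Sketch: merge adjacent characters whose symmetric impact F >= alpha (Eq. 6).
def segment(chars, M, alpha):
    """chars: non-empty list of characters; M: |C| x |C| impact matrix."""
    words, current = [], chars[0]
    for i in range(len(chars) - 1):
        F = 0.5 * (M[i, i + 1] + M[i + 1, i])
        if F >= alpha:
            current += chars[i + 1]   # splice: the word continues
        else:
            words.append(current)     # cut: start a new word
            current = chars[i + 1]
    words.append(current)
    return words
```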

Graph-based global information interaction

With every distinct graph constructed, we utilize an adapted GGNN [28] to realize global information interaction between nodes. GGNN learns node representations through neural networks with gated recurrent units (GRU) [39], so that information from the neighborhood can be fused with a node's own representation. As the number of interaction steps t increases, the information fusion between nodes deepens continuously and finally achieves global interaction across the whole structure. At this point, graph-based global information and unsupervised contextualized information are truly intra-fused, and we finally obtain a USS-Graph representation for every document.
Different from the original GGNN, to make the adjacency matrix of USS-Graph more appropriate for node interaction at different steps, we apply a self-adaptive scheme that fine-tunes the adjacency matrix with weight \(W_a\) and bias \(b_a\) at each interaction step. Meanwhile, given that the original GGNN cannot differentially handle heterogeneous information during global information interaction, we design a relation-aware feature transformation \(M_{i,j}\in {\{M_0,M_1,M_2\}}\) to make full use of the relational types of edges, where \(M_0,M_1,M_2\) denote trainable parameters for word-word edges, word-document edges, and document-word edges, respectively. The detailed interaction process is as follows:
$$\begin{aligned} a^t&= (W_a A + b_a)Mh^{t-1}\end{aligned}$$
(7)
$$\begin{aligned} z^t&= \sigma (W_z a^t + U_z h^{t-1} + b_z)\end{aligned}$$
(8)
$$\begin{aligned} r^t&= \sigma (W_r a^t + U_r h^{t-1} + b_r)\end{aligned}$$
(9)
$$\begin{aligned} \tilde{h}^t&= tanh(W_h a^t + U_h (r^{t} \odot h^{t-1}) + b_h)\end{aligned}$$
(10)
$$\begin{aligned} h^t&= \tilde{h}^t \odot z^t + h^{t-1} \odot (1-z^t) \end{aligned}$$
(11)
where M is the relation-aware feature matrix, \(\sigma \) is the sigmoid function, and the parameters W, U and b are trainable weights and biases. \(z^t\) and \(r^t\) are the update gate and reset gate, respectively, which determine to what degree the neighborhood information contributes to the current node embedding.
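Below is a simplified PyTorch sketch of one gated update step following Eqs. (7)-(11). It is not the authors' implementation: the separate W/U matrix pairs are folded into single linear layers over concatenated inputs (mathematically equivalent), the three relation-specific transforms M_0, M_1, M_2 are collapsed into one matrix for brevity, and W_a, b_a are taken as scalars.

```python
# Sketch: one step of the adapted gated graph update (Eqs. 7-11).
import torch
import torch.nn as nn

class AdaptedGGNNCell(nn.Module):
    def __init__(self, hidden):
        super().__init__()
        self.W_a = nn.Parameter(torch.ones(1))           # adjacency scaling (Eq. 7)
        self.b_a = nn.Parameter(torch.zeros(1))          # adjacency bias (Eq. 7)
        self.M = nn.Linear(hidden, hidden, bias=False)   # relation-aware transform, collapsed
        self.z = nn.Linear(2 * hidden, hidden)           # update gate, folds W_z and U_z
        self.r = nn.Linear(2 * hidden, hidden)           # reset gate, folds W_r and U_r
        self.c = nn.Linear(2 * hidden, hidden)           # candidate state, folds W_h and U_h

    def forward(self, A, h_prev):                        # A: (N, N), h_prev: (N, H)
        a = (self.W_a * A + self.b_a) @ self.M(h_prev)                      # Eq. (7)
        z = torch.sigmoid(self.z(torch.cat([a, h_prev], dim=-1)))          # Eq. (8)
        r = torch.sigmoid(self.r(torch.cat([a, h_prev], dim=-1)))          # Eq. (9)
        h_tilde = torch.tanh(self.c(torch.cat([a, r * h_prev], dim=-1)))   # Eq. (10)
        return h_tilde * z + h_prev * (1 - z)                              # Eq. (11)

# Usage: apply the cell for t steps, e.g. h = cell(A, h) repeated twice
# for the two-layer setting used in the experiments.
```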

Representation readout

Having obtained the USS-Graph representations of the documents, we need to aggregate these graph-based representations and transfer them into final classification predictions in the readout stage. Inspired by the idea of TextING [22], the readout functions are as follows:
$$\begin{aligned} h_{w,d}&= \sigma (f_1(h_{w,d}^t))\odot tanh(f_2 (h_{w,d}^t)) \end{aligned}$$
(12)
$$\begin{aligned} h_G&= \frac{1}{|W|+1}\sum _{w\in W, d}{h_{w,d}}+Maxpool(h_1,\dots ,h_w,h_d) \end{aligned}$$
(13)
where \(f_1\) and \(f_2\) are two multi-layer perceptrons (MLPs) acting as a soft attention weight and a non-linear feature transformation, respectively. The node representations of the words and the document all contribute to the aggregation through an averaging function and a max-pooling function, while the nodes assigned higher weights by the attention mechanism contribute more.
The aggregated USS-Graph representation is then fed into a softmax layer to make the prediction. Parameters are trained with the cross-entropy loss.
$$\begin{aligned} \hat{y}_G&= softmax(Wh_G+b) \end{aligned}$$
(14)
$$\begin{aligned} Loss&= -\sum _i{y_{Gi} log(\hat{y}_{Gi})} \end{aligned}$$
(15)
where \(y_{Gi}\) denotes the i-th element of the one-hot label.
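A compact sketch of the readout of Eqs. (12)-(15) is given below, assuming the final node states are stacked as a matrix with the document node included; the module and attribute names are illustrative.

```python
# Sketch: attention-gated readout and classification (Eqs. 12-15).
import torch
import torch.nn as nn

class Readout(nn.Module):
    def __init__(self, hidden, num_classes):
        super().__init__()
        self.f1 = nn.Linear(hidden, hidden)       # soft attention weight (Eq. 12)
        self.f2 = nn.Linear(hidden, hidden)       # non-linear feature transform (Eq. 12)
        self.cls = nn.Linear(hidden, num_classes)

    def forward(self, h):                          # h: (N, H), word nodes + document node
        hv = torch.sigmoid(self.f1(h)) * torch.tanh(self.f2(h))   # Eq. (12)
        h_G = hv.mean(dim=0) + hv.max(dim=0).values               # Eq. (13): mean + max-pool
        return torch.log_softmax(self.cls(h_G), dim=-1)           # Eq. (14)

# Training minimizes the cross-entropy of Eq. (15), e.g. nn.NLLLoss() on the
# log-softmax output against the document's label index.
```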

Optimization measures

Apart from the implementation of USS-Graph, we also introduce some extended optimization measures, which we adopt to further improve the knowledge integration and representation performance of USS-Graph from different perspectives.

Fine-tuning unsupervised pre-trained models

Even though USS-Graph with only the pre-trained unsupervised model can achieve outstanding representation performance, fine-tuning the model on the target dataset before the graph construction stage can further improve the graph-based representation performance. However, this is somewhat inefficient in time and labor, and to some extent runs counter to our original intention of directly utilizing the unsupervised semantic and syntactic knowledge.

Convolutional embedding mode & truncation

Limited by the maximum input length of existing mainstream unsupervised pre-trained models, the raw USS-Graph can only process contexts shorter than 512 tokens. To deal with longer contexts, one intuitive way is to truncate the original token embeddings. The truncation can be applied to the trailing tokens of a context that exceed the maximum length, which is effective for contexts not much longer than the maximum length, for example, no more than 600 tokens. Alternatively, the truncation can be applied to the suffix tokens of each word, so that every word is represented by its first token. Both methods may cause some information loss as the context grows longer.
To obtain token embeddings from unsupervised pre-trained models for unlimited lengths, a convolutional embedding mode is applicable. Given a fixed window radius D, the representation of token i is calculated by feeding the sequence consisting of the tokens from D positions to the left of i to D positions to the right of i into the unsupervised pre-trained model. Theoretically, the maximum value of D can be set to 256, so that the representation of every token in the document is contextualized from the 512 tokens closest to it. Accordingly, the embedding function used to calculate the impact matrix during the graph construction stage becomes \(H_\theta (x_{i-D:i+D}\setminus \{x_{i}\})_{i}\). The impact of \(x_j\) on \(x_i\) is neglected and set to 0 if \(x_j\) is outside the 2D range of \(x_i\). The document-level representation is calculated as the average of the \(|x|//2D\) [CLS] tokens, where \(|x|\) denotes the number of tokens in the context. In this way, the semantic knowledge from unsupervised models can be sufficiently captured during node construction.
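The sketch below illustrates the sliding-window idea for token embeddings, reusing the model object from the first snippet; special-token handling ([CLS]/[SEP] insertion per window) and the document-level [CLS] averaging are omitted for brevity, so it is a simplified illustration rather than the paper's implementation.

```python
# Sketch: convolutional (sliding-window) embedding of arbitrarily long inputs.
import torch

def windowed_token_embeddings(token_ids, D=256):
    """token_ids: (1, T) tensor; each token is encoded from its 2D nearest tokens."""
    T = token_ids.size(1)
    reps = []
    for i in range(T):
        lo, hi = max(0, i - D), min(T, i + D + 1)
        with torch.no_grad():
            out = model(token_ids[:, lo:hi]).last_hidden_state  # (1, hi-lo, H)
        reps.append(out[0, i - lo])                             # representation of token i
    return torch.stack(reps)                                    # (T, H)
```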

Tensor USS-Graph

Inspired by TensorGCN [23], which constructs a graph based on three types of contextual knowledge (semantic, syntactic, and sequential), we extend the structure of USS-Graph by introducing sequential information during the graph construction stage. Thus, a Tensor USS-Graph with a two-tensor graph structure can be generated. The edge construction of the sequential tensor graph is based on PMI for word-word edges and TF-IDF for word-document edges, as shown below. Although this makes USS-Graph no longer an inductive learning method, the variant explores the potential to further enrich the representation information of USS-Graph.
$$\begin{aligned} A_{i, j}= {\left\{ \begin{array}{ll} \text {PMI}(i, j), &{} i, j \text{ are words and } i \ne j \\ \text {TF-IDF}(i, j), &{} i \text{ is a document, } j \text{ is a word} \\ 1, &{} i=j \\ 0, &{} \text{ otherwise } \end{array}\right. } \end{aligned}$$
(16)
Similar to the interaction strategy of TensorGCN, the interaction process of Tensor USS-Graph consists of two steps. First, intra-graph propagation is conducted, where the unsupervised-knowledge-based tensor graph and the sequential-knowledge-based tensor graph propagate separately. Second, inter-graph propagation is conducted. Since the node constructions in the two tensor graphs are consistent, we exchange the node vectors between the corresponding nodes of the two tensor graphs, so as to realize a further integration of semantic, syntactic, and sequential knowledge.
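As a rough sketch of the PMI part of the sequential adjacency in Eq. (16), the snippet below counts co-occurrences in a sliding window over tokenized documents; the window size is an illustrative choice, and the TF-IDF word-document weights could come from a standard implementation such as scikit-learn's TfidfVectorizer.

```python
# Sketch: PMI word-word weights from sliding-window co-occurrence statistics.
import math
from collections import Counter
from itertools import combinations

def pmi_matrix(docs, vocab, window=20):
    """docs: list of token lists; vocab: list of words to include as nodes."""
    idx = {w: k for k, w in enumerate(vocab)}
    win_count, pair_count, total = Counter(), Counter(), 0
    for doc in docs:
        for s in range(max(1, len(doc) - window + 1)):
            win = set(doc[s:s + window]) & idx.keys()
            total += 1
            win_count.update(win)
            pair_count.update(combinations(sorted(win), 2))
    A = [[0.0] * len(vocab) for _ in vocab]
    for (wi, wj), c in pair_count.items():
        pmi = math.log(c * total / (win_count[wi] * win_count[wj]))
        if pmi > 0:  # keep only positive PMI, as is common practice
            A[idx[wi]][idx[wj]] = A[idx[wj]][idx[wi]] = pmi
    return A
```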

Edge pruning

Given that a raw USS-Graph is a bidirectionally weighted, fully connected graph, on one hand the graph contains rich semantic and syntactic knowledge, and on the other hand noise may exist in the graph and bias the representation. Therefore, edge pruning is a necessary measure to keep the balance between fine granularity and bias. Given a threshold value p, after the impact matrix \(M \in R^{(|W|+1)\times (|W|+1)}\) is generated, we cut off the entries smaller than p and reset them to 0, considering that the impact of these edges is relatively small. In this way, the unsupervised knowledge orientation for graph interaction is expected to be more precise.
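A one-line sketch of this thresholding step is shown below; entries of the impact matrix below p are zeroed before the normalization of Eq. (5), and p is a tuned hyperparameter (the experiments later report a peak around p = 1 on MR).

```python
# Sketch: edge pruning of the impact matrix with threshold p.
import numpy as np

def prune_edges(M: np.ndarray, p: float) -> np.ndarray:
    return np.where(M < p, 0.0, M)
```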

Domain-specific pre-trained models

Unsupervised pre-trained models prevail in textual representation due to pre-training on large-scale unlabeled corpora, so the domain diversity of those corpora can contribute to better representation performance of pre-trained models on domain-specific tasks. Since the semantic and syntactic knowledge of USS-Graph in both nodes and edges is induced from unsupervised pre-trained models, and abundant domain-specific models have been released recently, such as clinicalBERT [40] and bioBERT [41], it is expected that USS-Graph representations can also learn domain-specific knowledge when the underlying unsupervised pre-trained model is changed.

Experiments

In this section, (1) we first evaluate the performance of our proposed USS-Graph and its optimized variant on text classification tasks and compare them with other representative models. (2) Then, we conduct ablation studies to test the contribution each component makes to the USS-Graph representation. (3) We also seek an appropriate threshold value p for edge pruning to balance fine granularity and bias of the information in the graph structure. (4) Finally, we demonstrate the adaptability of the USS-Graph representation to diverse domain-specific tasks.

Performance comparison

Table 1
Summary statistics of datasets

Dataset   | # Docs  | # Training | # Test | # Classes | Average Length | Language
MR        | 10,662  | 7108       | 3554   | 2         | 20.39          | English
R8        | 7674    | 5485       | 2189   | 8         | 65.72          | English
R52       | 9100    | 6532       | 2568   | 52        | 69.82          | English
Ohsumed   | 7400    | 3357       | 4043   | 23        | 135.82         | English
20NG      | 18,846  | 11,314     | 7532   | 20        | 221.26         | English
Weibo     | 35,000  | 23,333     | 11,667 | 50        | 7.47           | Chinese

Datasets

Consistent with the experiments of Yao et al. [20], we adopt five English benchmark datasets and one Chinese benchmark dataset for our text classification tasks: (1) the Movie Review dataset (MR), which classifies the sentiment of movie reviews as positive or negative; (2) the R8 Reuters dataset (R8) and (3) the R52 Reuters dataset (R52), which classify documents appearing on the Reuters newswire into 8 and 52 categories, respectively; (4) the Ohsumed dataset, which classifies cardiovascular disease abstracts into 23 disease categories; (5) the 20-Newsgroups dataset (20NG), which classifies news documents into 20 different categories; and (6) the Weibo dataset [42], which classifies a collection of Chinese microblog data into 50 categories. Summary statistics of the datasets are presented in Table 1.

Baselines

Following the literature on text classification, we compare the performance of USS-Graph and its optimized variant with representative baseline models, which can be categorized into five types: (1) traditional machine learning methods, for example TF-IDF + LR; (2) word embedding based methods, including PV-DBOW [43], PV-DM [43], PTE [44], fastText [45], SWEM [46], and LEAM [47]; (3) sequential deep learning methods, including TextCNN [12], LSTM [14], and Bi-LSTM; (4) graph-based deep learning methods, including Graph-CNN [48-50], TextGCN [20], TensorGCN [23], and TextING [22]; (5) unsupervised pre-trained model based deep learning methods, including fine-tuned BERT [16] and STGCN+BERT+Bi-LSTM [27].
Table 2
Test accuracy (%) of various models on the six datasets. The mean ± standard deviation of our model is reported over 10 runs. Note that some baseline results are taken from [20] and the corresponding original papers; "–" denotes a result that is not reported.

Model              | MR           | R8           | R52          | Ohsumed      | 20NG         | Weibo
TF-IDF + LR        | 74.59 ± 0.00 | 93.74 ± 0.00 | 86.95 ± 0.00 | 54.66 ± 0.00 | 83.19 ± 0.00 | 50.13 ± 0.00
CNN-rand           | 74.98 ± 0.70 | 94.02 ± 0.57 | 85.37 ± 0.47 | 43.87 ± 1.00 | 76.93 ± 0.61 | –
CNN-non-static     | 77.75 ± 0.72 | 95.71 ± 0.52 | 87.59 ± 0.48 | 58.44 ± 1.06 | 82.15 ± 0.52 | 52.37 ± 0.54
LSTM               | 75.06 ± 0.44 | 93.68 ± 0.82 | 85.54 ± 1.13 | 41.13 ± 1.17 | 65.71 ± 1.52 | 51.42 ± 1.09
LSTM (pre-train)   | 77.33 ± 0.89 | 96.09 ± 0.19 | 90.48 ± 0.86 | 51.10 ± 1.50 | 75.43 ± 1.72 | –
Bi-LSTM            | 77.68 ± 0.86 | 96.31 ± 0.33 | 90.54 ± 0.91 | 49.27 ± 1.07 | 73.18 ± 1.85 | 52.21 ± 1.14
PV-DBOW            | 61.09 ± 0.10 | 85.87 ± 0.10 | 78.29 ± 0.11 | 46.65 ± 0.19 | 74.36 ± 0.18 | –
PV-DM              | 59.47 ± 0.38 | 52.07 ± 0.04 | 44.92 ± 0.05 | 29.50 ± 0.07 | 51.14 ± 0.22 | –
PTE                | 70.23 ± 0.36 | 96.69 ± 0.13 | 90.71 ± 0.14 | 53.58 ± 0.29 | 76.74 ± 0.29 | –
fastText           | 72.17 ± 1.30 | 86.04 ± 0.24 | 71.55 ± 0.42 | 14.59 ± 0.00 | 11.38 ± 1.18 | 51.62 ± 0.76
fastText (bigrams) | 76.24 ± 0.12 | 94.74 ± 0.11 | 90.99 ± 0.05 | 55.69 ± 0.39 | 79.67 ± 0.29 | 52.39 ± 0.42
SWEM               | 76.65 ± 0.63 | 95.32 ± 0.26 | 92.94 ± 0.24 | 63.12 ± 0.55 | 85.16 ± 0.29 | –
LEAM               | 76.95 ± 0.45 | 93.31 ± 0.24 | 91.84 ± 0.23 | 58.58 ± 0.79 | 81.91 ± 0.24 | –
Graph-CNN-C        | 77.22 ± 0.27 | 96.99 ± 0.12 | 92.75 ± 0.22 | 63.86 ± 0.53 | 81.42 ± 0.32 | –
Graph-CNN-S        | 76.99 ± 0.14 | 96.80 ± 0.20 | 92.74 ± 0.24 | 62.82 ± 0.37 | –            | –
Graph-CNN-F        | 76.74 ± 0.21 | 96.89 ± 0.06 | 93.20 ± 0.04 | 63.04 ± 0.77 | –            | –
Text GCN           | 76.74 ± 0.20 | 97.07 ± 0.10 | 93.56 ± 0.18 | 68.36 ± 0.56 | 86.34 ± 0.09 | 53.37 ± 0.35
TensorGCN          | 77.91 ± 0.07 | 98.04 ± 0.08 | 95.05 ± 0.11 | 70.11 ± 0.24 | 87.74 ± 0.05 | 54.19 ± 0.26
TextING            | 79.82 ± 0.20 | 98.04 ± 0.25 | 95.48 ± 0.19 | 70.42 ± 0.39 | –            | 54.41 ± 0.37
fine-tuned BERT    | 83.62 ± 0.23 | 95.92 ± 0.15 | 95.53 ± 0.13 | 69.34 ± 0.21 | 84.22 ± 0.18 | 56.20 ± 0.20
STGCN+BERT+Bi-LSTM | 82.46 ± 0.72 | 97.42 ± 0.69 | –            | –            | –            | 57.15 ± 0.64
USS-Graph          | 86.07 ± 0.13 | 98.51 ± 0.17 | 96.44 ± 0.11 | 72.78 ± 0.07 | 89.32 ± 0.19 | 58.11 ± 0.27
Tensor USS-Graph   | 86.23 ± 0.16 | 98.58 ± 0.19 | 96.47 ± 0.11 | 73.65 ± 0.12 | 89.63 ± 0.21 | 58.96 ± 0.32

Bold values in the original table denote that the performance of our proposed method is state-of-the-art
Table 3
Precision, Recall, and F1 (%) of USS-Graph and representative baseline methods on MR and R8

Model              | MR: Precision | MR: Recall | MR: F1 | R8: Precision | R8: Recall | R8: F1
Text GCN           | 76.41         | 76.39      | 76.38  | 92.28         | 91.90      | 92.03
TensorGCN          | 77.91         | 77.88      | 77.90  | 93.47         | 93.72      | 93.60
TextING            | 79.76         | 79.74      | 79.75  | 93.45         | 93.75      | 93.58
Fine-tuned BERT    | 83.62         | 83.59      | 83.61  | 94.61         | 94.72      | 94.68
STGCN+BERT+Bi-LSTM | 82.45         | 82.48      | 82.46  | 94.79         | 95.03      | 94.87
USS-Graph          | 86.10         | 86.07      | 86.08  | 95.54         | 95.69      | 95.61
Tensor USS-Graph   | 86.25         | 86.22      | 86.23  | 95.61         | 95.74      | 95.68

Bold values in the original table denote that the performance of our proposed method is state-of-the-art

Experiment settings

To reflect the representation performance of USS-Graph in a general way, we use BERT-base as the unsupervised pre-trained model to construct the USS-Graph rather than more recent models. Contexts longer than 512 tokens are processed with the convolutional embedding mode, with D set to 256. A two-layer adapted GGNN is applied in the global information interaction stage, with the learning rate initialized to 0.0005 using Adam [51] and the dropout rate set to 0.5. Before the graph construction stage, we first fine-tune the BERT model on the training set with the learning rate initialized to 0.0001. For all the datasets, we randomly select 10% of the training set as the validation set and use the rest for training.

Result analysis

The performances of our models and the baselines are presented in Tables 2 and 3. Both USS-Graph and its variant Tensor USS-Graph outperform the baselines on all datasets and achieve state-of-the-art performance on inductive text classification. Specifically, the accuracy of USS-Graph is higher than those of TextING and fine-tuned BERT, whose representations are learned from global textual information and contextualized information, respectively, which demonstrates that our proposed method effectively intra-integrates these two types of information in the graph structure. Results on the Weibo dataset show that, with the unsupervised-knowledge-based node selection applied, USS-Graph also performs well on languages other than English. Comparing the accuracy results across datasets, we observe that as the average document length of the datasets increases (from MR to 20NG), this performance advantage expands further. One explanation is that the interaction between contextualized and global information becomes more complex and frequent in the graph structure, and the semantic and syntactic knowledge from the unsupervised pre-trained model further enriches the interacted information. Meanwhile, by introducing sequential information into USS-Graph, the accuracy of Tensor USS-Graph improves further, which demonstrates that USS-Graph can acquire multiple kinds of knowledge when constructed in a tensor way.

Characteristic analysis

We also conduct a series of extended experiments to deeply analyze the potential characteristics of USS-Graph and the effectiveness of our proposed optimization measures for knowledge integration.
Table 4
Accuracy (%) of USS-Graph on MR and Ohsumed with different ablation study settings

Model                             | MR    | Ohsumed
USS-Graph                         | 86.07 | 72.78
fine-tuning (w/o)                 | 85.81 | 72.67
unsupervised knowledge edge (w/o) | 84.63 | 71.24
unsupervised knowledge node (w/o) | 80.13 | 70.92
Table 5
Test accuracy (%) of USS-Graph on MR and Ohsumed with different perturbed masking settings

Setting                              | MR           | Ohsumed
Single masking operation (intuitive) | 85.26 ± 0.27 | 72.46 ± 0.19
Dual masking operation (ours)        | 86.07 ± 0.13 | 72.78 ± 0.07

Ablation study

We conduct ablation studies to explore the contribution of each operation step to the final representation performance of USS-Graph, including the BERT fine-tuning step before graph construction and the unsupervised-knowledge-based edge and node construction steps. Experiments are conducted on MR and Ohsumed, and the results are shown in Table 4.
First, we cut off the unsupervised-knowledge-based edge and node construction, respectively, by replacing them with sequential co-occurrence edges and pre-trained GloVe embeddings [52]. The results show that the two construction perspectives both contribute to the final representation and are complementary to each other, which can be attributed to the effective semantic and syntactic knowledge orientation they provide. Meanwhile, given the differences between the datasets, the relative contribution of the two construction perspectives varies.
Second, fine-tuning BERT beforehand to enhance its contextualized information is beneficial to the final representation, although this operation is somewhat inefficient. Furthermore, USS-Graph, even without fine-tuning, performs very well, which demonstrates that the graph architecture designed in USS-Graph does have a strong ability to induce and express the semantic and syntactic knowledge implicit in unsupervised pre-trained models.
Specifically, we have noticed the related statements in the work of Sentence-BERT [37] that directly using the output of BERT leads to rather poor performance, and that utilizing the average embeddings or the [CLS] token of un-fine-tuned BERT both perform worse than utilizing average GloVe embeddings on semantic textual similarity. However, the setting of that work differs from ours. First, Sentence-BERT only uses the [CLS] token or the average embedding of un-fine-tuned BERT as the representation of a sentence, which loses both the context and the granularity of the original information. In contrast, in the graph construction stage of USS-Graph, we use every distinct token from BERT to integrate word-level and document-level information into both edge and node construction. Second, the task of Sentence-BERT is semantic textual similarity, which maps a pair of sentence embeddings to a similarity score, while the task of USS-Graph is text classification. Compared with text classification, we believe that semantic textual similarity places higher requirements on how well sentences are differentiated in the vector space. Third, in Sentence-BERT, the representations of two sentences from un-fine-tuned BERT are directly compared via a similarity calculation without further processing. In contrast, in USS-Graph, we deeply integrate the fine-grained BERT-based contextualized information into a graph neural network through both edge and node construction to realize further global information interaction.
Therefore, the good performance of USS-Graph without fine-tuning BERT is reasonable. Moreover, these results also support our initial claim that USS-Graph, through the global information interaction of a graph neural network, realizes the induction and expression of the rich semantic and syntactic knowledge implicit in unsupervised pre-trained models, thereby further improving text classification performance.

The effect of perturbed masking strategy

Regarding the detailed design of the perturbed masking for assessing dependency relationships in the graph construction stage, we utilize a dual masking strategy that masks both assessed nodes in two steps. Another intuitive strategy is to mask \(w_j\) only once and compute the distance between the representations of \(w_i\) before and after masking \(w_j\) to assess the impact of \(w_j\) on \(w_i\). To demonstrate the rationality of our applied strategy, we conduct an ablation study on MR and Ohsumed to compare the two perturbed masking strategies. As shown in Table 5, our dual masking operation performs better than the intuitive single masking operation. An explanation is that the tokens of \(w_i\) themselves carry rich information; under the intuitive strategy this information is involved in the encoding and acts as noise that interferes with an accurate assessment of the impact from the context. Therefore, our applied perturbed masking method is more effective for dependency relationship assessment.

The effect of edge pruning

Edge pruning is proposed to keep the balance between fine granularity and bias for the information interaction inside the graph structure. To verify this hypothesis, we conduct an experiment on MR. The distribution of word-pair distances in the impact matrices of MR and the results of the pruning adjustment are shown in Figure 6. The accuracy peaks when the parameter p is set to 1, which to some extent corresponds to the distance distribution and confirms the effectiveness of the edge pruning measure.
It can be interpreted that the unsupervised knowledge orientation for graph interaction is further refined through edge pruning operation, since edge pruning can tease out knowledgeable and precise interaction paths for the bidirectional weighted and fully connected graph.

Adaptability on domain-specific tasks

Table 6
Accuracy (%) of USS-Graph on Ohsumed and MR with different unsupervised pre-trained models

Model        | Ohsumed | MR
BERT-base    | 72.78   | 86.07
BioBERT      | 73.06   | 84.37
ClinicalBERT | 73.25   | 84.72
Unsupervised textual models can acquire domain knowledge through pre-training on large-scale domain-specific corpora, which gives graph-based models the potential to inherit domain-specific representation ability. To examine the adaptability of USS-Graph to domain-specific tasks, we select two medical-domain pre-trained models, clinicalBERT [40] and bioBERT [41]. Since Ohsumed is a medical benchmark dataset, we compare the accuracy of USS-Graph with these two domain-specific models against USS-Graph with BERT-base on Ohsumed, with MR as a contrast. The results in Table 6 show that USS-Graph with clinicalBERT or bioBERT outperforms the raw version on medical-domain text, which supports our hypothesis that, benefiting from the domain-specific knowledge transferred from unsupervised pre-trained models, USS-Graph can better adapt to domain-specific text classification tasks.

Case study for the underlying mechanism

It is believed that contextualized information originates from local consecutive word sequences in a document, while global information originates from long-distance individual words; integrating the two perspectives of information therefore further enriches the text representation. To explore the underlying contributions of unsupervised models and graph neural networks in USS-Graph, we conduct a case study that visualizes the example sentence "Successfully blended satire, high camp and yet another sexual taboo into a really funny movie." under three different representation mechanisms. The first visualization is based on a BERT + Attention framework. The second is based on a graph-based framework with GloVe embeddings and co-occurrence edges. The third is based on the USS-Graph framework. All visualizations are rendered according to the attention weights in the three frameworks. As shown in Figure 7, the BERT-based mechanism focuses more on local consecutive sequences, while the contributions of individual words are hard to distinguish; the GNN-based mechanism focuses more on individual words, while the contextualized environment is not fully considered. As the integration of the two mechanisms, the USS-Graph-based mechanism leverages both long-distance individual words and their contextualized sequences.

The effect of convolutional embedding mode

The convolutional embedding mode is proposed to deal with the textual length limitation of BERT, since existing pre-trained BERT can process at most 512 tokens in one context. To demonstrate its effectiveness, we conduct an ablation study of USS-Graph under different textual processing strategies for BERT. The results are shown in Figure 8. As the average length of the datasets increases, the gap between the performance of USS-Graph with the convolutional embedding mode and USS-Graph with truncation gradually widens. With the convolutional embedding mode applied, the performance of USS-Graph is less sensitive to the textual length of the dataset.

Conclusion

In this paper, we propose the USS-Graph representation method for inductive text classification. By constructing a bidirectionally weighted heterogeneous graph structure that deeply integrates unsupervised semantics and syntax, USS-Graph realizes the intra-fusion of graph-based global information and unsupervised contextualized information. Furthermore, we propose a series of optimization measures to improve the knowledge integration of USS-Graph. In this way, the semantic and syntactic knowledge implicit in unsupervised pre-trained models is sufficiently induced, and the global interaction of graph neural networks is knowledgeably oriented. Experiments demonstrate the state-of-the-art performance of our method, and the adaptability of USS-Graph to domain-specific text classification tasks is also demonstrated. With the continuous development of unsupervised representation models, text representation based on the USS-Graph method is expected to become increasingly distinguished. In view of the feasibility of integrating and complementing the two cutting-edge text representation methods, in future work we plan to further explore a "human-in-the-loop" framework based on our proposed USS-Graph. By designing a series of tuning strategies, e.g., instruction tuning and prompt tuning, we expect to involve human feedback in the text representation framework so as to dynamically instruct the graph modelling and unsupervised knowledge induction. Thereby, our proposed method could better fit domain-specific and human intention-specific text classification tasks in complex human-computer interaction scenarios.

Acknowledgements

This work was supported in part by the National Key R&D Program of China under Grant 2021ZD0113404, the Beijing Natural Science Foundation under Grant M22012, and the BUPT Excellent Ph.D. Students Foundation under Grant CX2022133.

Declarations

Conflict of interest

The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
