Introduction
Research into network-based approaches in machine learning has grown alongside the networks produced by users on commercial online social networking platforms. The ways users engage with each other directly on these platforms, through independent activities, have given rise to networks with billions of users (nodes) [1]. The information provided by the users can be used to investigate different phenomena. The interlinking information that produces edges, which can then form a network or a series of networks, is studied in the growing field of network science [2, 3]. Network science offers many tools and approaches for drawing insight about community structures [4] and the centrality distribution of the nodes (users) [5], and it provides models for how networks can grow from an initial set of nodes [6, 7]. These insights have many applications, such as targeted advertising [8], where brands seek to have highly central nodes spread advertising content, and efforts to understand the 'landscape' of political polarization between communities on Twitter [9].
Beyond the interlinking information that produces a network (graph), the attributes of the users can provide useful information for improving predictive analytics. A notable technique used by online shopping platforms is collaborative filtering [10], which works as a recommendation system for improving the shopping experience of customers by tailoring the product view to those items predicted to be of interest. These new link predictions are based upon past purchases and the historical records of other customers: 'clusters' form between groups of items in a multidimensional space of these choices [11, 12]. This relies on the observation that items are not selected independently of previous purchases, so there is information to be gained from the data collected [13]. Another area where predictive analytics uses such information is student performance [14]; logistic regression has been used where a model is required within a decision framework to predict achievement [15]. A key difference from the network-based approach is that these models exclude the information contained in the network and the way links influence the node of concern, which can play a key role in predicting economic behaviors [16]. In the effort to fuse these sources together, [17] discusses how user identity linkage across online social networks can be accomplished.
Online social networks have been at the forefront of interest in networks, and in approaches that use interlinking information, due to their size and the effect they have on human behavior [18]. The topologies of these virtual networks find applications that also carry over into the physical world. The same techniques have been used in other domains, such as examining the centrality of streets in urban spaces [19], which can be seen as a continuation of the original graph-theoretical formulation in Euler's investigation of the 'Seven Bridges of Königsberg' problem [20]. As these urban networks fit within networks of urban spaces themselves, multilayer networks [21] are produced that span the globe, allowing analysis of even global migration patterns [22]. In a similar fashion, academic literature can be considered a network with similar properties governing its construction, such as homophily [23]. The nodes in such academic networks are publications, and the links are the citations between articles, which provide information about association. Active research in this field [24] notes a key motivation: researchers can spend considerable time searching for relevant work so as not to allocate effort to topics already explored with similar approaches. Finding associated research is important because work directed at areas already investigated wastes time, and also materials in studies requiring expensive lab equipment. Navigating the network to extract relevant research is therefore a key activity in preventing this. The work of [25] discusses how investigators can seek insight from these datasets into the dynamics of the growth and inter-connectivity of scientific thought. With the growth of citation datasets (such as those described in the "Data" section), processing time, model complexity, and the ability to interpret the results are becoming key concerns.
This poses key questions about how to process, and then reason about, results from large datasets with large variations; even interpreting the results requires effort. Work such as [26] looks at the problem from a conceptual perspective, considering the areas of focus for big data and how the user can interact with the data that result from an analysis. The work of [27] provides a high-level overview of the tools and approaches available for visually investigating the data and the results that different methods return. It is possible to include the full set of interpretable outcomes and the full set of relevant data features, but that presents the practitioner with the challenge of determining which features are of prime interest. A dimensionality reduction approach [28] provides a more effective experience for the practitioner.
Graph Neural Networks (GNNs) [29] provide a methodological framework for combining node features with network information to produce predictions within a machine learning paradigm. Applications range from image object positions in non-Euclidean space representations to molecular properties and citation networks [30]. This work therefore investigates a particularly simple GNN, the Simple Graph Convolution (SGC) [31], which has a simpler methodological definition while retaining competitive predictive accuracy. It is by developing a modified SGC that the task of reducing dimensionality in large datasets with a GNN will be explored. As will be shown in the "Methodology" section, the simplicity of the model allows it to serve as a basis for extensions that incorporate constraints such as shrinkage upon the parameters, in a manner similar to the regularization procedure of the Lasso [32]. This allows large, complex datasets to be processed in a way that is interpretable and more accessible in terms of the computational resources required. Models other than the SGC would layer further complexity onto already complex procedures, making large-dataset investigations increasingly challenging. This work effectively takes the SGC and extends it so that the model can introduce constraints upon the parameter vectors that make it more easily interpreted: the parameters for each class become sparser, and the classes overlap less with one another. Altogether this results in a parameter matrix that can be inspected more easily.
The "Related work" section acknowledges a selection of previous work that highlights the development of the GNN and led to the SGC. The "Data" section describes the datasets used to explore the results of the proposed modification of the SGC with regularization. The methodology is described in the "Methodology" section, where the SGC formalism is presented along with the proposed modification, in which regularization upon the features and their parameters reduces redundant features and hence eases interpretability. The results are displayed in the "Results" section, which shows how the SGC and the proposed SGC fit data that would not be linearly separable but is made separable by incorporating the graph information, followed by an application to a scientific citation dataset (Cora [33]).
Related work
Convolutional Neural Networks (CNNs) [34] brought a methodological approach for handling high-dimensional problems more efficiently than other paradigms. As noted in [35], CNNs in conjunction with deep learning have greatly improved the ability to classify sound and image data. The work of [36] formally introduces how graph-based methods can be used with CNNs. A key contribution of [36] is that the extension of the model to generalize to graphs is founded upon localized graph filters instead of the CNN's localized convolution filter (or kernel). It presents a spectral graph formulation showing how filters can be defined with respect to individual nodes in the graph within a certain number of 'hops'. These 'hops' correspond to the number of edges traversed between nodes and result from the powers of the adjacency matrix, from which the number of walks can be calculated [5] (walks are paths which allow node revisits).
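To make the 'hops' intuition concrete, the following is a minimal sketch (the 4-node path graph is an illustrative example, not taken from this paper) showing that entry (i, j) of the kth power of the adjacency matrix counts the walks of length k between nodes i and j:

```python
import torch

# Illustrative 4-node path graph: 0 - 1 - 2 - 3 (symmetric adjacency matrix).
A = torch.tensor([[0., 1., 0., 0.],
                  [1., 0., 1., 0.],
                  [0., 1., 0., 1.],
                  [0., 0., 1., 0.]])

# Entry (i, j) of A^k counts the walks of length k from node i to node j;
# walks may revisit nodes, unlike simple paths.
A2 = torch.linalg.matrix_power(A, 2)
A3 = torch.linalg.matrix_power(A, 3)
print(A2[0, 2])  # tensor(1.): the single 2-hop walk 0 -> 1 -> 2
print(A3[0, 1])  # tensor(2.): 0 -> 1 -> 0 -> 1 and 0 -> 1 -> 2 -> 1
```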
An introduction to the motivation from basic principles can be found in [37], which develops the fundamental analysis operations of signals from regular grids (lattice structures) to more general graphs. The authors of [38] utilize the theory of signals on graphs to show how a shift-invariant convolution filter can be formulated as a polynomial of adjacency matrices. How low-pass filters act as an underlying principle of the GNN is discussed in [39], and also described in the work on the SGC. [40] proposes the Graph Convolutional Network (GCN) by adapting CNNs for graph-structured data; the GCN learns a graph representation via layer-wise propagation rules that represent localized spectral filters.
The GNN allows a user's social network to be combined with their features to make a more accurate prediction; similarly, for an academic paper, the features (keywords or a low-dimensional representation) together with the citation links can more accurately place its relevance. The machine learning framework can introduce large processing-time overheads, especially for large datasets, but research has fortunately shown that simpler GNN models display peak performance [41]. The work of [40], which introduces a semi-supervised approach to GNNs, shows in its Appendix B how performance varies with the number of 'layers' employed in the model, and how performance actually degrades after a few layers. The SGC [31] provides an efficient framework that models the data associations similarly to the GCN while avoiding the layers the GCN requires. The methodology of the SGC, as shown in the "Methodology" section, requires a single layer of matrix computations with a non-linear activation function. This is similar to the processing steps of logistic regression, which can be computed very efficiently for large datasets. Building upon this efficient model allows an investigator to explore further constraints that would be much more computationally demanding with the incorporation of layers.
Methodology
[40] develops Graph Convolutional Networks (GCNs) by adapting Convolutional Neural Networks (CNNs) for graph-structured data, and the work of [31], proposing the Simple Graph Convolution (SGC), builds upon it. The SGC removes the non-linear transitions between the layers of the model. This simplification speeds up processing significantly yet still performs on par with GCNs and other state-of-the-art graph neural network models across multiple benchmark graph datasets. The modification proposed here applies a set of constraints so that the parameters fitted by the optimization procedure become easier to interpret. The constraints introduced into the loss function force the stochastic gradient descent algorithm to find directions with fewer non-zero values and less overlap between the parameters of the classes. This addresses the problem of how to effectively inspect the matrix of parameters and the parameter vectors of each class, and it takes inspiration from regularization methods.
We adopt the notation presented in [40] and [31] for the GCN and SGC respectively. A graph \(G = (V, {\mathbf {A}})\) is defined by a set of nodes (vertices) \(V = (v_1, v_2, \ldots, v_N)\) containing \(N\) nodes and an adjacency matrix \({\mathbf {A}} \in {\mathbb {R}}^{N\times N}\), where \(a_{ij}\) is the weight of the edge between nodes \(v_i\) and \(v_j\) (\(a_{ij} = 0\) if \(v_i\) and \(v_j\) are not connected). We define the degree matrix \({\mathbf {D}} = \text {diag}(d_1, d_2, \ldots, d_N)\) as a diagonal matrix whose off-diagonal elements are zero and whose diagonal elements capture the degree of each node, \(d_i = \sum _j a_{ij}\). There is a feature matrix (also referred to as the design matrix) \({\mathbf {X}} \in {\mathbb {R}}^{N \times D}\) whose rows \(x_i\) are the feature vectors measured on each node of the graph; each row holds the feature data belonging to a node, and each column corresponds to a different feature dimension. Each node \(i\) has a class label from \(C\) classes and can hence be coded as a one-hot vector \(y_i \in \{0,1\}^C\).
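A minimal sketch of this notation on a toy graph (the 3-node graph, feature values, and labels below are illustrative, not from the paper):

```python
import torch

# Toy graph with N = 3 nodes and edges 0 - 1, 1 - 2 (unweighted, a_ij is 0 or 1).
A = torch.tensor([[0., 1., 0.],
                  [1., 0., 1.],
                  [0., 1., 0.]])

# Degree matrix D: diagonal, with d_i = sum_j a_ij.
D = torch.diag(A.sum(dim=1))

# Feature (design) matrix X in R^{N x D}: one row per node, D = 2 features.
X = torch.tensor([[0.5, 1.0],
                  [1.5, 0.2],
                  [0.3, 0.8]])

# Class labels from C = 2 classes, one-hot coded: y_i in {0, 1}^C.
labels = torch.tensor([0, 1, 0])
Y = torch.nn.functional.one_hot(labels, num_classes=2).float()
```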
The GCN and SGC add self-loops and normalize the adjacency matrix to obtain the matrix \({\mathbf {S}}\):
$$\begin{aligned} {\mathbf {S}} = {{\tilde{\mathbf{D }}}}^{-\frac{1}{2}} {{\tilde{\mathbf{A }}}} {{\tilde{\mathbf{D }}}}^{-\frac{1}{2}} \end{aligned}$$
(1)
where \({{\tilde{\mathbf{A }}}} = {\mathbf {A}} + {\mathbf {I}}\) and \({{\tilde{\mathbf{D }}}}\) is the degree matrix of \({{\tilde{\mathbf{A }}}}\). This normalization prevents successive powers of the matrix from influencing the overall size of the projections.
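A minimal sketch of Eq. 1, reusing the toy adjacency matrix `A` from the notation sketch above (an assumed helper, not the authors' code):

```python
import torch

def normalize_adjacency(A: torch.Tensor) -> torch.Tensor:
    """Compute S = D~^{-1/2} A~ D~^{-1/2} with self-loops added (Eq. 1)."""
    A_tilde = A + torch.eye(A.shape[0])          # A~ = A + I (self-loops)
    d_tilde = A_tilde.sum(dim=1)                 # degrees of A~
    D_inv_sqrt = torch.diag(d_tilde.pow(-0.5))   # D~^{-1/2}
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

S = normalize_adjacency(A)
```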
The SGC removes the non-linear transformations between the \(K\) layers of the GCN, resulting in a linear model of the form:
$$\begin{aligned} \hat{\mathbf{Y }} = \text {softmax} ({\mathbf {S}} \ldots {\mathbf {S}}{\mathbf {S}}{\mathbf {X}}\varvec{\Theta }^{(1)}\varvec{\Theta }^{(2)} \ldots \varvec{\Theta }^{(K)}). \end{aligned}$$
(2)
The SGC classifier is then obtained by collapsing the repeated multiplication by the matrix \({\mathbf {S}}\) into its \(K\)th power \({\mathbf {S}}^K\) and reparameterizing the successive weight matrices as \(\varvec{\Theta } = \varvec{\Theta }^{(1)}\varvec{\Theta }^{(2)} \ldots \varvec{\Theta }^{(K)}\):
$$\begin{aligned} \hat{\mathbf{Y }} = \text {softmax} ({\mathbf {S}}^K {\mathbf {X}}\varvec{\Theta }). \end{aligned}$$
(3)
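Eq. 3 reduces the whole model to one precomputed matrix product followed by a linear map. A sketch under the notation above (the choice of K and the zero initialization of Theta are illustrative):

```python
import torch

K = 2                       # number of hops (illustrative choice)
num_classes = 2
D_features = X.shape[1]

# Precompute S^K X once: the only graph-dependent step in the SGC.
SKX = torch.linalg.matrix_power(S, K) @ X

# Theta in R^{D x C} is the single collapsed weight matrix.
Theta = torch.zeros(D_features, num_classes, requires_grad=True)

# Eq. 3: class probabilities per node.
Y_hat = torch.softmax(SKX @ Theta, dim=1)
```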
The parameter \(K\) corresponds to the number of 'hops', that is, the number of edge traversals in the network captured by the matrix \({\mathbf {S}}\); it can be thought of as accumulating information from a certain number of hops away from a node (as described visually in [31]). If \(K=0\) the methodology becomes equivalent to logistic regression, which is known to scale to large datasets, and since the SGC introduces the matrix \({\mathbf {S}}\) as a linear operation, the same scalability applies. The weight matrix \(\varvec{\Theta }\) is trained by minimizing the cross-entropy loss:
$$\begin{aligned} {\mathcal {L}} = -\sum _{l \in {\mathcal {Y}}_L} \sum _{c \in C } Y_{lc}\ln {\hat{Y}_{lc}} \end{aligned}$$
(4)
where \({\mathcal {Y}}_L\) is the collection of labeled nodes.
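A minimal training sketch for Eq. 4, continuing the tensors above (the optimizer settings and the choice of labeled nodes are illustrative, not prescribed by the paper):

```python
import torch

optimizer = torch.optim.SGD([Theta], lr=0.1)
labeled_idx = torch.tensor([0, 2])   # Y_L: indices of the labeled nodes

for epoch in range(100):
    optimizer.zero_grad()
    logits = SKX @ Theta
    # Cross-entropy over the labeled nodes only (Eq. 4);
    # cross_entropy applies the softmax internally, so raw logits are passed.
    loss = torch.nn.functional.cross_entropy(
        logits[labeled_idx], labels[labeled_idx])
    loss.backward()
    optimizer.step()
```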
As motivated in the "Introduction", the SGC shows how an efficient formulation of a GNN can be derived, but it does not also provide the ability to reduce the feature set. To reduce the number of parameter values, we introduce a flexible set of constraints as shrinkage operators in the loss of Eq. 4:
$$\begin{aligned} {\mathcal {L}}_R &= {\mathcal {L}} + L_1 \times \sum _{c \in C} \left( \sum _{d=1}^{D} |\varvec{\Theta }_{R(d,c)}|^{4} \right) ^{-1} + L_2 \times \sum _{c \in C}\Vert \varvec{\Theta }_{R(\cdot ,c)}\Vert _2 \nonumber \\ & \quad +L_3 \times \left( \sum _{c_{1} \in C} \sum _{c_{2} \in C} \left( |\varvec{\Theta }_{R(\cdot ,c_{1})}|^T \cdot |\varvec{\Theta }_{R(\cdot ,c_{2})}| : c_1 \prec c_2 \right) \right) \end{aligned}$$
(5)
The first component of \({\mathcal {L}}_R\) is the SGC loss \({\mathcal {L}}\). Next, the \(L_1\) term penalizes the number of parameters: the penalization is reduced as the distribution of elements within a column of \(\varvec{\Theta }_R\) becomes more skewed. Here \(|\varvec{\Theta }_{R(\cdot ,c)}|\) denotes the normalized vector of each class projection in the parameter matrix (the columns), and each of its elements is raised to the power of 4. The \(L_2\) term is the total magnitude of each parameter vector, so that only the norm is influential and not the distribution of the terms. The \(L_3\) term penalizes class label projections that have large overlaps, so that the vectors become orthogonal or, depending upon the value of \(L_3\), adopt opposing directions. The parameters of the regularized fit using this shrinkage in the loss will be referred to as \(\varvec{\Theta }_R\). To impose an orthogonality constraint between the projection vectors, the \(L_3\) term is modified:
$$\begin{aligned} {\mathcal {L}}_R & = {\mathcal {L}} + L_1 \times \sum _{c \in C} \left( \sum _{d=1}^{D} |\varvec{\Theta }_{R(d,c)}|^{4} \right) ^{-1} + L_2 \times \sum _{c \in C}\Vert \varvec{\Theta }_{R(\cdot ,c)}\Vert _2 \nonumber \\ & \quad +L_3 \times \left( \sum _{c_{1} \in C} \sum _{c_{2} \in C} \left( |\varvec{\Theta }_{R(\cdot ,c_{1})}|^T \cdot |\varvec{\Theta }_{R(\cdot ,c_{2})}| : c_1 \prec c_2\right) ^2\right) . \end{aligned}$$
(6)
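A sketch of how the penalties of Eqs. 5 and 6 could be added to the loss, following our reading of the three terms (the function name, the small epsilon guards against division by zero, and the structure are ours; the authors' exact implementation may differ):

```python
import torch

def regularized_loss(loss, Theta, L1, L2, L3, orthogonal=False):
    """Add the shrinkage terms of Eq. 5 (or Eq. 6 if orthogonal=True)."""
    eps = 1e-12  # numerical guard, not part of the written equations
    Theta_abs = Theta.abs()
    # L1 term: normalize each class column, raise elements to the 4th power,
    # and sum; this is large for skewed (sparse) columns, so its inverse
    # penalizes columns whose mass is spread evenly over many features.
    cols = Theta_abs / (Theta_abs.sum(dim=0, keepdim=True) + eps)
    l1_term = (cols.pow(4).sum(dim=0) + eps).reciprocal().sum()
    # L2 term: sum of the Euclidean norms of the class columns.
    l2_term = Theta.norm(dim=0).sum()
    # L3 term: overlap |Theta_c1|^T |Theta_c2| over class pairs c1 < c2.
    overlap = Theta_abs.T @ Theta_abs          # C x C Gram matrix
    pairs = torch.triu(overlap, diagonal=1)    # keep c1 < c2 only
    l3_term = (pairs.pow(2) if orthogonal else pairs).sum()
    return loss + L1 * l1_term + L2 * l2_term + L3 * l3_term
```

In Eq. 6 the overlap terms are squared (`orthogonal=True` above), which drives the class columns towards orthogonality rather than merely towards small or opposing overlap.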
This methodology therefore delivers a formulation that is based upon a layered approach, as other 'deep learning' frameworks provide, but without the computational burdens that come along with it. The simplified model implementation is therefore capable of being run on a personal computer with PyTorch [43].