Open Access | Published: 30 July 2022 | Regular Paper
Optimizing graph layout by t-SNE perplexity estimation
Published in: International Journal of Data Science and Analytics
Abstract
Perplexity is one of the key parameters of the dimensionality reduction algorithm t-distributed stochastic neighbor embedding (t-SNE). In this paper, we investigate the relationship between t-SNE perplexity and graph layout evaluation metrics, including graph stress, preserved neighborhood information, and visual inspection. As we find that a small perplexity is correlated with a relatively higher normalized stress while preserving neighborhood information with higher precision but less global structure information, we propose a method to estimate an appropriate perplexity, based either on a modified standard t-SNE or on the sklearn Barnes–Hut TSNE. Experimental results on a set of benchmark datasets demonstrate the effectiveness and ease of use of our approach.
1 Introduction
A particular class of graph layout methods is based on dimensionality reduction (DR) techniques, which embed original high-dimensional data into a two- or three-dimensional space so that the graph is visually readable [1, 2]. Examples of DR techniques successfully applied to graph layout include linear DR such as principal component analysis (PCA) [3], sampling-based approximation DR [4, 5], and nonlinear DR such as t-SNE [6-8], where the graph-theoretic distances between pairs of nodes in a graph serve as the original high-dimensional data for DR [3, 6, 9, 10]. The aim of the DR techniques based on the graph-theoretic distance is to minimize the difference between the graph-theoretic distance and the two-dimensional Euclidean distance, such that the two-dimensional layout reflects as much as possible the pairwise high-dimensional distances within a given neighborhood.
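For an unweighted graph, the graph-theoretic distance matrix that serves as the DR input can be computed by a breadth-first search from every node. A minimal self-contained sketch (our own illustration; the function name and the adjacency-list input format are assumptions, not the paper's code):

```python
from collections import deque

def graph_distance_matrix(adj):
    """All-pairs shortest-path (graph-theoretic) distances by BFS.

    adj: dict mapping node -> iterable of neighbor nodes (unweighted graph).
    Returns a dict-of-dicts D with D[u][v] = hop distance from u to v.
    """
    D = {}
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:          # first visit = shortest hop count
                    dist[v] = dist[u] + 1
                    queue.append(v)
        D[src] = dist
    return D

# A 4-cycle: opposite corners are 2 hops apart.
cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
D = graph_distance_matrix(cycle)
```

Each row of this matrix is then treated as one high-dimensional observation for the DR method.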
t-SNE [11] is a nonlinear DR algorithm which has been widely used in various applications. It is also one of the most successful DR algorithms for graph layout among methods including IDMAP, UMAP and PBC [12]. Among the parameters that determine t-SNE performance, perplexity is the one showing randomness, as different perplexities often yield different visualizations. Maaten and Hinton [11] suggested typical values of the perplexity between 5 and 50. Later, Wattenberg et al. [13] demonstrated examples with various settings of perplexity, which lead to a variety of shapes, clusters and topologies in the two-dimensional visualization, and they claimed that "getting the most from t-SNE may mean analyzing multiple plots with different perplexities." In practice, the perplexity is usually obtained by random initialization, a guess based on experience, or a grid search, and the value with the best performance among the trials is then selected. For large datasets, a grid search is extremely time-consuming, which is not good practice.
The question of how to estimate an appropriate t-SNE perplexity is still open, although it has been indicated that the most appropriate value of perplexity depends on the density of the data, meaning that a larger and/or denser dataset requires a larger perplexity [14]. We are therefore interested in how to estimate a perplexity which is appropriate to generate graph layouts of good quality with less random output, and which is easy to use without requiring multiple trials. Since the perplexity is closely related to the size and density of the data as suggested above, we explore the following questions to reveal the truth behind the mysterious perplexity of t-SNE in this paper:
1.
What is the relationship between different perplexities and the size of a graph dataset?
2.
Is there a reasonable perplexity range for a good-quality graph layout? If the answer is yes, what role do the size and density of a graph play when estimating a reasonable perplexity?
In the remainder of this paper, we first review the state-of-the-art tools employing t-SNE to draw graphs. We then design our approach to investigate t-SNE perplexity. The experimental results are finally presented to validate our approach. Our contributions are:
1.
We modify the standard t-SNE [14] to accept an underestimated perplexity as a valid input, improving its usability (the standard t-SNE does not accept an underestimated perplexity as a valid input).
2.
We explore the impact of different perplexities on graph layout, identify the relationship between the perplexity and graph layout quality, and propose a perplexity estimation approach for good graph layouts using the standard t-SNE.
3.
We further propose an adapted version of the perplexity estimation which can be applied with the sklearn Barnes–Hut TSNE [15] to accelerate the process while keeping good layout quality for large graph datasets.
Tested on a set of benchmark datasets, our proposed approach demonstrates better performance than one of the most successful methods, tsNET* [6], reducing the normalized stress by 5% and increasing the neighborhood preservation by 2.5% on average. With our adapted version of the perplexity estimation for the sklearn Barnes–Hut TSNE [15], the runtime on large datasets can be reduced from hundreds of seconds with the standard t-SNE to tens of seconds, while keeping comparable or even better normalized stress and neighborhood preservation. Our proposed approach thus presents advantages with respect to effectiveness and ease of use.
2 Related work
In this section, we review related work that employs t-SNE to draw graphs, introduce the definition of perplexity in t-SNE, and present the metrics for graph layout evaluation applied in this paper.
2.1 Graph drawing based on t-SNE and t-SNE variations
Kruiger et al. [6] demonstrated that their adaptation of t-SNE, named tsNET, outperforms several other DR algorithms such as IDMAP in graph layout; tsNET simplifies the input parameter settings to focus on perplexity only, with minimal tuning of other parameters such as learning rate and number of iterations. Based on tsNET, tsNET* initializes a layout using PivotMDS and optimizes from that initial layout. The authors mention that the perplexities were set to 80 on average, with a standard deviation of 45, for most of the test datasets. However, they did not discuss how the perplexity for each dataset was selected (except for the EVA dataset) or whether different perplexities affect the graph layout; moreover, tsNET rejects a small perplexity as input.
Xu et al. [7] drew graphs using a model named t-NeRV [16], a generalized t-SNE model based on a modified graph-theoretic distance matrix. Its cost function combines the average of the Kullback–Leibler divergences from the high-dimensional to the low-dimensional space and vice versa with the cost of one of three energy models: a modified Kamada–Kawai (KK) model, a modified Fruchterman–Reingold (FR) model, or a modified LinLog model. Their approach also showed that the output graph layout depends strongly on the fine-tuning of a set of parameters controlling early compression, adjacent neighbor compression, node repulsive forces, and edge attractive forces during the cost computation. Again, how the perplexity was selected for the different results presented in the paper was not mentioned.
In addition to drawing graphs based on graph-theoretic distances, a parametric version of t-SNE named GraphTSNE [8] incorporates additional node features into the final graph layout. GraphTSNE trains a two-layer residual gated graph convolutional network using a modified t-SNE cost \(C_T\) consisting of both a graph clustering loss \(C_G\) and a feature clustering loss \(C_X\), so that \(C_T = \alpha C_G + (1-\alpha ) C_X\). The graph clustering loss is computed from the shortest graph-theoretic distance, while the feature clustering loss is computed from the Euclidean distance between the word embeddings of each node. An important learning goal is to optimize the trade-off weight \(\alpha \) between the two loss components; the perplexity is set to 30. However, when tested with tsNET, the CORA dataset only works when the perplexity is set larger than 169. The impact of different perplexities in GraphTSNE has not been investigated.
We can observe that the perplexity initialization in the use cases above shows randomness. In fact, the optimization of perplexity has drawn the attention of several researchers. De Rosa et al. [17] investigated several meta-heuristic perplexity optimization methods, including the artificial bee colony algorithm, the bat algorithm, genetic programming, and particle swarm optimization, for word embedding visualization. A practical issue is that these optimization methods are evaluated by the residual Kullback–Leibler divergence cost, which is closely related to the perplexity itself: different perplexities correspond to different costs at initialization, so the cost residues for different perplexities are not comparable. De Bodt et al. [18, 19] tried perplexity-free parametric t-SNE, and also tried to adjust the embeddings (visualizations) by using additional class labels instead of the traditional perplexity [20]. Their methods introduce a supervised training process. As in our case t-SNE is employed for unsupervised DR, we first investigate the impact of different perplexities on graph layout quality, and then test our approach of estimating an appropriate value of this parameter on benchmark datasets.
2.2 Definition of perplexity in t-SNE
The perplexity in t-SNE is 2 to the power of the Shannon entropy of the conditional distribution induced by a data point \(x_i\) (see Eq. 1). As explained by Maaten [14], "the perplexity of a fair die with k sides is equal to k. In t-SNE, the perplexity may be viewed as a knob that sets the number of effective nearest neighbors. It is comparable with the number of nearest neighbors k that is employed in many manifold learners."
$$\begin{aligned} \text {Perp}(P_i) = 2^{H(P_i)} \end{aligned}$$
(1)
where
$$\begin{aligned} H(P_i) = -\sum _j p_{j\vert i} \log _{2}{p_{j\vert i}} \end{aligned}$$
(2)
and
$$\begin{aligned} p_{j\vert i} = \frac{\exp (-\Vert x_{i}-x_{j}\Vert ^2/2\sigma _{i}^2)}{\sum _{k (k\ne i)} \exp (-\Vert x_{i}-x_{k}\Vert ^2/2\sigma _{i}^2)},\; \; p_{i\vert i} = 0 \end{aligned}$$
(3)
where \(p_{j\vert i}\) is the conditional probability of data point \(x_j\) being in the neighborhood of \(x_i\) based on a Gaussian distribution. t-SNE defines joint probabilities \(p_{ij}\) by symmetrizing the two conditional probabilities as follows:
$$\begin{aligned} p_{ij} = \frac{p_{j\vert i} + p_{i\vert j}}{2}. \end{aligned}$$
(4)
Note that the bandwidth of the Gaussian kernels, i.e., \(\sigma _{i}\) in Eq. 3, is set such that \(H(P_i)\) equals a predefined perplexity [15]. In addition, \(\sigma _{i}\) is different for each data point \(x_{i}\): it changes for each high-dimensional instance in order to capture the variations in density across high-dimensional neighborhoods, where \(\sigma _{i}\) tends to be smaller in regions of the data space with higher data density than in regions with lower density [15]. In practice, \(\sigma _{i}\) is found iteratively by trial and error [21], using binary search, until a user-defined perplexity H is reached. With this process the probability matrix P in the high-dimensional space is obtained. Similarly, the probability matrix Q in the low-dimensional space, based on a t-distribution, is computed as
$$\begin{aligned} q_{ij} = q_{ji} = \frac{(1+\Vert y_{i}-y_{j}\Vert ^2)^{-1}}{\sum _{k \ne l} (1+\Vert y_{k}-y_{l}\Vert ^2)^{-1}},\; q_{ii} = 0 \end{aligned}$$
(5)
Then, the Kullback–Leibler divergence \(C_{\mathrm{KL}}\) is computed as the cost to minimize the difference between P and Q, usually by gradient descent:
$$\begin{aligned} C_{\mathrm{KL}} = \sum _{i \ne j} p_{ij}\log \frac{p_{ij}}{q_{ij}}. \end{aligned}$$
(6)
With an underestimated perplexity, or when the Gaussian kernel is not found within the limited number of iterations, the data points in the defined neighborhood fail to fit a Gaussian, resulting in exceptions with undefined \(p_{j\vert i}\). We discuss our solution to this issue in Sect. 3.1.
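To make the definitions in Eqs. 1-3 concrete, the following sketch (our own illustration, not code from the paper) computes the conditional distribution \(p_{j\vert i}\), its entropy, and the resulting perplexity for a single point at a candidate bandwidth \(\sigma _i\):

```python
import numpy as np

def perplexity_of_point(X, i, sigma_i):
    """Compute Perp(P_i) = 2**H(P_i) for point i at bandwidth sigma_i (Eqs. 1-3)."""
    sq_dists = np.sum((X - X[i]) ** 2, axis=1)   # ||x_i - x_j||^2 for all j
    logits = -sq_dists / (2.0 * sigma_i ** 2)
    logits[i] = -np.inf                          # enforce p_{i|i} = 0
    p = np.exp(logits - logits.max())            # numerically stabilized softmax
    p /= p.sum()
    nz = p > 0
    H = -np.sum(p[nz] * np.log2(p[nz]))          # Shannon entropy, Eq. 2
    return 2.0 ** H                              # Eq. 1

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# A larger bandwidth spreads probability mass over more neighbors,
# so the effective neighborhood size (the perplexity) grows.
small = perplexity_of_point(X, 0, 0.5)
large = perplexity_of_point(X, 0, 5.0)
```

Since the perplexity grows monotonically with \(\sigma _i\), the binary search over \(\sigma _i\) described above can reach target perplexities between 1 and roughly \(N-1\).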
2.3 Metrics for graph layout evaluation
Espadoto et al. [12] summarized the most widely used metrics for graph layout evaluation. Quality metrics include scalar metrics (trustworthiness, continuity, normalized stress, neighborhood hit, visual separation metrics), local metrics (projection precision score, stretching and compression, average local error), and point-pair metrics. As we would like to stay in line with the work on tsNET and tsNET* [6], the graph layouts in this paper are evaluated exactly as in [6] by (1) the normalized stress measurement \(M_{\sigma }\), (2) the neighborhood preservation measurement \(M_{np}\), and (3) visual inspection. Below, we list the formal definitions of normalized stress and neighborhood preservation from [6]. As visual inspection is not a quantitative metric, we cannot define it formally like the other two measurements. Instead, we evaluate a visualization by inspecting structural distortion and overlapping on the benchmark datasets.
2.3.1 Normalized stress measure \(M_{\sigma }\)
A key to the DR approach is to suitably define the distance d so that the stress is minimized to reflect a good layout.
Given a set of N n-dimensional observations \(\{x_i \in R^n\}_{i=1}^N\), a DR technique maps each \(x_i\) to a lower-dimensional set \(\{y_i \in R^m\}_{i=1}^N\), where m is usually 2 or 3. The normalized stress \(M_{\sigma }\) is computed as in Eq. 7, where \(d(x_i,x_j)\) is a distance metric over the input space, in this case the shortest graph-theoretic distance, and \(\Vert y_i - y_j\Vert \) usually refers to the Euclidean 2D distance.
$$\begin{aligned} M_{\sigma } = \sum _{i,j}\left( \frac{d(x_i,x_j)-\Vert y_i - y_j \Vert }{d(x_i, x_j)} \right) ^2 \end{aligned}$$
(7)
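Eq. 7 translates directly into a few lines (our own sketch; it assumes the graph-theoretic distances are supplied as a full matrix and sums over all ordered pairs \(i \ne j\)):

```python
import numpy as np

def normalized_stress(D, Y):
    """Normalized stress M_sigma of Eq. 7.

    D: (N, N) matrix of graph-theoretic distances (nonzero off the diagonal).
    Y: (N, 2) layout coordinates.
    """
    # Pairwise Euclidean distances in the layout.
    E = np.sqrt(((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1))
    mask = ~np.eye(len(D), dtype=bool)           # exclude i == j (d would be 0)
    return float(np.sum(((D[mask] - E[mask]) / D[mask]) ** 2))

# Three collinear points laid out exactly at their graph distances: zero stress.
D = np.array([[0, 1, 2], [1, 0, 1], [2, 1, 0]], dtype=float)
Y_perfect = np.array([[0, 0], [1, 0], [2, 0]], dtype=float)
Y_stretched = 2 * Y_perfect                      # doubling every distance adds stress
```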
2.3.2 Neighborhood preservation measure \(M_{np}\)
Given a graph \(G = (V, E)\), let \(N_{G}(x_{i}, r_{G}) = \{x_{j} \in V \mid d_{ij} \le r_{G}\}\) denote the nodes with a graph-theoretic distance of at most \(r_{G}\) from node \(x_{i}\). Then, in the low-dimensional space, which is usually the two-dimensional projection space, an equally sized neighborhood of \(x_i\) in the layout, \(N_{Y}(x_{i}, k_i)\), is defined as the set of nodes corresponding to the data points that are the \(k_i\) nearest neighbors of \(y_i\), with \(k_i = \vert N_G(x_i, r_G)\vert \). Note that \(k_i\) may differ for different \(x_i\). The neighborhood preservation measurement \(M_{np}\) is then defined as the Jaccard similarity between the neighborhoods in the high- and low-dimensional spaces, \(N_G\) and \(N_Y\), averaged over G (Eq. 8). In our tests, we let \(r_G = 2\) to compare with the results in the related references.
$$\begin{aligned} M_{np} = \frac{1}{\vert V \vert }\sum _{i}\frac{\vert N_G(x_i, r_G) \cap N_Y(x_i , k_i)\vert }{\vert N_G(x_i, r_G) \cup N_Y(x_i , k_i)\vert } \end{aligned}$$
(8)
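Eq. 8 can be sketched as follows (our own illustration; ties in the layout nearest-neighbor ranking are broken arbitrarily, and the degenerate case of two empty neighborhoods is counted as perfectly preserved):

```python
import numpy as np

def neighborhood_preservation(D, Y, r_G=2):
    """Average Jaccard similarity of graph vs. layout neighborhoods (Eq. 8)."""
    N = len(D)
    E = np.sqrt(((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1))
    total = 0.0
    for i in range(N):
        # Graph neighborhood: nodes within r_G hops, excluding i itself.
        NG = {j for j in range(N) if j != i and D[i, j] <= r_G}
        k = len(NG)
        # Layout neighborhood: the k_i nearest neighbors of y_i.
        NY = set([j for j in np.argsort(E[i]) if j != i][:k])
        union = NG | NY
        total += len(NG & NY) / len(union) if union else 1.0
    return total / N

# Path graph 0-1-2-3 drawn on a line preserves every 2-hop neighborhood.
D = np.abs(np.arange(4)[:, None] - np.arange(4)[None, :]).astype(float)
Y = np.column_stack([np.arange(4.0), np.zeros(4)])
```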
3 Design of our approach
3.1 Modification on perplexity and bandwidth fitting in the standard t-SNE
We modify the Python code available from the t-SNE GitHub repository [14] to improve the handling of exceptions during bandwidth fitting.
The handling of exceptions varies in practice. The original standard t-SNE code on which we base our work does not accept underestimated perplexities, and neither does tsNET. For example, tsNET simply throws an exception in case an underestimated perplexity is fed in; the user is then expected to try again with an increased perplexity until it works. This is a reasonable solution in practice, as an increased perplexity means considering a broader neighborhood with more samples for more reliable statistics. However, the randomness in this perplexity initialization procedure is quite user-unfriendly, especially for extreme examples like the EVA dataset, which needs a perplexity of almost 600 to start off in tsNET [6].
As we identify that the exceptions are caused by data sparsity, we apply two strategies to solve the problem. The first strategy is to increase the Gaussian kernel \(\sigma _i\) by a controlled exaggeration rate during the search for an appropriate \(\beta _i\), where \(\beta _i\) is a function of \(\sigma _i\), \(\beta _i = \frac{1}{2\sigma _i^2}\), in order to avoid failures when fitting the data points. The exaggeration rate is 2 in our experiments.
The second strategy is to run an intrinsic grid search within a limited number of steps for a bigger perplexity: when an exception is detected during fitting without updating the initial exaggeration rate of the Gaussian kernel, the perplexity is increased by a predefined rate, currently 3 in our experiments. The modifications are described in Algorithm 1.
The two strategies are applied in different situations to ensure that an output is delivered. Generally, the perplexity-increase strategy works for all kinds of datasets. However, as the intrinsic grid search is time-consuming for a big dataset, the Gaussian kernel strategy can be used there to deliver a result with any perplexity initialization. In our experiments, the Gaussian kernel strategy is applied to large graphs, and the perplexity-increase strategy is applied to small graphs.
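The bandwidth search with the two fallback strategies can be sketched as follows (a minimal illustration of the idea rather than the authors' released code; the exaggeration rate 2 and perplexity-increase rate 3 follow the text, while the control flow and helper names are our own assumptions):

```python
import numpy as np

def entropy_beta(sq_dists, beta):
    """Shannon entropy (bits) of p_{.|i} at precision beta = 1/(2*sigma_i**2)."""
    p = np.exp(-sq_dists * beta)
    s = p.sum()
    if s <= 0 or not np.isfinite(s):
        return None                           # degenerate fit: signal failure
    p = p / s
    nz = p > 0
    return float(-np.sum(p[nz] * np.log2(p[nz])))

def fit_beta(sq_dists, perplexity, kernel_exaggeration=2.0,
             perp_increase=3.0, max_iter=50, tol=1e-5, retry=True):
    """Binary-search beta_i so that 2**H(P_i) matches the target perplexity.

    On a degenerate fit, widen the kernel by `kernel_exaggeration`
    (strategy 1); if the search does not converge, retry once with an
    increased perplexity (strategy 2).
    """
    target = np.log2(perplexity)
    beta, lo, hi = 1.0, 0.0, np.inf
    for _ in range(max_iter):
        H = entropy_beta(sq_dists, beta)
        if H is None:                         # strategy 1: exaggerate sigma_i
            beta /= kernel_exaggeration ** 2  # sigma_i *= kernel_exaggeration
            continue
        if abs(H - target) < tol:
            return beta
        if H > target:                        # too many effective neighbors
            lo = beta
            beta = beta * 2.0 if hi == np.inf else 0.5 * (lo + hi)
        else:                                 # too few effective neighbors
            hi = beta
            beta = 0.5 * (lo + hi) if lo > 0.0 else beta / 2.0
    if retry:                                 # strategy 2: bigger perplexity
        return fit_beta(sq_dists, perplexity * perp_increase,
                        kernel_exaggeration, perp_increase,
                        max_iter, tol, retry=False)
    return beta

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
sq = np.sum((X - X[0]) ** 2, axis=1)[1:]      # squared distances from x_0
beta = fit_beta(sq, 10.0)
```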
3.2 Appropriate perplexity estimation for the standard t-SNE
First of all, we would like to identify the relationship between the perplexity and the graph layout quality, measured by the normalized stress, the neighborhood preservation, and the visualization that directly reflects the global connective structure.
The notion of a "large" or "small" perplexity depends on the size of the dataset; therefore, we map the value of the perplexity to a percentage of the dataset size. The t-SNE perplexity defines the size of the neighborhood within which the data points are considered. Based on the above discussion, we propose the following hypothesis:
Hypothesis:
A relatively smaller perplexity will generate a graph layout with both higher normalized stress and more neighborhood preservation, while a larger perplexity will generate a graph layout with both lower normalized stress and less neighborhood preservation.
A higher normalized stress means that more overall information is lost from a global perspective, while a higher neighborhood preservation means that neighborhood information is preserved with higher precision from a local perspective, and vice versa. A further question based on the hypothesis is therefore how to balance the perplexity against the layout quality measured by the two metrics. We propose our perplexity estimation approach by considering the size of the dataset, the distribution of the input data, and the graph density, as follows.
Given a graph G(V, E), the Gaussian mean \((\mu )\) and kernel \((\sigma )\) of the graph-theoretic distances of G are obtained first. Then, we estimate a reasonable perplexity perp from the dataset size \(\vert V\vert \), the graph density \(d = \vert E\vert / \vert V\vert \), and the \(\mu \) and \(\sigma \) of the Gaussian, as described in Eq. 9, where \(\delta (d)\) can be regarded as a graph density regulator: the graph density d is considered a positively correlated factor of the perplexity in addition to the dataset size, so a small or large d corresponds to a small or large perplexity, respectively.
$$\begin{aligned} \mathrm{perp} = \left\{ \begin{array}{lr} a\_\mathrm{fixed}\_\mathrm{number}, &{} \text {if } \vert V\vert < \text {threshold,}\\ \vert V\vert \frac{\mu - 2\sigma }{\mu }\delta (d), &{} \text {if } \vert V\vert \ge \text {threshold,} \end{array} \right. \end{aligned}$$
(9)
The idea behind the upper part of the perplexity selection in Eq. 9 is that the statistics of small-sized data are less reliable than those of relatively large-sized data; therefore, we set a fixed number as the perplexity. The idea behind the bottom part of the selection method can be explained from three aspects. First, the perplexity can be bigger for a large dataset and smaller for a small dataset, so the dataset size \(\vert V\vert \) is a direct factor when choosing a perplexity. Second, the perplexity can be bigger for more densely distributed data, whereas a smaller perplexity works for sparsely distributed data; this factor is measured by how far the Gaussian distribution spreads when considering the neighborhood far away, i.e., \(\frac{\mu - 2\sigma }{\mu }\). Finally, the perplexity is regulated by the graph density regulator \(\delta (d)\).
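The estimator of Eq. 9 is straightforward to sketch (our own illustration; the defaults fixed = 40 and threshold = 1000 follow the experimental settings of Sect. 4.3, while the constant \(\delta \) default is a placeholder for the density regulator defined later in Eq. 12):

```python
import statistics

def estimate_perplexity(graph_dists, n_vertices, n_edges,
                        fixed=40, threshold=1000, delta=lambda d: 0.1):
    """Perplexity estimation following Eq. 9.

    graph_dists: flat sequence of pairwise graph-theoretic distances of G.
    delta: graph density regulator delta(d); the constant default here is a
    placeholder for the schedule used in the experiments.
    """
    if n_vertices < threshold:
        return fixed                          # upper branch of Eq. 9
    mu = statistics.mean(graph_dists)
    sigma = statistics.stdev(graph_dists)
    d = n_edges / n_vertices                  # graph density |E|/|V|
    return n_vertices * (mu - 2 * sigma) / mu * delta(d)
```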
3.3 Appropriate perplexity estimation for sklearn Barnes–Hut TSNE
Moreover, we also consider runtime. The runtimes on large datasets such as graphs with over 2000 vertices are reported as several hundred seconds when applying tsNET and tsNET* (see Table 4 in [6]); a similar runtime trend is expected for our modified version of the standard t-SNE. In the current Python environment, an accelerated t-SNE version, the sklearn Barnes–Hut TSNE, is available, which implements Barnes–Hut approximations and allows the tool to be applied to large real-world datasets [15].
We then adapt our perplexity estimation to the sklearn Barnes–Hut TSNE by updating Eq. 9 into Eq. 10. This adaptation is expected to work on large datasets for acceleration purposes without losing too much precision. We test all the proposed methods in the following experiments.
$$\begin{aligned} \mathrm{perp}= \vert V\vert \frac{\mu - \sigma }{3\mu }\delta (d), \; \mathrm{if}\; \vert V\vert \ge \mathrm{threshold}. \end{aligned}$$
(10)
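A hedged usage sketch of the adapted estimation: the Eq. 10 computation below is runnable, while the sklearn call is shown only as a comment, because feeding a precomputed graph-theoretic distance matrix and the initialization choice are our assumptions rather than settings prescribed here:

```python
import statistics

def estimate_perplexity_bh(graph_dists, n_vertices, n_edges,
                           delta=lambda d: 0.1):
    """Perplexity for the Barnes-Hut variant, following Eq. 10.

    delta is a placeholder density regulator, as in the Eq. 9 sketch.
    """
    mu = statistics.mean(graph_dists)
    sigma = statistics.stdev(graph_dists)
    return n_vertices * (mu - sigma) / (3 * mu) * delta(n_edges / n_vertices)

perp = estimate_perplexity_bh([5.0] * 50, 1200, 3600)

# The estimate would then be passed to sklearn's accelerated implementation,
# e.g. (sketch only; 'precomputed' assumes D is a graph-theoretic distance
# matrix, and init='random' is required with a precomputed metric):
#
#   from sklearn.manifold import TSNE
#   Y = TSNE(n_components=2, perplexity=perp, method="barnes_hut",
#            metric="precomputed", init="random").fit_transform(D)
```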
4 Experiment and results
The visualizations presented in this paper are generated in Python with libraries including numpy, sklearn, networkx and plotly, in Jupyter notebooks or other web-based Python frameworks. We visualize the nodes of a graph as red dots, and color the edges red, green or grey according to their length. Equation 11 illustrates how the edge color is assigned, where e, \(\mathrm{color}_{e}\), \(d_{\mu }\) and \(d_{\sigma }\) represent the length of the edge, the color of the edge, and the mean and standard deviation of the graph-theoretic distances of the individual graph dataset, respectively.
$$\begin{aligned} \mathrm{color}_{e} = \left\{ \begin{array}{ll} \mathrm{red}, &{} \text {if } e < d_{\mu }  d_{\sigma }, \\ \mathrm{green}, &{} \text {if } d_{\mu }  d_{\sigma } \le e \le d_{\mu } + d_{\sigma }, \\ \mathrm{grey}, &{} \text {if } e > d_{\mu } + d_{\sigma }. \end{array} \right. \end{aligned}$$
(11)
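Eq. 11 corresponds to a small helper (our own sketch):

```python
def edge_color(e, d_mu, d_sigma):
    """Assign an edge color from its drawn length (Eq. 11)."""
    if e < d_mu - d_sigma:
        return "red"        # unusually short edge
    if e <= d_mu + d_sigma:
        return "green"      # length within one standard deviation of the mean
    return "grey"           # unusually long edge
```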
4.1 Test data
We test our approach on the 20 datasets released by the tsNET team [22]. Table 1 lists the datasets.
In order to analyze the results, we label the datasets by size and graph density. Small and large graphs refer to graphs with fewer than 1000 and at least 1000 vertices, respectively. Sparse and dense graphs refer to graphs with a graph density of less than 3 and at least 3, respectively, where the graph density is calculated as \(\vert E\vert /\vert V\vert \).
Table 1
Benchmark datasets
Dataset  Dataset type  \(\vert V\vert \)  \(\vert E\vert \)  \(\vert E\vert /\vert V\vert \) 

dwt_72  Planar, structural  72  75  1.0417 
lesmis  Collaboration network  77  254  3.2987 
can_96  Meshlike, structural  96  336  3.5000 
rajat11  Miscellaneous  135  377  2.7926 
jazz  Collaboration network  198  2742  13.8485 
visbrazil  Collaboration network  222  236  1.5135 
grid17  Planar, meshlike, structural  289  544  1.8824 
mesh3e1  Meshlike, structural  289  800  2.7682 
netscience  Collaboration network  379  914  2.4116 
dwt_419  Structural  419  1572  3.7518 
price_1000  Planar, tree  1000  999  0.9990 
dwt_1005  Structural  1005  3808  3.7891 
cage8  Miscellaneous  1015  4994  4.9202 
bcsstk09  Meshlike, structural  1083  8677  8.0120 
block_2000  Clusters  2000  9912  4.9560 
sierpinski3d  Structural  2050  6144  2.9971 
CAGrQc  Collaboration network  4158  13,422  3.2280 
EVA  Collaboration network  4475  4652  1.0396 
3elt  Planar, meshlike  4720  13,722  2.9072 
us_powergrid  Structural  4941  6594  1.3345 
4.2 Validation of hypothesis
We carry out a series of tests to validate the hypothesis. The perplexity is set to 2%, 5%, 10%, 20%, 30%, 40% and 50% of the size of each dataset. Figure 1 shows boxplots of \(M_{np}\) and \(M_{\sigma }\) over the 20 datasets, where the mean value of each subgroup labeled by the percentage is dotted.
Figure 1b shows that the average neighborhood preservation \(M_{np}\) increases to its peak while the perplexity is less than 10% of the dataset size, then decreases as the perplexity increases. The average normalized stress \(M_{\sigma }\) in Fig. 1a increases to its peak while the perplexity is less than 5% of the dataset size, then decreases as the perplexity increases. However, the decrease in \(M_{\sigma }\) is less sharp than that in \(M_{np}\), as shown in Table 2, where the linear model coefficients \(\mathrm{lm}\_\mathrm{coef}\) of \(M_{\sigma }\) and \(M_{np}\) are \(-4.3826\) and \(-6.1940\), respectively. If we flip one of the fitted lines over the x-axis, the intersection point of the two lines lies at \(x=0.4664\). This indicates that a large perplexity of no more than 47% of the dataset size could result in quite stable visualizations in which normalized stress and neighborhood preservation are likely to be well balanced. The correlation between pairs of variables is presented in Table 2. Pearson's correlation coefficient, presented as cor_coef, suggests that both the average \(M_{np}\) and the average \(M_{\sigma }\) are significantly negatively correlated with perplexity, with correlation coefficients \(-0.8299\) and \(-0.9450\), respectively, both very close to \(-1\). The results strongly support the hypothesis that a smaller perplexity corresponds to a larger \(M_{\sigma }\) and a larger \(M_{np}\), and that a larger perplexity corresponds to a smaller \(M_{\sigma }\) and \(M_{np}\). In addition, the smoothed quadratic means of both \(M_{\sigma }\) and \(M_{np}\) shown in Fig. 2, fitted in R by the loess method with formula \(y \sim x\), also visually support the hypothesis.
Table 2
Correlation of perplexity percentage and average \(M_{\sigma }\) and average \(M_{np}\)
\(\mathrm{Perp}\_\mathrm{pct}\) with  lm coefficient and intercept  Correlation  

\(\mathrm{lm}\_\mathrm{coef}^{1}\)  \(\mathrm{lm}\_\mathrm{intc}^{2}\)  \(\mathrm{cor}\_\mathrm{coef}^{3}\)  95% CI  \(p\_\mathrm{value}^{4}\)  
Average \(M_\sigma \)  \(-4.3826\)  0.7024  \(-0.8299\)  \([-0.9741, -0.2048]\)  0.0209 
Average \(M_{np}\)  \(-6.1940\)  4.2302  \(-0.9450\)  \([-0.9921, -0.6655]\)  0.0013 
To illustrate the issue visually with examples, Fig. 3 demonstrates two datasets with perplexities gradually increased from 2 to 100% of the dataset size using the modified standard t-SNE. We can see that the visualizations generated with a larger perplexity are fairly robust from a global perspective (lower \(M_{\sigma }\)), while preserving neighborhood information with less precision (lower \(M_{np}\)), as more nodes overlap as the perplexity increases.
We also examine the tsNET and tsNET* visualizations using their released code with the same perplexity settings as in Fig. 3, shown in Fig. 4. The maximal iteration number is set to 1100 for all tests presented in Figs. 3 and 4. A trend similar to Fig. 3 can be observed in the layouts in Fig. 4: tsNET and tsNET* visualizations generated with a larger perplexity are quite robust from a global perspective. However, the nodes of dwt_1005 in the tsNET and tsNET* layouts are strongly compressed as the perplexity increases up to 20% of the dataset size, and spread out with less overlapping as the perplexity increases beyond 30% of the dataset size, which differs from what we observe in Fig. 3, most probably due to the additional control of node repulsion and edge attraction in tsNET [6].
Overall, we can identify the descending trends of both \(M_{np}\) and \(M_{\sigma }\) with increasing perplexity in Fig. 1, which validates the hypothesis. In addition, as shown in Table 2, \(M_{np}\) and \(M_{\sigma }\) are positively correlated with \(p = 0.63\); this suggests the need for a trade-off between \(M_{np}\) and \(M_{\sigma }\) with respect to an appropriate perplexity, which should be large enough to generate a good layout with low \(M_{\sigma }\), but small enough to preserve more neighborhood details with high \(M_{np}\).
4.3 Perplexity estimation with our approach based on the modified standard t-SNE
We test our perplexity estimation approach on the 20 datasets. The perplexity is set according to Eq. 9, with the fixed value 40 as the perplexity for small-sized graphs and a threshold of 1000. Further, we roughly set \(\delta (d) \in \{0.1, 0.3\}\) in our experiments as described in Eq. 12, applying the larger \(\delta (d)\) value only to large and very dense graphs to avoid overfitting.
$$\begin{aligned} \delta (d) = \left\{ \begin{array}{ll} 0.1, &{} \text {if } d < 6,\\ 0.3, &{} \text {if } d \ge 6\text { and } \vert V \vert \ge 1000. \end{array} \right. \end{aligned}$$
(12)
4.3.1 \(M_{\sigma }\) and \(M_{np}\)
Table 3 shows the experimental results in detail. Overall, both the average \(M_{\sigma }\) and the average \(M_{np}\) of our approach improve on the overall averages of tsNET* reported in [6]. This suggests that our perplexity estimation approach is robust and effective in balancing normalized stress and neighborhood preservation. Looking at the performance of our method on the four different types of graphs, the average \(M_{\sigma }\) and \(M_{np}\) for the small- and large-sized groups as well as the sparse group also demonstrate excellent performance stability. The performance of our approach on dense graphs is comparable to that of tsNET*, with slightly lower average normalized stress and neighborhood preservation, but without significant difference. The performance on several individual datasets, including dwt_72, rajat11, 3elt, us_powergrid and dwt_1005, outperforms that of tsNET*; the others show either a better or comparable \(M_{\sigma }\) or a better or comparable \(M_{np}\), except dwt_419. When we test dwt_419 with a larger perplexity, a value over 15% of its dataset size, we obtain very stable, excellent performance.
Table 3
Normalized stress (\(M_\sigma \)) and neighborhood preservation (\(M_{np}\)) measures of graph layouts by the modified standard t-SNE (stSNE), where italic and bold numbers correspond to measures that outperform and underperform tsNET*, respectively
Dataset  Size/density label  \(M_\sigma ^{1}\)  \(M_{np}^{2}\)  

tsNET*  stSNE  tsNET*  stSNE  
Perplexity \(= 40\)  
dwt_72  Small/sparse  0.048  0.0392  0.855  0.8647 
rajat11  Small/sparse  0.097  0.0779  0.717  0.7220 
visbrazil  Small/sparse  0.098  0.0722  0.584  0.5308 
grid17  Small/sparse  0.021  0.0493  0.785  0.8499 
mesh3e1  Small/sparse  0.014  0.0166  0.904  0.9984 
netscience  Small/sparse  0.100  0.1112  0.707  0.7154 
lesmis  Small/dense  0.111  0.0967  0.712  0.6945 
can_96  Small/dense  0.084  0.0743  0.671  0.6495 
jazz  Small/dense  0.128  0.1388  0.804  0.8077 
dwt_419  Small/dense  0.024  0.0386  0.741  0.7108 
Perplexity \(=\vert V\vert \frac{\mu - 2\sigma }{\mu } \delta (\frac{\vert E \vert }{\vert V \vert }) \) as in Eq. 9  
price_1000  Large/sparse  0.160  0.1220  0.639  0.6229 
sierpinski3d  Large/sparse  0.093  0.1013  0.580  0.6531 
EVA  Large/sparse  0.161  0.2161  0.802  0.8227 
3elt  Large/sparse  0.090  0.0717  0.715  0.8113 
us_powergrid  Large/sparse  0.101  0.0742  0.457  0.5424 
dwt_1005  Large/dense  0.035  0.0185  0.619  0.6667 
cage8  Large/dense  0.203  0.1658  0.437  0.4219 
bcsstk09  Large/dense  0.022  0.0220  0.867  0.8509 
block_2000  Large/dense  0.189  0.1604  0.372  0.3640 
CAGrQc  Large/dense  0.189  0.2057  0.483  0.4871 
Average  Overall  0.0984  0.0936  0.6726  0.6893
Average  Small  0.0725  0.0715  0.7480  0.7544
Average  Large  0.1243  0.1158  0.5971  0.6243
Average  Sparse  0.0894  0.0865  0.7041  0.7394
Average  Dense  0.1094  0.1023  0.6340  0.6281
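As a hedged illustration (not the authors' released code), the estimation used for the rows above can be sketched in Python. The function name `estimate_perplexity` and the placeholder density regulator `delta` returning 1.0 are our assumptions; the paper's actual \(\delta (d)\) is defined separately in the text.

```python
import numpy as np

def estimate_perplexity(distances, n_nodes, n_edges, delta=lambda d: 1.0):
    """Sketch of the Eq. 9 perplexity estimate.

    distances: 1-D array of pairwise graph-theoretic distances.
    delta:     graph-density regulator delta(|E|/|V|); a placeholder
               returning 1.0 stands in for the paper's definition.
    """
    if n_nodes < 1000:
        # Small graphs (< 1000 nodes) use a fixed perplexity of 40.
        return 40.0
    mu = distances.mean()
    sigma = distances.std()
    # Eq. 9: perplexity = |V| * (mu - 2*sigma) / mu * delta(|E|/|V|)
    return n_nodes * (mu - 2.0 * sigma) / mu * delta(n_edges / n_nodes)
```

The small-graph branch reflects the fixed perplexity of 40 used in the upper half of Table 3; the Eq. 9 branch corresponds to the lower half.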
Table 4
Normalized stress (\(M_\sigma \)), neighborhood preservation (\(M_{np}\)) measures and runtime (in seconds) of sklearn Barnes–Hut TSNE (TSNE)
Perplexity \(=\vert V\vert \frac{\mu - \sigma }{3\mu }\, \delta (\frac{\vert E\vert }{\vert V\vert })\) as in Eq. 10
Group  Dataset  Runtime (s)  \(M_\sigma \) tsNET*  \(M_\sigma \) TSNE  \(M_{np}\) tsNET*  \(M_{np}\) TSNE
Large/sparse  price_1000  2.8  0.160  0.1666  0.639  0.6332
Large/sparse  sierpinski3d  5.0  0.093  0.1421  0.580  0.6538
Large/sparse  EVA  27.4  0.161  0.1771  0.802  0.8197
Large/sparse  3elt  23.5  0.090  0.0660  0.715  0.8226
Large/sparse  us_powergrid  24.8  0.101  0.0889  0.457  0.5789
Large/dense  dwt_1005  2.4  0.035  0.0199  0.619  0.6744
Large/dense  cage8  2.5  0.203  0.1551  0.437  0.4161
Large/dense  bcsstk09  3.5  0.022  0.0242  0.867  0.8420
Large/dense  block_2000  5.6  0.189  0.1519  0.372  0.3645
Large/dense  CAGrQc  31.3  0.189  0.2264  0.483  0.4915
Average  Large/overall  12.9  0.1243  0.1218  0.5971  0.6296
Average  Large/sparse  16.7  0.1210  0.1281  0.6386  0.7014
Average  Large/dense  9.0  0.1276  0.1155  0.5556  0.5577
4.3.2 Visual inspection
Visual inspections of all datasets with perplexities of 2% and 10% of the dataset size, as well as with the estimate of our method based on the modified standard t-SNE, are shown in Fig. 6. We find that a small perplexity often reduces fitting precision, or makes it uncertain, and therefore causes visual distortions of the graph layout. Several mesh-type graph examples in Fig. 6, such as dwt_72, can_96, 3elt, dwt_1005, sierpinski3d and bcsstk09, illustrate this problem clearly when their perplexities are set as small as 2% of the individual dataset size. Overall, most of the visualizations generated with the perplexity estimated by our approach based on the modified standard t-SNE, except dwt_419, display better structures with no or fewer visual distortions (shown as gray edges). dwt_419 can be better visualized when the perplexity is increased, as mentioned above.
4.3.3 Runtime
The runtime in seconds is shown in Fig. 5, where the datasets are ordered by their number of nodes (size) along the x-axis. We do not compare runtime as precisely as in Table 3, because the runtime reported in the work of tsNET and tsNET* was based on their customized settings of perplexity and stop conditions, which are unknown to us. We can observe in Fig. 5 that, for the same dataset, the runtime generally increases as the perplexity grows from 2 to 50% of the dataset size, and that larger datasets need more time. The figure also shows that our perplexity estimation approach chooses a perplexity between 2 and 10% of the dataset size for most of the datasets tested.
In the experiments above, we focus on the performance of our approach based on the modified standard t-SNE rather than on speed. As tsNET/tsNET* and our modified version of the standard t-SNE take several hundred seconds on large datasets with over 4000 nodes, we try to improve this with the adapted estimation method described in Eq. 10. Our experiment on the large test datasets also demonstrates very good performance, with \(M_\sigma = 0.1218\) and \(M_{np} = 0.6296\) on average, compared to the tsNET* averages of \(M_\sigma = 0.1243\) and \(M_{np} = 0.5971\), as given in Table 4. We find that our perplexity estimation for the Barnes–Hut t-SNE does not work well on small datasets, and that it works better on large dense data than on large sparse data, as the Barnes–Hut approximation on sparse or small data causes a higher information loss. The average runtime is dramatically reduced from 598 s with our modified standard t-SNE to 12.9 s with the sklearn Barnes–Hut TSNE, while keeping overall graph layout quality comparable to the method based on the modified standard t-SNE.
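The large-graph pipeline can be sketched as follows, under stated assumptions: the helper `bh_perplexity` and the simplification \(\delta (d) = 1\) are ours, not the authors' code, and the graph-theoretic distances are computed with SciPy's shortest-path routine.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

def bh_perplexity(adjacency):
    """Sketch of the Eq. 10 estimate for the Barnes-Hut TSNE.

    adjacency: dense 0/1 adjacency matrix of an undirected, connected graph.
    The density regulator delta(|E|/|V|) is simplified to 1.0 here.
    """
    graph = csr_matrix(adjacency)
    # Pairwise graph-theoretic distances (hop counts for unweighted graphs).
    dist = shortest_path(graph, directed=False, unweighted=True)
    upper = dist[np.triu_indices_from(dist, k=1)]
    mu, sigma = upper.mean(), upper.std()
    n_nodes = dist.shape[0]
    # Eq. 10: perplexity = |V| * (mu - sigma) / (3 * mu) * delta(|E|/|V|)
    return n_nodes * (mu - sigma) / (3.0 * mu)
```

The estimate would then feed something like `sklearn.manifold.TSNE(perplexity=bh_perplexity(A), method='barnes_hut', init='random')`; that call is a sketch, not the authors' exact configuration.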
4.4 Discussion
In our approach, we estimate the value of perplexity based on the normality of the input data, which is very effective when tested on the benchmark datasets. We also note that each test dataset is a single connected component, so that the distribution of pairwise graph-theoretic distances fits normality well. For graphs with many disconnected components, t-SNE tends to generate a layout with much higher stress, as our pilot experiments show; similar findings are also reported in [7]. Focusing on the largest connected component, or on the connected component of interest, is a practical way to employ t-SNE on graphs with many disconnected components.
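For instance, extracting the largest connected component before computing graph-theoretic distances can be done with networkx; the toy graph below is a hypothetical illustration, not one of the benchmark datasets.

```python
import networkx as nx

# Hypothetical graph with two disconnected components.
G = nx.Graph()
G.add_edges_from([(0, 1), (1, 2), (2, 3), (10, 11)])

# Keep only the largest connected component, so that every pairwise
# graph-theoretic distance fed to t-SNE is finite.
largest = max(nx.connected_components(G), key=len)
G_main = G.subgraph(largest).copy()
```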
We set the perplexity of small graphs with fewer than 1000 nodes to 40, an approximation between 2 and 10% of 1000 based on the average graph size of the test data. An increased value of the graph-density regulator \(\delta (d)\) is not applied to the small graphs to avoid overfitting, since their visualizations show more randomness/uncertainty than those of large graphs due to data sparsity. In our experiment, we observe an automatic increase in perplexity from 40 to 100 for the small dense dataset jazz, which shows the robustness of our approach.
Our approach presents advantages with respect to ease of use and effectiveness.
First, our approach does not require users to try multiple perplexities as input to deliver acceptable output, which makes it very easy to use. For example, the EVA dataset can be visualized with perplexities much smaller than the 600 used in tsNET and tsNET*, as shown in Fig. 6. An additional grid search could probably still help to find an optimum, at extra cost in time and effort.
Second, our approach does not rely on t-SNE parameter tuning such as the weight of the compression term [6, 7], and it does not require an additional PivotMDS layout as initialization, as tsNET* does, although the employment of a PivotMDS layout seems to be beneficial for small graphs in our tests. Since tsNET* works from the initialization of PivotMDS layouts, t-SNE cannot obtain better embeddings when the PivotMDS layouts do not fully reflect the relations based on graph-theoretic distances.
Experimental results show that our approach is very effective for visualizing graph data with good quality, as evaluated by normalized stress, neighborhood preservation and visual inspection, and achieves a significant improvement in runtime when applied with the sklearn Barnes–Hut TSNE on large datasets without losing visualization quality. Our work is a step beyond the work of Kruiger et al. [6] and Wattenberg et al. [13]: it not only reveals the impact of t-SNE perplexity on graph layouts, but also presents guidelines for estimating an appropriate perplexity for a good layout, and offers the possibility of accelerating the process using widely used Python tools.
5 Conclusion and future work
In this paper, we investigated the relationship between t-SNE perplexity and graph layout, improved the standard t-SNE to accommodate a variety of perplexity initializations, and proposed a perplexity estimation approach for good-quality graph data visualization, evaluated by widely recognized quality metrics of graph layout.
Our approach demonstrated effectiveness and ease of use for graph data visualization when tested on a set of different types of benchmark datasets. As t-SNE is a widely recognized DR technique and perplexity is its most significant parameter, our research can be beneficial to a broad range of related research and applications.
In the future, we will investigate other graph structures to improve our estimation, especially the density regulator, and test more types of data such as disconnected networks and labeled networks. We are also very interested in investigating the relationship between perplexity-like parameters and graph layout in other manifold learning algorithms such as UMAP.
6 Statement
Note: “I” refers to the corresponding author Chun Xiao in the following statement.

All authors wrote and reviewed the main manuscript, Chun Xiao prepared the figures.

I confirm that I understand that the journal International Journal of Data Science and Analytics is a transformative journal. When research is accepted for publication, there is a choice to publish using either immediate gold open access or the traditional publishing route.

I declare that the authors have no competing interests as defined by Springer, or other interests that might be perceived to influence the results and/or discussion reported in this paper.

The results/data/figures in this manuscript have not been published elsewhere, nor are they under consideration by another publisher.

I have read the Springer journal policies on author responsibilities and submit this manuscript in accordance with those policies.

All of the material is owned by the authors and/or no permissions are required.

This research did not involve Human Participants and/or Animals.
Acknowledgements
This research was supported by the Australian Research Council Linkage Grant LP160100935 and Oracle Research Lab Australia.
Declarations
Conflicts of interest
The authors declare no competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.