1 Introduction
- We first establish the links between the Poisson Latent Block Model (PLBM) and the Poisson Stochastic Block Model (PSBM), and then show the benefit of considering PLBM rather than PSBM.
- We propose a Sparse PLBM (SPLBM), a probabilistic model suited to the clustering of multiple graphs, and derive an EM-type learning algorithm.
- We perform extensive numerical experiments and compare our proposal with multi-view and tensor decomposition methods.
- Finally, relying on the ensemble method, we show that the proposed algorithm, which can be viewed as an implicit consensus clustering for multiple graphs, is more effective than the explicit clustering obtained by traditional consensus clustering methods.
2 Related work
Karrer and Newman (2011) derived DC-SBM from an undirected graph on n vertices, possibly including self-edges, by maximizing the log-likelihood (1). Furthermore, they established the equivalence between the maximization of the log-likelihood and the maximization of mutual information used as an objective function for clustering bipartite graphs (Dhillon et al. 2003). It is important to emphasize that the model proposed in Karrer and Newman (2011) is similar to that proposed by Nadif and Govaert (2005), where the authors also showed this connection with the maximization of mutual information; they proposed the Croinfo algorithm, as illustrated in Fig. 2. In fact, the objective function maximized by DC-SBM, which can also be used for the co-clustering of an undirected graph, is associated with a constrained Poisson LBM commonly used in the co-clustering context; see, e.g., Ailem et al. (2017a, 2017b) and Role et al. (2019). To sum up, considering DC-SBM implies that the data are generated according to a Poisson LBM with \(\mathcal {P}(x_{ij}; x_{i.}x_{.j}\gamma _{k\ell })\), where \(\mathcal {P}(x_{ij};\lambda )=\frac{e^{-\lambda } \lambda ^{x_{ij}}}{x_{ij}!}\), and the proportions of the node classes are assumed to be equal. In addition, although the DC-SBM and Croinfo algorithms are different, their objective is the same, and the clustering they perform follows an approach similar to that of traditional hard clustering algorithms; for more detail, the reader can refer to recent works (Govaert and Nadif 2013, 2018).
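To make this parameterization concrete, here is a minimal numpy sketch (our own illustration, not code from the cited works) of the complete-data log-likelihood of the constrained Poisson LBM, with \(\lambda_{ij} = x_{i.}x_{.j}\gamma_{k\ell}\); function and variable names are ours.

```python
import numpy as np
from scipy.special import gammaln

def poisson_lbm_loglik(X, z, w, gamma):
    """Complete-data log-likelihood of the constrained Poisson LBM,
    up to the class-proportion terms (constant under equal proportions).
    X: (n, d) count matrix with strictly positive margins;
    z, w: row / column cluster labels; gamma: (g, m) block parameters."""
    xi = X.sum(axis=1)                               # row margins x_i.
    xj = X.sum(axis=0)                               # column margins x_.j
    lam = np.outer(xi, xj) * gamma[np.ix_(z, w)]     # lambda_ij per cell
    # Poisson log-pmf summed over all cells
    return np.sum(X * np.log(lam) - lam - gammaln(X + 1))
```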
3 Poisson latent and stochastic block models
3.1 Poisson latent block model (PLBM)
3.2 Poisson stochastic block model
3.3 PLBM for sparse data: sparse PLBM (SPLBM)
3.4 PSBM, PLBM and SPLBM for graphs
4 SPLBM with multiple graphs
4.1 Three-way tensor characteristics
4.2 Definition of the proposed model
5 Variational inference
The TSPLBM algorithm scales linearly with the number of non-zero entries. Let nz denote the number of non-zero entries in \(\mathcal {X}\), it the number of iterations, g the number of clusters, and v the number of slices; the computational complexity is then \(O(it\cdot g\cdot v \cdot nz)\).
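To see where the linear dependence on nz comes from, note that only non-zero entries contribute to the \(x\log\lambda\) part of the Poisson log-likelihood, while the \(-\lambda\) part reduces to class aggregates. The sketch below makes this explicit; it is a simplified illustration with \(\lambda_{ij}=\gamma_{z_i z_j}\) and scipy sparse slices, not the authors' code.

```python
import numpy as np
import scipy.sparse as sp

def sparse_block_loglik(slices, z, gammas):
    """Poisson block log-likelihood (up to terms constant in gamma) for a
    list of (n, n) sparse adjacency slices sharing one node partition z.
    gammas: list of (g, g) positive arrays. Cost is O(g^2 + nnz) per slice,
    hence linear in the number of non-zero entries."""
    g = gammas[0].shape[0]
    sizes = np.bincount(z, minlength=g)      # class sizes n_k
    ll = 0.0
    for X, gamma in zip(slices, gammas):
        X = X.tocoo()
        # data term: only the nnz entries contribute to x_ij * log(gamma_kl)
        ll += np.sum(X.data * np.log(gamma[z[X.row], z[X.col]]))
        # exponent term: -sum_kl gamma_kl * n_k * n_l via class aggregates
        ll -= sizes @ gamma @ sizes
    return ll
```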
6 Experiments
Firstly, we compare TSPLBM and multi-view clustering (Sect. 6.2). Secondly, we evaluate the benefit of considering multiple graphs simultaneously with TSPLBM, unlike tensor decomposition methods, which consider a reduced matrix arising from multiple graphs (Sect. 6.3). Thirdly, we evaluate the impact of considering multiple graphs jointly rather than each graph separately (Sect. 6.4). Fourthly, we show how the results obtained by TSPLBM can be harnessed (Sect. 6.5). Finally, we show that TSPLBM can be viewed as an implicit consensus clustering and propose a solution to increase its clustering performance in an ensemble method framework (Sect. 6.6).
6.1 Datasets description and pre-processing
| Datasets | Type | #Graphs | #Nodes | #Clusters | Sparsity (%) |
|---|---|---|---|---|---|
| UC-digits | Images | 6 | 2000 | 10 | 98 |
| 100leaves | Images | 3 | 1600 | 100 | 98 |
| 3sources | Text | 3 | 169 | 6 | 80 |
| BBC | Text | 4 | 685 | 5 | 88 |
| DBLP1 | Text | 3 | 2223 | 3 | 96 |
| Nus-Wide-8 | Text + Images | 6 | 2738 | 8 | 83 |
| DBLP3 | Text | 3 | 12,550 | 10 | 99 |
| Amazon-products-10 | Text + Images | 7 | 9897 | 10 | 98 |
- UC-digits consists of 2000 images of handwritten digits (ten classes corresponding to the digits 0-9) described by six views: Fourier coefficients of the character shapes, profile correlations, Karhunen-Loève coefficients, pixel averages, Zernike moments, and morphological features.
- 3sources consists of 169 news texts reported by three news sources: BBC, Reuters, and The Guardian.
- BBC consists of 685 documents from BBC news, split into four segments and addressing five different topics.
- 100leaves consists of 1600 images from one hundred plant species, described by shape descriptor, fine-scale margin, and texture histogram features.
- DBLP1 consists of 2223 papers published in three different journals, described by words from the title, words from the abstract, and authors.
- DBLP3 is similar to the DBLP1 dataset but includes 12,550 papers from ten journals.
- Nus-Wide-8 consists of 2738 images from Flickr addressing eight topics, described by tags, Color Histogram (CH), Color Correlogram (CORR), Edge Direction Histogram (EDH), Wavelet Texture (WT), and block-wise color moments (CM55).
- Amazon-products-10 consists of 9897 product images from ten product categories, described by words of the product title, words of the product description, LBP features, Haralick features, Gabor features, and co-viewed and co-purchased products. Figure 9 shows all graphs (slices) reorganized according to the true partition into 10 classes.
The TSPLBM model considers count or binary adjacency matrices. Thereby, in order to apply TSPLBM to image datasets where graphs represent similarities between images according to each type of feature, we had to convert these matrices into binary adjacency matrices (1 if the similarity is higher than the 97% quantile and 0 otherwise). In this way, we were able to study the robustness of our algorithm even when one or several slices of the original data do not respect the expected structure (binary or count data).
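As a minimal sketch of this pre-processing step (our own illustration of the thresholding described above):

```python
import numpy as np

def binarize_similarity(S, q=0.97):
    """Binarize a dense similarity matrix: keep an edge when the similarity
    exceeds the q-quantile of all entries (0.97 in the protocol above)."""
    threshold = np.quantile(S, q)
    return (S > threshold).astype(np.int8)
```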
6.2 TSPLBM versus multi-view clustering
We compare TSPLBM (after binarization) with two recent and effective algorithms: SwMV (Nie et al. 2017) and MultiNMF (Liu et al. 2013). We consider six of the eight datasets, those on which these two algorithms can be applied or for which results are available. We launch TSPLBM with 30 runs and compute the average ACC, NMI, and Purity (Sripada and Rao 2011). For MultiNMF, we picked up the ACC and NMI results available in Wang et al. (2020) and Wang et al. (2015).
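For reference, these three measures can be computed as follows; this is a standard sketch using scipy/scikit-learn (not the authors' evaluation code), where ACC uses the usual Hungarian matching between clusters and classes.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score
from sklearn.metrics.cluster import contingency_matrix

def clustering_accuracy(y_true, y_pred):
    """ACC: best one-to-one matching between clusters and classes."""
    C = contingency_matrix(y_true, y_pred)
    row, col = linear_sum_assignment(-C)   # maximize the matched mass
    return C[row, col].sum() / C.sum()

def purity(y_true, y_pred):
    """Purity: each cluster is assigned to its majority class."""
    C = contingency_matrix(y_true, y_pred)
    return C.max(axis=0).sum() / C.sum()

# NMI comes directly from scikit-learn:
# nmi = normalized_mutual_info_score(y_true, y_pred)
```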
SwMV does a better job than MultiNMF; it achieves good results on UC-digits and 100Leaves. However, SwMV could not provide a clustering for DBLP1. On the other hand, TSPLBM achieves markedly better results than SwMV on the four remaining datasets. TSPLBM gives encouraging results compared with SwMV and MultiNMF applied on graphs with continuous edges.

| Datasets | MultiNMF ACC | MultiNMF NMI | SwMV ACC | SwMV NMI | SwMV Purity | TSPLBM ACC | TSPLBM NMI | TSPLBM Purity |
|---|---|---|---|---|---|---|---|---|
| UC-digits | 0.88 | 0.80 | 0.94 | 0.91 | 0.95 | 0.74 | 0.80 | 0.76 |
| 3sources | 0.48 | 0.46 | 0.35 | 0.10 | 0.36 | 0.66 | 0.54 | 0.66 |
| BBC | 0.48 | 0.33 | 0.33 | 0.05 | 0.33 | 0.66 | 0.66 | 0.66 |
| 100leaves | 0.67 | 0.86 | 0.59 | 0.87 | 0.61 | 0.46 | 0.81 | 0.46 |
| DBLP1 | – | – | NA | NA | NA | 0.83 | 0.57 | 0.85 |
| Nus-Wide-8 | – | – | 0.28 | 0.004 | 0.28 | 0.56 | 0.41 | 0.56 |
6.3 TSPLBM versus tensor decomposition approaches
To cluster the factor matrices obtained by tensor decomposition, we rely on three clustering algorithms: Kmeans++ (Arthur and Vassilvitskii 2007), Spectral Clustering (SC) (Ng et al. 2001), and the EM algorithm (Dempster et al. 1977) derived from the diagonal Gaussian Mixture Model (GMM) available in the Scikit-Learn package. Thereby, we compared the sparse tensor co-clustering algorithm TSPLBM with PARAFAC (Harshman and Lundy 1994) and the Tucker decomposition (Tucker 1966) on the six datasets presented in the previous section. We used different ranks (10, 20, and 50) and performed 30 runs with random initialization. We then computed ACC, NMI, and Purity by averaging over all runs.
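A sketch of this baseline pipeline is given below, assuming the tensorly library for the CP/PARAFAC step; the exact call may vary across versions, and `parafac_then_cluster` is our own wrapper name, not from the paper.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac
from sklearn.cluster import KMeans

def parafac_then_cluster(X, n_clusters, rank=10, seed=0):
    """Baseline: CP/PARAFAC decomposition of the (n, n, v) tensor,
    followed by k-means++ on the node-mode factor matrix."""
    weights, factors = parafac(tl.tensor(X), rank=rank, random_state=seed)
    nodes = tl.to_numpy(factors[0])          # (n, rank) node embedding
    km = KMeans(n_clusters=n_clusters, init="k-means++", n_init=10,
                random_state=seed).fit(nodes)
    return km.labels_
```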
Figure 10 reports the results of our approach (TSPLBM) and of the clustering algorithms applied on the obtained tensor decompositions. The results concern tensor decomposition approaches with a rank of 10 (the results for ranks 20 and 50 are similar). We observe that in most cases TSPLBM does a better job than the PARAFAC and Tucker decomposition methods. For the 3sources and Caltech-7 datasets, the PARAFAC and Tucker decompositions with GMM obtain close results in terms of Purity and Accuracy, but TSPLBM achieves higher performance in terms of NMI.
To compare the computational cost of TSPLBM and the tensor decomposition approaches, we report in Fig. 11 the execution time in seconds. We notice that for the four datasets 3sources, BBC, DBLP1, and UC-digits, TSPLBM is close to all other approaches in terms of execution time. However, with Nus-Wide-8 and 100Leaves, the execution time is higher; this is due to the size of these datasets and their number of clusters. Note, however, that in Fig. 10 we observe that TSPLBM outperforms the tensor decomposition approaches by approximately 25 points of ACC on both datasets.
6.4 TSPLBM versus PSBM, PLBM, and SPLBM
In this section, we compare TSPLBM with PSBM, PLBM, and SPLBM, which consider the slices separately (Sect. 3.4). We report the results obtained by PSBM, PLBM, SPLBM, and TSPLBM. PSBM, PLBM, and SPLBM are applied on each slice \(\mathbf {x}^b\) separately, unlike TSPLBM, which is applied on \(\mathcal {X}\) and considers all graphs simultaneously. We notice that, in most cases, TSPLBM is better than the other algorithms applied to each graph and achieves the best trade-off. TSPLBM includes all graphs, even those with a very complex structure. DBLP3 obtains the lowest results due to the complex structure of this dataset, composed of 12,550 papers with very close or complementary computer science topics. We observe that PLBM and SPLBM do a better job than PSBM for all datasets on the more informative slices. It is also worth noting that PLBM performs well in terms of Accuracy on DBLP1 and in terms of NMI on DBLP3. TSPLBM performs a natural consensus when considering all slices and yields a unique partition at the end with good clustering results.
6.5 Interpretation of multiple graph clustering results
To interpret the clustering results, we rely on Correspondence Analysis (CA) of the relationship between tags and the topics obtained by TSPLBM. In Fig. 13, the tags and topics are projected on the first two dimensions of the CA, including the top tags in terms of contribution to the CA results. The topics correspond to the clusters obtained by TSPLBM, and \(\mathcal {H}\) is the co-tags matrix. We can notice that some topics have a strong relationship, like plants-snow and town-persons, while others have a weak link, like animals-town and animals-temple. This representation highlights that some tags are used with an ambiguous meaning. In this context, it is possible to use tensor models for tag completion and tag correction (Tang et al. 2017; Veit et al. 2017).
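For completeness, here is a self-contained sketch of the CA projection used for this kind of figure, computed from the SVD of the standardized residuals of a tags-by-topics contingency table; this is our own implementation, not the authors' code.

```python
import numpy as np

def correspondence_analysis(N, n_components=2):
    """CA of a contingency table N (tags x topics) with positive margins:
    SVD of the standardized residuals, returning row and column
    principal coordinates for a joint 2-D plot."""
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)               # row / column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, d, Vt = np.linalg.svd(S, full_matrices=False)
    rows = (U[:, :n_components] * d[:n_components]) / np.sqrt(r)[:, None]
    cols = (Vt.T[:, :n_components] * d[:n_components]) / np.sqrt(c)[:, None]
    return rows, cols
```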
6.6 Discussion: implicit consensus versus explicit consensus
The experiments above show that TSPLBM applied on all slices simultaneously is, in most cases, better than the other algorithms. As we are in an unsupervised context, we have found it helpful to run the computation from several different random initial conditions and to take the best result in terms of maximum log-likelihood over all runs. This is the usual procedure in clustering. Next, we study why and how we can improve this step.
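This restart procedure can be sketched as follows; `fit` is a hypothetical stand-in for any routine that returns a partition together with its log-likelihood.

```python
def best_of_runs(fit, X, n_runs=30):
    """Run a clustering routine from several random starts and keep the
    partition with the highest log-likelihood. `fit` is assumed to return
    a (labels, loglik) pair; it stands in for any TSPLBM-style fit."""
    best_labels, best_ll = None, -float("inf")
    for seed in range(n_runs):
        labels, ll = fit(X, seed=seed)
        if ll > best_ll:
            best_labels, best_ll = labels, ll
    return best_labels, best_ll
```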
6.6.1 Towards a consensus clustering
We report the normalized log-likelihood (NL) for the 30 runs of TSPLBM, and we also draw the ACC and NMI curves across these 30 runs. We observe that for DBLP1, the best runs leading to a maximal NL are also the best runs in terms of clustering (ACC and NMI). However, this observation does not hold for all datasets; for instance, some of the best runs can achieve worse results in terms of ACC and NMI. This problem is recurrent with all unsupervised methods, where the best runs in terms of the objective function are not necessarily the best ones in terms of clustering. On the other hand, we may see the proposed model as an implicit consensus model for graph clustering, and it is tempting to compare it to ensemble-based clustering methods.
6.6.2 Ensemble method
Strehl and Ghosh (2002) proposed three cluster-ensemble methods: the Cluster-based Similarity Partitioning Algorithm (CSPA), the HyperGraph Partitioning Algorithm (HGPA), and the Meta-CLustering Algorithm (MCLA).
To improve the clustering performance of TSPLBM, we adopt the ensemble approach. We explore, in what follows, how implicit consensus clustering through TSPLBM behaves compared to explicit consensus through cluster ensembles of multiple graphs. In Fig. 17, we report the proposed approach to compare TSPLBM with the clustering ensemble methods proposed by Strehl and Ghosh (2002). To do this, we used the implementation of the Python package Cluster_Ensembles. It relies on CSPA, HGPA, and MCLA and returns the best result in terms of the mean NMI between the obtained consensus clustering \(\mathbf {Z}^{*}\) and the different clustering solutions {\(\mathbf {Z}_{1}\), \(\mathbf {Z}_{2}\), \(\mathbf {Z}_{3}\),..., \(\mathbf {Z}_{r}\)}. Thereby, with TSPLBM, we select the top ten runs maximizing the log-likelihood and then carry out the consensus using the cluster-ensemble methods. With SPLBM, PLBM, and PSBM, we consider two steps. The first step is the same as that used with TSPLBM: select the top ten runs and apply the cluster-ensemble methods. The second one consists in applying another clustering consensus between graphs to obtain a unique partition. Note that the consensus clustering information is implicitly provided by the TSPLBM algorithm.
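A sketch of this explicit-consensus step, assuming the Cluster_Ensembles package mentioned above: the call follows its documented entry point but should be checked against the installed version, and `top_ten_partitions` is a hypothetical list of label vectors standing in for the selected runs.

```python
import numpy as np
import Cluster_Ensembles as CE

# Stack the selected partitions, one label vector per row:
# shape (n_runs, n_samples). Hypothetical input data.
runs = np.vstack(top_ten_partitions)

# CSPA, HGPA and MCLA are run internally; the consensus with the highest
# average NMI against the input partitions is returned.
consensus = CE.cluster_ensembles(runs, N_clusters_max=10, verbose=False)
```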
We observe that TSPLBM achieves the highest NMI on all datasets. SPLBM does a better or similar job than PLBM on three datasets, while PSBM obtains the lowest NMI measures on all datasets. These results can be explained by the fact that the implicit consensus achieved by TSPLBM is optimized within the objective function of the algorithm, unlike the explicit consensus, where the partitions are obtained separately.
7 Conclusion
In this paper, we have proposed TSPLBM, a model capable of performing clustering from tensor data. On real datasets of text and image graphs, we have shown that TSPLBM outperforms the cited baseline algorithms in terms of clustering. TSPLBM can be seen as an implicit consensus clustering for multiple graphs. To reinforce this idea, a comparative study with explicit consensus through ensemble clustering methods was carried out. Experiments on several real graph datasets highlight the effectiveness of TSPLBM. Thereby, this work gives an extra dimension to LBM as an ensemble method. Our approach has made it possible to propose an EM-like learning algorithm. It is also possible to develop a Classification-EM-like version: all that is needed is to insert a classification step between the E and M steps. This could lead to an extension of DC-SBM for multiple graphs.
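As a hint of what such a variant looks like, the classification step simply hardens the E-step posteriors before the M-step. The following is a generic sketch; `e_step` and `m_step` are placeholder callables, not the authors' implementation.

```python
import numpy as np

def cem_iteration(e_step, m_step, X, params):
    """One Classification-EM iteration: the E-step posteriors are hardened
    into a 0/1 partition (C-step) before the M-step update."""
    tau = e_step(X, params)            # (n, g) posterior probabilities
    z = tau.argmax(axis=1)             # C-step: hard assignment per node
    hard = np.eye(tau.shape[1])[z]     # one-hot partition matrix
    return m_step(X, hard), z
```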