Skip to main content
Erschienen in: Data Mining and Knowledge Discovery 2/2017

06.06.2016

Graph summarization with quality guarantees

verfasst von: Matteo Riondato, David García-Soriano, Francesco Bonchi

Erschienen in: Data Mining and Knowledge Discovery | Ausgabe 2/2017

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

We study the problem of graph summarization. Given a large graph we aim at producing a concise lossy representation (a summary) that can be stored in main memory and used to approximately answer queries about the original graph much faster than by using the exact representation. In this work we study a very natural type of summary: the original set of vertices is partitioned into a small number of supernodes connected by superedges to form a complete weighted graph. The superedge weights are the edge densities between vertices in the corresponding supernodes. To quantify the dissimilarity between the original graph and a summary, we adopt the reconstruction error and the cut-norm error. By exposing a connection between graph summarization and geometric clustering problems (i.e., k-means and k-median), we develop the first polynomial-time approximation algorithms to compute the best possible summary of a certain size under both measures. We discuss how to use our summaries to store a (lossy or lossless) compressed graph representation and to approximately answer a large class of queries about the original graph, including adjacency, degree, eigenvector centrality, and triangle and subgraph counting. Using the summary to answer queries is very efficient as the running time to compute the answer depends on the number of supernodes in the summary, rather than the number of nodes in the original graph.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Fußnoten
1
We discuss the case of directed graphs in Sect. 3.5.
 
2
A skew-symmetric matrix (also known as antisymmetric or antimetric matrix) is a square matrix A whose transpose is also its negative: \(-A = A^\intercal \).
 
3
If \(v_1, \ldots , v_n \in {\mathbb R}^d\), then \(\left\| {v_i - v_j} \right\| _2^2 = \left\| {v_i} \right\| _2^2 + \left\| {v_j} \right\| _2^2 - 2 \langle v_i, v_j \rangle \). Since the quantities \(\left\| {v_i} \right\| _2^2\) can be easily precomputed, the problem reduces to computing all inner products \(\langle v_i, v_j \rangle \). These form the entries of \(A A^\intercal \), where A is the \(n\times d\) matrix with rows \(v_1, \ldots , v_n\).
 
4
For \(\ell _2\), we can also use the Johnson-Lindenstrauss transform (Johnson and Lindenstrauss 1984).
 
5
We denote as \(\left( {\begin{array}{c}X\\ k\end{array}}\right) \) the set of k-subsets of X, i.e., the subsets of X of size k.
 
6
Further space-saving can be achieved by storing only densities above a certain threshold using adjacency lists; the superedges removed increase the reconstruction error.
 
7
Minor modifications are needed if self-loops are allowed.
 
10
For speed reasons, we modified the algorithm by Arya et al. (2004) to try only a limited number of local improvements and did not run it to completion. It could otherwise achieve even better approximations.
 
11
The implementation is available from https://​github.​com/​rionda/​graphsumm.
 
Literatur
Zurück zum Zitat Aggarwal A, Deshpande A, Kannan R (2009) Adaptive sampling for k-means clustering. Approximation, randomization, and combinatorial optimization. Algorithms and techniques, APPROX-RANDOM. Springer, Berlin, pp 15–28CrossRef Aggarwal A, Deshpande A, Kannan R (2009) Adaptive sampling for k-means clustering. Approximation, randomization, and combinatorial optimization. Algorithms and techniques, APPROX-RANDOM. Springer, Berlin, pp 15–28CrossRef
Zurück zum Zitat Aloise D, Deshpande A, Hansen P, Popat P (2009) NP-hardness of Euclidean sum-of-squares clustering. Mach Learn 75(2):245–248CrossRef Aloise D, Deshpande A, Hansen P, Popat P (2009) NP-hardness of Euclidean sum-of-squares clustering. Mach Learn 75(2):245–248CrossRef
Zurück zum Zitat Alon N, Duke RA, Lefmann H, Rödl V, Yuster R (1994) The algorithmic aspects of the regularity lemma. J Algorithms 16(1):80–109MathSciNetCrossRef Alon N, Duke RA, Lefmann H, Rödl V, Yuster R (1994) The algorithmic aspects of the regularity lemma. J Algorithms 16(1):80–109MathSciNetCrossRef
Zurück zum Zitat Alon N, Naor A (2006) Approximating the cut-norm via Grothendieck’s inequality. SIAM J Comput 35(4):787–803MathSciNetCrossRef Alon N, Naor A (2006) Approximating the cut-norm via Grothendieck’s inequality. SIAM J Comput 35(4):787–803MathSciNetCrossRef
Zurück zum Zitat Arthur D, Vassilvitskii S (2007) \(k\)-means++: the advantages of careful seeding. In: Proceedings of the 18th annual ACM-SIAM symposium on discrete algorithms, SIAM, SODA ’07, pp 1027–1035 Arthur D, Vassilvitskii S (2007) \(k\)-means++: the advantages of careful seeding. In: Proceedings of the 18th annual ACM-SIAM symposium on discrete algorithms, SIAM, SODA ’07, pp 1027–1035
Zurück zum Zitat Arya V, Garg N, Khandekar R, Meyerson A, Munagala K, Pandit V (2004) Local search heuristics for \(k\)-median and facility location problems. SIAM J Comput 33(3):544–562MathSciNetCrossRef Arya V, Garg N, Khandekar R, Meyerson A, Munagala K, Pandit V (2004) Local search heuristics for \(k\)-median and facility location problems. SIAM J Comput 33(3):544–562MathSciNetCrossRef
Zurück zum Zitat Bahmani B, Moseley B, Vattani A, Kumar R, Vassilvitskii S (2012) Scalable \(k\)-means++. Proc VLDB Endow 5(7):622–633CrossRef Bahmani B, Moseley B, Vattani A, Kumar R, Vassilvitskii S (2012) Scalable \(k\)-means++. Proc VLDB Endow 5(7):622–633CrossRef
Zurück zum Zitat Boldi P, Rosa M, Santini M, Vigna S (2011) Layered label propagation: a multiresolution coordinate-free ordering for compressing social networks. In: Proceedings of the 20th international conference on World Wide Web, ACM, WWW ’11, pp 587–596 Boldi P, Rosa M, Santini M, Vigna S (2011) Layered label propagation: a multiresolution coordinate-free ordering for compressing social networks. In: Proceedings of the 20th international conference on World Wide Web, ACM, WWW ’11, pp 587–596
Zurück zum Zitat Boldi P, Vigna S (2004) The webgraph framework i: compression techniques. In: Proceedings of the 13th international conference on World Wide Web, ACM, WWW ’04, pp 595–602 Boldi P, Vigna S (2004) The webgraph framework i: compression techniques. In: Proceedings of the 13th international conference on World Wide Web, ACM, WWW ’04, pp 595–602
Zurück zum Zitat Campan A, Truta TM (2009) Data and structural k-anonymity in social networks. Privacy, security, and trust in KDD. Springer, Berlin, pp 33–54CrossRef Campan A, Truta TM (2009) Data and structural k-anonymity in social networks. Privacy, security, and trust in KDD. Springer, Berlin, pp 33–54CrossRef
Zurück zum Zitat Cormode G, Srivastava D, Yu T, Zhang Q (2010) Anonymizing bipartite graph data using safe groupings. VLDB J 19(1):115–139CrossRef Cormode G, Srivastava D, Yu T, Zhang Q (2010) Anonymizing bipartite graph data using safe groupings. VLDB J 19(1):115–139CrossRef
Zurück zum Zitat Dasgupta S (2008) The hardness of \(k\)-means clustering. Tech. Rep. 09-16. University of California, San Diego Dasgupta S (2008) The hardness of \(k\)-means clustering. Tech. Rep. 09-16. University of California, San Diego
Zurück zum Zitat Dellamonica DJ, Kalyanasundaram S, Martin DM, Rödl V, Shapira A (2012) A deterministic algorithm for the Frieze-Kannan regularity lemma. SIAM J Discret Math 26(1):15–29MathSciNetCrossRef Dellamonica DJ, Kalyanasundaram S, Martin DM, Rödl V, Shapira A (2012) A deterministic algorithm for the Frieze-Kannan regularity lemma. SIAM J Discret Math 26(1):15–29MathSciNetCrossRef
Zurück zum Zitat Dellamonica DJ, Kalyanasundaram S, Martin DM, Rödl V, Shapira A (2015) An optimal algorithm for finding Frieze-Kannan regular partitions. Comb Prob Comput 24(02):407–437MathSciNetCrossRef Dellamonica DJ, Kalyanasundaram S, Martin DM, Rödl V, Shapira A (2015) An optimal algorithm for finding Frieze-Kannan regular partitions. Comb Prob Comput 24(02):407–437MathSciNetCrossRef
Zurück zum Zitat Fan W, Li J, Wang X, Wu Y (2012) Query preserving graph compression. In: Proceedings of the 2012 ACM SIGMOD international conference on management of data, ACM, SIGMOD ’12, pp 157–168 Fan W, Li J, Wang X, Wu Y (2012) Query preserving graph compression. In: Proceedings of the 2012 ACM SIGMOD international conference on management of data, ACM, SIGMOD ’12, pp 157–168
Zurück zum Zitat Gowers WT (1997) Lower bounds of tower type for Szemerédi’s uniformity lemma. Geom Funct Anal 7(2):322–337MathSciNetCrossRef Gowers WT (1997) Lower bounds of tower type for Szemerédi’s uniformity lemma. Geom Funct Anal 7(2):322–337MathSciNetCrossRef
Zurück zum Zitat Hay M, Miklau G, Jensen D, Towsley D, Li C (2010) Resisting structural re-identification in anonymized social networks. VLDB J 19(6):797–823CrossRef Hay M, Miklau G, Jensen D, Towsley D, Li C (2010) Resisting structural re-identification in anonymized social networks. VLDB J 19(6):797–823CrossRef
Zurück zum Zitat Hernández C, Navarro G (2011) Compression of web and social graphs supporting neighbor and community queries. In: Proceedings of the 6th ACM workshop on social network mining and analysis, ACM, SNAKDD ’11 Hernández C, Navarro G (2011) Compression of web and social graphs supporting neighbor and community queries. In: Proceedings of the 6th ACM workshop on social network mining and analysis, ACM, SNAKDD ’11
Zurück zum Zitat Indyk P (2006) Stable distributions, pseudorandom generators, embeddings, and data stream computation. J ACM 53(3):307–323MathSciNetCrossRef Indyk P (2006) Stable distributions, pseudorandom generators, embeddings, and data stream computation. J ACM 53(3):307–323MathSciNetCrossRef
Zurück zum Zitat Jain K, Vazirani VV (2001) Approximation algorithms for metric facility location and \(k\)-median problems using the primal-dual schema and Lagrangian relaxation. J ACM 48(2):274–296MathSciNetCrossRef Jain K, Vazirani VV (2001) Approximation algorithms for metric facility location and \(k\)-median problems using the primal-dual schema and Lagrangian relaxation. J ACM 48(2):274–296MathSciNetCrossRef
Zurück zum Zitat Johnson WB, Lindenstrauss J (1984) Extensions of Lipschitz mappings into a Hilbert space. Contemp Math 26:189–206MathSciNetCrossRef Johnson WB, Lindenstrauss J (1984) Extensions of Lipschitz mappings into a Hilbert space. Contemp Math 26:189–206MathSciNetCrossRef
Zurück zum Zitat LeFevre K, Terzi E (2010) GraSS: graph structure summarization. In: Proceedings of the 2010 SIAM international conference on data mining, SIAM, SDM ’10, pp 454–465CrossRef LeFevre K, Terzi E (2010) GraSS: graph structure summarization. In: Proceedings of the 2010 SIAM international conference on data mining, SIAM, SDM ’10, pp 454–465CrossRef
Zurück zum Zitat Liu Z, Yu JX, Cheng H (2012) Approximate homogeneous graph summarization. Inf Media Technol 7(1):32–43 Liu Z, Yu JX, Cheng H (2012) Approximate homogeneous graph summarization. Inf Media Technol 7(1):32–43
Zurück zum Zitat Lovász L (2012) Large networks and graph limits. American Mathematical Society, ProvidenceCrossRef Lovász L (2012) Large networks and graph limits. American Mathematical Society, ProvidenceCrossRef
Zurück zum Zitat Maserrat H, Pei J (2010) Neighbor query friendly compression of social networks. In: Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, KDD ’10, pp 533–542 Maserrat H, Pei J (2010) Neighbor query friendly compression of social networks. In: Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, KDD ’10, pp 533–542
Zurück zum Zitat Megiddo N, Supowit KJ (1984) On the complexity of some common geometric location problems. SIAM J Comput 13(1):182–196MathSciNetCrossRef Megiddo N, Supowit KJ (1984) On the complexity of some common geometric location problems. SIAM J Comput 13(1):182–196MathSciNetCrossRef
Zurück zum Zitat Navlakha S, Rastogi R, Shrivastava N (2008) Graph summarization with bounded error. In: Proceedings of the 2008 ACM SIGMOD international conference on Management of data, ACM, SIGMOD ’08, pp 419–432 Navlakha S, Rastogi R, Shrivastava N (2008) Graph summarization with bounded error. In: Proceedings of the 2008 ACM SIGMOD international conference on Management of data, ACM, SIGMOD ’08, pp 419–432
Zurück zum Zitat Riondato M, García-Soriano D, Bonchi F (2014) Graph summarization with quality guarantees. In: 2014 IEEE international conference on data mining, IEEE, ICDM ’14, pp 947–952 Riondato M, García-Soriano D, Bonchi F (2014) Graph summarization with quality guarantees. In: 2014 IEEE international conference on data mining, IEEE, ICDM ’14, pp 947–952
Zurück zum Zitat Szemerédi E (1976) Regular partitions of graphs. In: Problèmes Combinatoires et Théorie des Graphes, Colloq. Internat. CNRS, Univ. Orsay., pp 399–401 Szemerédi E (1976) Regular partitions of graphs. In: Problèmes Combinatoires et Théorie des Graphes, Colloq. Internat. CNRS, Univ. Orsay., pp 399–401
Zurück zum Zitat Tassa T, Cohen DJ (2013) Anonymization of centralized and distributed social networks by sequential clustering. IEEE Trans Knowl Data Eng 25(2):311–324CrossRef Tassa T, Cohen DJ (2013) Anonymization of centralized and distributed social networks by sequential clustering. IEEE Trans Knowl Data Eng 25(2):311–324CrossRef
Zurück zum Zitat Tian Y, Hankins RA, Patel JM (2008) Efficient aggregation for graph summarization. In: Proceedings of the 2008 ACM SIGMOD international conference on management of data, ACM, SIGMOD ’08, pp 567–580 Tian Y, Hankins RA, Patel JM (2008) Efficient aggregation for graph summarization. In: Proceedings of the 2008 ACM SIGMOD international conference on management of data, ACM, SIGMOD ’08, pp 567–580
Zurück zum Zitat Toivonen H, Zhou F, Hartikainen A, Hinkka A (2011) Compression of weighted graphs. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, KDD ’11, pp 965–973 Toivonen H, Zhou F, Hartikainen A, Hinkka A (2011) Compression of weighted graphs. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, KDD ’11, pp 965–973
Zurück zum Zitat Tsourakakis CE (2008) Fast counting of triangles in large real networks without counting: algorithms and laws. In: 2008 IEEE international conference on data mining, IEEE, ICDM ’08, pp 608–617 Tsourakakis CE (2008) Fast counting of triangles in large real networks without counting: algorithms and laws. In: 2008 IEEE international conference on data mining, IEEE, ICDM ’08, pp 608–617
Zurück zum Zitat Vassilevska Williams V (2011) Breaking the Coppersmith–Winograd barrier, unpublished manuscript Vassilevska Williams V (2011) Breaking the Coppersmith–Winograd barrier, unpublished manuscript
Zurück zum Zitat Williams D (1991) Probability with Martingales. Cambridge University Press, CambridgeCrossRef Williams D (1991) Probability with Martingales. Cambridge University Press, CambridgeCrossRef
Zurück zum Zitat Zheleva E, Getoor L (2008) Preserving the privacy of sensitive relationships in graph data. In: Privacy, security, and trust in KDD, Springer, pp 153–171 Zheleva E, Getoor L (2008) Preserving the privacy of sensitive relationships in graph data. In: Privacy, security, and trust in KDD, Springer, pp 153–171
Metadaten
Titel
Graph summarization with quality guarantees
verfasst von
Matteo Riondato
David García-Soriano
Francesco Bonchi
Publikationsdatum
06.06.2016
Verlag
Springer US
Erschienen in
Data Mining and Knowledge Discovery / Ausgabe 2/2017
Print ISSN: 1384-5810
Elektronische ISSN: 1573-756X
DOI
https://doi.org/10.1007/s10618-016-0468-8

Weitere Artikel der Ausgabe 2/2017

Data Mining and Knowledge Discovery 2/2017 Zur Ausgabe

Premium Partner