Data Mining and Knowledge Discovery

Published: 6 June 2016

Graph summarization with quality guarantees

Authors: Matteo Riondato, David García-Soriano, Francesco Bonchi

Published in: Data Mining and Knowledge Discovery | Issue 2/2017

Abstract

We study the problem of graph summarization. Given a large graph, we aim to produce a concise lossy representation (a summary) that can be stored in main memory and used to approximately answer queries about the original graph much faster than by using the exact representation. In this work we study a very natural type of summary: the original set of vertices is partitioned into a small number of supernodes connected by superedges to form a complete weighted graph. The superedge weights are the edge densities between vertices in the corresponding supernodes. To quantify the dissimilarity between the original graph and a summary, we adopt the reconstruction error and the cut-norm error. By exposing a connection between graph summarization and geometric clustering problems (i.e., k-means and k-median), we develop the first polynomial-time approximation algorithms to compute the best possible summary of a certain size under both measures. We discuss how to use our summaries to store a (lossy or lossless) compressed graph representation and to approximately answer a large class of queries about the original graph, including adjacency, degree, eigenvector centrality, and triangle and subgraph counting. Using the summary to answer queries is very efficient, as the running time to compute the answer depends on the number of supernodes in the summary rather than on the number of nodes in the original graph.
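To make the kind of summary described in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes an undirected graph given as a dense 0/1 adjacency matrix, takes the partition as given rather than computing it by clustering, defines each superedge weight as the sum of adjacency entries in a block divided by the block size (diagonal blocks included, which is a simplification), and measures the \(\ell_1\) reconstruction error entrywise. The function names `block_densities`, `reconstruction_error` and `approx_degree` are hypothetical.

```python
def block_densities(adj, partition):
    """Superedge weights: density of adjacency entries between each pair of supernodes."""
    k = len(partition)
    dens = [[0.0] * k for _ in range(k)]
    for i, s_i in enumerate(partition):
        for j, s_j in enumerate(partition):
            edges = sum(adj[u][v] for u in s_i for v in s_j)
            # Assumption: normalise every block by |S_i| * |S_j|, diagonal blocks too.
            dens[i][j] = edges / (len(s_i) * len(s_j))
    return dens


def reconstruction_error(adj, partition, dens, p=1):
    """Entrywise l_p distance between adj and the matrix rebuilt from the summary."""
    label = {u: i for i, part in enumerate(partition) for u in part}
    n = len(adj)
    total = sum(abs(adj[u][v] - dens[label[u]][label[v]]) ** p
                for u in range(n) for v in range(n))
    return total ** (1.0 / p)


def approx_degree(i, partition, dens):
    """Estimated degree of any vertex in supernode i; costs O(k), not O(n)."""
    return sum(dens[i][j] * len(partition[j]) for j in range(len(partition)))


# Illustrative graph: the path 0 - 1 - 2 - 3, split into two supernodes.
adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
partition = [[0, 1], [2, 3]]

dens = block_densities(adj, partition)
print(dens)                                        # [[0.5, 0.25], [0.25, 0.5]]
print(reconstruction_error(adj, partition, dens))  # 7.0
print(approx_degree(0, partition, dens))           # 1.5 (true degrees: 1 and 2)
```

The degree query touches only the k entries of one row of the summary, which illustrates why query time depends on the number of supernodes rather than on the size of the original graph.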

We discuss the case of directed graphs in Sect. 3.5.

A skew-symmetric matrix (also known as antisymmetric or antimetric matrix) is a square matrix A whose transpose is also its negative: \(-A = A^\intercal \).

If \(v_1, \ldots , v_n \in {\mathbb R}^d\), then \(\left\| {v_i - v_j} \right\| _2^2 = \left\| {v_i} \right\| _2^2 + \left\| {v_j} \right\| _2^2 - 2 \langle v_i, v_j \rangle \). Since the quantities \(\left\| {v_i} \right\| _2^2\) can be easily precomputed, the problem reduces to computing all inner products \(\langle v_i, v_j \rangle \). These form the entries of \(A A^\intercal \), where A is the \(n\times d\) matrix with rows \(v_1, \ldots , v_n\).
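The reduction in this footnote can be sketched in a few lines of Python. This is an illustration only: it forms \(A A^\intercal\) with a naive triple loop, whereas the benefit of the reduction comes from using a fast matrix-multiplication routine for that product. The helper names `gram` and `pairwise_sq_dists` are hypothetical.

```python
def gram(rows):
    """Entries of A A^T: all inner products <v_i, v_j> of the rows of A."""
    return [[sum(a * b for a, b in zip(u, v)) for v in rows] for u in rows]


def pairwise_sq_dists(rows):
    """Squared Euclidean distances via ||v_i||^2 + ||v_j||^2 - 2 <v_i, v_j>."""
    g = gram(rows)
    n = len(rows)
    # The squared norms are the diagonal entries of the Gram matrix.
    return [[g[i][i] + g[j][j] - 2 * g[i][j] for j in range(n)] for i in range(n)]


points = [(0, 0), (3, 4), (1, 1)]
print(pairwise_sq_dists(points))
# [[0, 25, 2], [25, 0, 13], [2, 13, 0]]
```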

For \(\ell _2\), we can also use the Johnson-Lindenstrauss transform (Johnson and Lindenstrauss 1984).
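As a rough illustration of the Johnson-Lindenstrauss idea, the sketch below projects points onto a lower-dimensional space with a random Gaussian matrix scaled by \(1/\sqrt{k}\), which preserves squared \(\ell_2\) distances in expectation. It is a generic textbook construction and not necessarily the variant the authors use; the dimensions, seeds, and the names `jl_project` and `sq_dist` are assumptions chosen for the example.

```python
import math
import random


def jl_project(points, k, seed=0):
    """Map d-dimensional points to k dimensions with a random Gaussian matrix."""
    rng = random.Random(seed)
    d = len(points[0])
    scale = 1.0 / math.sqrt(k)  # keeps squared lengths correct in expectation
    proj = [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]
    return [[sum(r * x for r, x in zip(row, p)) for row in proj] for p in points]


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


# Illustrative data: 4 random points in 500 dimensions, reduced to 200.
data_rng = random.Random(1)
points = [[data_rng.uniform(-1.0, 1.0) for _ in range(500)] for _ in range(4)]
low = jl_project(points, k=200)

for i in range(4):
    for j in range(i + 1, 4):
        ratio = sq_dist(low[i], low[j]) / sq_dist(points[i], points[j])
        print(f"pair ({i},{j}): projected / original squared distance = {ratio:.3f}")
```

The printed ratios should be close to 1; a larger target dimension k tightens them, at the cost of a larger sketch.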

We denote as \(\binom{X}{k}\) the set of k-subsets of X, i.e., the subsets of X of size k.
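For readers who prefer code to notation, the set of k-subsets can be enumerated with Python's standard `itertools.combinations`. The helper name `k_subsets` is hypothetical and serves only to illustrate the definition.

```python
from itertools import combinations
from math import comb


def k_subsets(X, k):
    """All subsets of X that have exactly k elements."""
    return [frozenset(c) for c in combinations(sorted(X), k)]


X = {1, 2, 3, 4}
subsets = k_subsets(X, 2)
print(len(subsets), comb(len(X), 2))  # 6 6
```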

Further space-saving can be achieved by storing only densities above a certain threshold using adjacency lists; the superedges removed increase the reconstruction error.
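One way to picture this thresholding is the sketch below, which keeps only superedges whose density exceeds a cutoff and stores them as per-supernode adjacency lists (here, nested dictionaries). Treating a dropped superedge as having density 0 at query time is an assumption of this example, as are the names `sparsify` and `density`.

```python
def sparsify(dens, threshold):
    """Keep only superedges with density strictly above the threshold."""
    return {
        i: {j: w for j, w in enumerate(row) if w > threshold}
        for i, row in enumerate(dens)
    }


def density(sparse, i, j):
    """Look up a superedge weight; dropped superedges are read back as 0 (assumption)."""
    return sparse.get(i, {}).get(j, 0.0)


dens = [[0.5, 0.25], [0.25, 0.5]]
sparse = sparsify(dens, threshold=0.3)
print(sparse)                 # {0: {0: 0.5}, 1: {1: 0.5}}
print(density(sparse, 0, 1))  # 0.0, where the full summary stored 0.25
```

The gap between the stored value (0.0) and the original density (0.25) is the source of the extra reconstruction error mentioned in the footnote.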

Minor modifications are needed if self-loops are allowed.

http://snap.stanford.edu/data/.

http://irefindex.org.

For speed reasons, we modified the algorithm by Arya et al. (2004) to try only a limited number of local improvements and did not run it to completion. It could otherwise achieve even better approximations.

The implementation is available from https://github.com/rionda/graphsumm.

Aggarwal A, Deshpande A, Kannan R (2009) Adaptive sampling for k-means clustering. Approximation, randomization, and combinatorial optimization. Algorithms and techniques, APPROX-RANDOM. Springer, Berlin, pp 15–28

Aloise D, Deshpande A, Hansen P, Popat P (2009) NP-hardness of Euclidean sum-of-squares clustering. Mach Learn 75(2):245–248

Alon N, Duke RA, Lefmann H, Rödl V, Yuster R (1994) The algorithmic aspects of the regularity lemma. J Algorithms 16(1):80–109

Alon N, Naor A (2006) Approximating the cut-norm via Grothendieck’s inequality. SIAM J Comput 35(4):787–803

Arthur D, Vassilvitskii S (2007) \(k\)-means++: the advantages of careful seeding. In: Proceedings of the 18th annual ACM-SIAM symposium on discrete algorithms, SIAM, SODA ’07, pp 1027–1035

Arya V, Garg N, Khandekar R, Meyerson A, Munagala K, Pandit V (2004) Local search heuristics for \(k\)-median and facility location problems. SIAM J Comput 33(3):544–562

Bahmani B, Moseley B, Vattani A, Kumar R, Vassilvitskii S (2012) Scalable \(k\)-means++. Proc VLDB Endow 5(7):622–633

Boldi P, Santini M, Vigna S (2009) Permuting web and social graphs. Internet Math 6(3):257–283

Boldi P, Rosa M, Santini M, Vigna S (2011) Layered label propagation: a multiresolution coordinate-free ordering for compressing social networks. In: Proceedings of the 20th international conference on World Wide Web, ACM, WWW ’11, pp 587–596

Boldi P, Vigna S (2004) The WebGraph framework I: compression techniques. In: Proceedings of the 13th international conference on World Wide Web, ACM, WWW ’04, pp 595–602

Bonchi F, García-Soriano D, Kutzkov K (2013) Local correlation clustering. arXiv preprint arXiv:1312.5105v1

Campan A, Truta TM (2009) Data and structural k-anonymity in social networks. Privacy, security, and trust in KDD. Springer, Berlin, pp 33–54

Conlon D, Fox J (2012) Bounds for graph regularity and removal lemmas. Geom Funct Anal 22(5):1191–1256

Cormode G, Srivastava D, Yu T, Zhang Q (2010) Anonymizing bipartite graph data using safe groupings. VLDB J 19(1):115–139

Dasgupta S (2008) The hardness of \(k\)-means clustering. Tech. Rep. 09-16. University of California, San Diego

Dellamonica DJ, Kalyanasundaram S, Martin DM, Rödl V, Shapira A (2012) A deterministic algorithm for the Frieze-Kannan regularity lemma. SIAM J Discret Math 26(1):15–29

Dellamonica DJ, Kalyanasundaram S, Martin DM, Rödl V, Shapira A (2015) An optimal algorithm for finding Frieze-Kannan regular partitions. Comb Prob Comput 24(02):407–437

Fan W, Li J, Wang X, Wu Y (2012) Query preserving graph compression. In: Proceedings of the 2012 ACM SIGMOD international conference on management of data, ACM, SIGMOD ’12, pp 157–168

Frieze A, Kannan R (1999) Quick approximation to matrices and applications. Combinatorica 19(2):175–220

Gowers WT (1997) Lower bounds of tower type for Szemerédi’s uniformity lemma. Geom Funct Anal 7(2):322–337

Hay M, Miklau G, Jensen D, Towsley D, Li C (2010) Resisting structural re-identification in anonymized social networks. VLDB J 19(6):797–823

Hernández C, Navarro G (2011) Compression of web and social graphs supporting neighbor and community queries. In: Proceedings of the 6th ACM workshop on social network mining and analysis, ACM, SNAKDD ’11

Indyk P (2006) Stable distributions, pseudorandom generators, embeddings, and data stream computation. J ACM 53(3):307–323

Jain K, Vazirani VV (2001) Approximation algorithms for metric facility location and \(k\)-median problems using the primal-dual schema and Lagrangian relaxation. J ACM 48(2):274–296

Johnson WB, Lindenstrauss J (1984) Extensions of Lipschitz mappings into a Hilbert space. Contemp Math 26:189–206

LeFevre K, Terzi E (2010) GraSS: graph structure summarization. In: Proceedings of the 2010 SIAM international conference on data mining, SIAM, SDM ’10, pp 454–465

Liu Z, Yu JX, Cheng H (2012) Approximate homogeneous graph summarization. Inf Media Technol 7(1):32–43

Lloyd S (1982) Least squares quantization in PCM. IEEE Trans Inf Theory 28(2):129–137

Lovász L (2012) Large networks and graph limits. American Mathematical Society, Providence

Maserrat H, Pei J (2010) Neighbor query friendly compression of social networks. In: Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, KDD ’10, pp 533–542

Megiddo N, Supowit KJ (1984) On the complexity of some common geometric location problems. SIAM J Comput 13(1):182–196

Mettu RR, Plaxton CG (2003) The online median problem. SIAM J Comput 32(3):816–832

Navlakha S, Rastogi R, Shrivastava N (2008) Graph summarization with bounded error. In: Proceedings of the 2008 ACM SIGMOD international conference on Management of data, ACM, SIGMOD ’08, pp 419–432

Riondato M, García-Soriano D, Bonchi F (2014) Graph summarization with quality guarantees. In: 2014 IEEE international conference on data mining, IEEE, ICDM ’14, pp 947–952

Schaeffer SE (2007) Graph clustering. Comput Sci Rev 1(1):27–64

Szemerédi E (1976) Regular partitions of graphs. In: Problèmes Combinatoires et Théorie des Graphes, Colloq. Internat. CNRS, Univ. Orsay., pp 399–401

Tassa T, Cohen DJ (2013) Anonymization of centralized and distributed social networks by sequential clustering. IEEE Trans Knowl Data Eng 25(2):311–324

Tian Y, Hankins RA, Patel JM (2008) Efficient aggregation for graph summarization. In: Proceedings of the 2008 ACM SIGMOD international conference on management of data, ACM, SIGMOD ’08, pp 567–580

Toivonen H, Zhou F, Hartikainen A, Hinkka A (2011) Compression of weighted graphs. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, KDD ’11, pp 965–973

Tsourakakis CE (2008) Fast counting of triangles in large real networks without counting: algorithms and laws. In: 2008 IEEE international conference on data mining, IEEE, ICDM ’08, pp 608–617

Vassilevska Williams V (2011) Breaking the Coppersmith–Winograd barrier, unpublished manuscript

Ward JH (1963) Hierarchical grouping to optimize an objective function. J Am Stat Assoc 58(301):236–244

Williams D (1991) Probability with Martingales. Cambridge University Press, Cambridge

Zheleva E, Getoor L (2008) Preserving the privacy of sensitive relationships in graph data. In: Privacy, security, and trust in KDD, Springer, pp 153–171

Title: Graph summarization with quality guarantees
Authors: Matteo Riondato, David García-Soriano, Francesco Bonchi
Publication date: 6 June 2016
Publisher: Springer US
Published in: Data Mining and Knowledge Discovery, Issue 2/2017
Print ISSN: 1384-5810
Electronic ISSN: 1573-756X
DOI: https://doi.org/10.1007/s10618-016-0468-8
