Abstract
In this article, we study the problem of approximate local triangle counting in large graphs. Namely, given a large graph G=(V,E) we want to estimate as accurately as possible the number of triangles incident to every node v∈ V in the graph. We consider the question both for undirected and directed graphs. The problem of computing the global number of triangles in a graph has been considered before, but to our knowledge this is the first contribution that addresses the problem of approximate local triangle counting with a focus on the efficiency issues arising in massive graphs and that also considers the directed case. The distribution of the local number of triangles and the related local clustering coefficient can be used in many interesting applications. For example, we show that the measures we compute can help detect the presence of spamming activity in large-scale Web graphs, as well as to provide useful features for content quality assessment in social networks.
For computing the local number of triangles (undirected and directed), we propose two approximation algorithms, which are based on the idea of min-wise independent permutations [Broder et al. 1998]. Our algorithms operate in a semi-streaming fashion, using O(|V|) space in main memory and performing O(log |V|) sequential scans over the edges of the graph. The first algorithm we describe in this article also uses O(|E|) space of external memory during computation, while the second algorithm uses only main memory. We present the theoretical analysis as well as experimental results on large graphs, demonstrating the practical efficiency of our approach.
- Alon, N., Yuster, R., and Zwick, U. 1997. Finding and counting given length cycles. Algorithmica 17, 3, 209--223.Google ScholarCross Ref
- Bar-Yossef, Z., Kumar, R., and Sivakumar, D. 2002. Reductions in streaming algorithms, with an application to counting triangles in graphs. In Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Mathematics (SODA). 623--632. Google ScholarDigital Library
- Batagelj, V. and Mrvar, A. 2001. A subquadratic triad census algorithm for large sparse networks with small maximum degree. Social Netw. 23, 237--243.Google ScholarCross Ref
- Becchetti, L., Boldi, P., Castillo, C., and Gionis, A. 2008. Efficient semi-streaming algorithms for local triangle counting in massive graphs. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 16--24. Google ScholarDigital Library
- Bohman, T., Cooper, C., and Frieze, A. M. 2000. Min-wise independent linear permutations. Electronic J. Combinat. 7.Google Scholar
- Boldi, P. and Vigna, S. 2004. The webgraph framework I: compression techniques. In Proceedings of the 13th International Conference on World Wide Web. 595--602. Google ScholarDigital Library
- Bordino, I., Donato, D., Gionis, A., and Leonardi, S. 2008. Mining large networks with subgraph counting. In Proceedings of the IEEE International Conference on Data Mining (ICDM). 737--742. Google ScholarDigital Library
- Boykin, P. O. and Roychowdhury, V. P. 2005. Leveraging social networks to fight spam. Computer 38, 4, 61--68. Google ScholarDigital Library
- Broder, A. Z. 1998. On the resemblance and containment of documents. In Proceedings of the Compression and Complexity of Sequences, IEEE Computer Society. 21--29. Google ScholarDigital Library
- Broder, A. Z. 2000. Identifying and filtering near-duplicate documents. In Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching (CPM). 1--10. Google ScholarDigital Library
- Broder, A. Z., Charikar, M., Frieze, A. M., and Mitzenmacher, M. 1998. Min-wise independent permutations (extended abstract). In Proceedings of the 30th Annual ACM Symposium on Theory of Computing (STOC). 327--336. Google ScholarDigital Library
- Broder, A. Z., Glassman, S. C., Manasse, M. S., and Zweig, G. 1997. Syntactic clustering of the web. In Proceedings of the 6th International Conference on World Wide Web. Elsevier Science Publishers Ltd., 1157--1166. Google ScholarDigital Library
- Buriol, L. S., Frahling, G., Leonardi, S., Marchetti-Spaccamela, A., and Sohler, C. 2006. Counting triangles in data streams. In Proceedings of the 25th ACM Symposium on Principles of Database Systems (PODS). 253--262. Google ScholarDigital Library
- Buriol, L. S., Frahling, G., Leonardi, S., and Sohler, C. 2007. Estimating clustering indexes in data streams. In Proceedings of the 15th Annual European Symposium on Algorithms. 618--632. Google ScholarDigital Library
- Castillo, C., Donato, D., Becchetti, L., Boldi, P., Leonardi, S., Santini, M., and Vigna, S. 2006. A reference collection for web spam. SIGIR Forum 40, 2, 11--24. Google ScholarDigital Library
- Castillo, C., Donato, D., Gionis, A., Murdock, V., and Silvestri, F. 2007. Know your neighbors: Web spam detection using the web topology. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 423--430. Google ScholarDigital Library
- Coppersmith, D. and Kumar, R. 2004. An improved data stream algorithm for frequency moments. In Proceedings of the 15th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 151--156. Google ScholarDigital Library
- Coppersmith, D. and Winograd, S. 1990. Matrix multiplication via arithmetic progressions. J. Symb. Computat. 9, 3, 251--280. Google ScholarDigital Library
- de Graaf, M., Schrijver, A., and Seymour, P. D. 1992. Directed triangles in directed graphs. Discr. Math. 110, 1-3, 279--282. Google ScholarDigital Library
- Demetrescu, C., Finocchi, I., and Ribichini, A. 2006. Trading off space for passes in graph streaming problems. In Proceedings of the 7th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 714--723. Google ScholarDigital Library
- Eckmann, J.-P. and Moses, E. 2002. Curvature of co-links uncovers hidden thematic layers in the world wide web. In Proceedings of the National Academy of Sciences (PNAS) 99, 9, 5825--5829.Google ScholarCross Ref
- Feigenbaum, J., Kannan, S., Gregor, M. A., Suri, S., and Zhang, J. 2004. On graph problems in a semi-streaming model. In Proceedings of the 31st International Colloquium on Automata, Languages and Programming (ICALP). 207--216.Google Scholar
- Fetterly, D., Manasse, M., and Najork, M. 2004. Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In Proceedings of the 7th Workshop on the Web and Databases (WebDB). 1--6. Google ScholarDigital Library
- Fogaras, D. and Rácz, B. 2005. Scaling link-based similarity search. In Proceedings of the 14th International Conference on World Wide Web. 641--650. Google ScholarDigital Library
- Gibson, D., Kumar, R., and Tomkins, A. 2005. Discovering large dense subgraphs in massive graphs. In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB). 721--732. Google ScholarDigital Library
- Gulli, A. and Signorini, A. 2005. The indexable Web is more than 11.5 billion pages. In Poster Proceedings of the 14th International Conference on World Wide Web. 902--903. Google ScholarDigital Library
- Haveliwala, T. 1999. Efficient computation of pagerank. Tech. Rep., Stanford University.Google Scholar
- Henzinger, M. R., Raghavan, P., and Rajagopalan, S. 1999. Computing on data streams. Dimacs Series in Discrete Mathematics and Theoretical Computer Science, 107--118. Google ScholarDigital Library
- Holme, P., Edling, C., and Liljeros, F. 2004. Structure and time-evolution of an internet dating community. Soc. Netw. 26, 155.Google ScholarCross Ref
- Indyk, P. 1999. A small approximately min-wise independent family of hash functions. In Proceedings of the 10th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 454--456. Google ScholarDigital Library
- Itai, A. and Rodeh, M. 1978. Finding a minimum circuit in a graph. SIAM J. Comput. 7, 4, 413--423.Google ScholarDigital Library
- Jeh, G. and Widom, J. 2002. Simrank: A measure of structural-context similarity. In KDD '02: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, New York, 538--543. Google ScholarDigital Library
- Kumar, R., Raghavan, P., Rajagopalan, S., and Tomkins, A. 1999. Trawling the Web for emerging cyber-communities. Comput. Netwo. 31, 11--16, 1481--1493. Google ScholarDigital Library
- Latapy, M. 2008. Main-memory triangle computations for very large (sparse (power-law)) graphs. Theoret. Comput. Sci. 407, 1-3, 458--473. Google ScholarDigital Library
- Latapy, M. and Magnien, C. 2006. Measuring fundamental properties of real-world complex networks. arXiv:cs/0609115v2.Google Scholar
- Leskovec, J., Kleinberg, J., and Faloutsos, C. 2005. Graphs over time: densification laws, shrinking diameters and possible explanations. In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. 177--187. Google ScholarDigital Library
- Mitzenmacher, M. and Upfal, E. 2005. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press. Google ScholarDigital Library
- Newman, M. E. J. 2003. The structure and function of complex networks. SIAM Review 45, 2, 167--256.Google ScholarDigital Library
- Schank, T. and Wagner, D. 2005. Finding, counting and listing all triangles in large graphs, an experimental study. In Proceedings of the 4th International Workshop on Experimental and Efficient Algorithms (WEA). Google ScholarDigital Library
- Tsourakakis, C. E. 2008. Fast counting of triangles in large real networks without counting: Algorithms and laws. In ICDM '08: Proceedings of the 2008 8th IEEE International Conference on Data Mining. IEEE Computer Society, Press, Los Alamitos, CA, 608--617. Google ScholarDigital Library
- Vitter, J. S. 2001. External memory algorithms and data structures. ACM Comput. Surveys 33, 2, 209--271. Google ScholarDigital Library
- Welser, H. T., Gleave, E., Fisher, D., and Smith, M. 2007. Visualizing the signatures of social roles in online discussion groups. J. Soc. Struct. 8, 2.Google Scholar
Index Terms
- Efficient algorithms for large-scale local triangle counting
Recommendations
Efficient semi-streaming algorithms for local triangle counting in massive graphs
KDD '08: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data miningIn this paper we study the problem of local triangle counting in large graphs. Namely, given a large graph G = (V;E) we want to estimate as accurately as possible the number of triangles incident to every node υ ∈ V in the graph. The problem of ...
Large Induced Forests in Triangle-Free Planar Graphs
AbstractGiven a planar graph G, what is the largest subset of vertices of G that induces a forest? Albertson and Berman [2] conjectured that every planar graph has an induced subgraph on at least half of the vertices that is a forest. For bipartite planar ...
Dense graphs with a large triangle cover have a large triangle packing
It is well known that a graph with m edges can be made triangle-free by removing (slightly less than) m /2 edges. On the other hand, there are many classes of graphs which are hard to make triangle-free, in the sense that it is necessary to remove ...
Comments