skip to main content
research-article

Efficient algorithms for large-scale local triangle counting

Published:22 October 2010Publication History
Skip Abstract Section

Abstract

In this article, we study the problem of approximate local triangle counting in large graphs. Namely, given a large graph G=(V,E) we want to estimate as accurately as possible the number of triangles incident to every node vV in the graph. We consider the question both for undirected and directed graphs. The problem of computing the global number of triangles in a graph has been considered before, but to our knowledge this is the first contribution that addresses the problem of approximate local triangle counting with a focus on the efficiency issues arising in massive graphs and that also considers the directed case. The distribution of the local number of triangles and the related local clustering coefficient can be used in many interesting applications. For example, we show that the measures we compute can help detect the presence of spamming activity in large-scale Web graphs, as well as to provide useful features for content quality assessment in social networks.

For computing the local number of triangles (undirected and directed), we propose two approximation algorithms, which are based on the idea of min-wise independent permutations [Broder et al. 1998]. Our algorithms operate in a semi-streaming fashion, using O(|V|) space in main memory and performing O(log |V|) sequential scans over the edges of the graph. The first algorithm we describe in this article also uses O(|E|) space of external memory during computation, while the second algorithm uses only main memory. We present the theoretical analysis as well as experimental results on large graphs, demonstrating the practical efficiency of our approach.

References

  1. Alon, N., Yuster, R., and Zwick, U. 1997. Finding and counting given length cycles. Algorithmica 17, 3, 209--223.Google ScholarGoogle ScholarCross RefCross Ref
  2. Bar-Yossef, Z., Kumar, R., and Sivakumar, D. 2002. Reductions in streaming algorithms, with an application to counting triangles in graphs. In Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Mathematics (SODA). 623--632. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Batagelj, V. and Mrvar, A. 2001. A subquadratic triad census algorithm for large sparse networks with small maximum degree. Social Netw. 23, 237--243.Google ScholarGoogle ScholarCross RefCross Ref
  4. Becchetti, L., Boldi, P., Castillo, C., and Gionis, A. 2008. Efficient semi-streaming algorithms for local triangle counting in massive graphs. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 16--24. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Bohman, T., Cooper, C., and Frieze, A. M. 2000. Min-wise independent linear permutations. Electronic J. Combinat. 7.Google ScholarGoogle Scholar
  6. Boldi, P. and Vigna, S. 2004. The webgraph framework I: compression techniques. In Proceedings of the 13th International Conference on World Wide Web. 595--602. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Bordino, I., Donato, D., Gionis, A., and Leonardi, S. 2008. Mining large networks with subgraph counting. In Proceedings of the IEEE International Conference on Data Mining (ICDM). 737--742. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Boykin, P. O. and Roychowdhury, V. P. 2005. Leveraging social networks to fight spam. Computer 38, 4, 61--68. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Broder, A. Z. 1998. On the resemblance and containment of documents. In Proceedings of the Compression and Complexity of Sequences, IEEE Computer Society. 21--29. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Broder, A. Z. 2000. Identifying and filtering near-duplicate documents. In Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching (CPM). 1--10. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Broder, A. Z., Charikar, M., Frieze, A. M., and Mitzenmacher, M. 1998. Min-wise independent permutations (extended abstract). In Proceedings of the 30th Annual ACM Symposium on Theory of Computing (STOC). 327--336. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Broder, A. Z., Glassman, S. C., Manasse, M. S., and Zweig, G. 1997. Syntactic clustering of the web. In Proceedings of the 6th International Conference on World Wide Web. Elsevier Science Publishers Ltd., 1157--1166. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Buriol, L. S., Frahling, G., Leonardi, S., Marchetti-Spaccamela, A., and Sohler, C. 2006. Counting triangles in data streams. In Proceedings of the 25th ACM Symposium on Principles of Database Systems (PODS). 253--262. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Buriol, L. S., Frahling, G., Leonardi, S., and Sohler, C. 2007. Estimating clustering indexes in data streams. In Proceedings of the 15th Annual European Symposium on Algorithms. 618--632. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Castillo, C., Donato, D., Becchetti, L., Boldi, P., Leonardi, S., Santini, M., and Vigna, S. 2006. A reference collection for web spam. SIGIR Forum 40, 2, 11--24. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Castillo, C., Donato, D., Gionis, A., Murdock, V., and Silvestri, F. 2007. Know your neighbors: Web spam detection using the web topology. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 423--430. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Coppersmith, D. and Kumar, R. 2004. An improved data stream algorithm for frequency moments. In Proceedings of the 15th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 151--156. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Coppersmith, D. and Winograd, S. 1990. Matrix multiplication via arithmetic progressions. J. Symb. Computat. 9, 3, 251--280. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. de Graaf, M., Schrijver, A., and Seymour, P. D. 1992. Directed triangles in directed graphs. Discr. Math. 110, 1-3, 279--282. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Demetrescu, C., Finocchi, I., and Ribichini, A. 2006. Trading off space for passes in graph streaming problems. In Proceedings of the 7th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 714--723. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Eckmann, J.-P. and Moses, E. 2002. Curvature of co-links uncovers hidden thematic layers in the world wide web. In Proceedings of the National Academy of Sciences (PNAS) 99, 9, 5825--5829.Google ScholarGoogle ScholarCross RefCross Ref
  22. Feigenbaum, J., Kannan, S., Gregor, M. A., Suri, S., and Zhang, J. 2004. On graph problems in a semi-streaming model. In Proceedings of the 31st International Colloquium on Automata, Languages and Programming (ICALP). 207--216.Google ScholarGoogle Scholar
  23. Fetterly, D., Manasse, M., and Najork, M. 2004. Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In Proceedings of the 7th Workshop on the Web and Databases (WebDB). 1--6. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Fogaras, D. and Rácz, B. 2005. Scaling link-based similarity search. In Proceedings of the 14th International Conference on World Wide Web. 641--650. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Gibson, D., Kumar, R., and Tomkins, A. 2005. Discovering large dense subgraphs in massive graphs. In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB). 721--732. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Gulli, A. and Signorini, A. 2005. The indexable Web is more than 11.5 billion pages. In Poster Proceedings of the 14th International Conference on World Wide Web. 902--903. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Haveliwala, T. 1999. Efficient computation of pagerank. Tech. Rep., Stanford University.Google ScholarGoogle Scholar
  28. Henzinger, M. R., Raghavan, P., and Rajagopalan, S. 1999. Computing on data streams. Dimacs Series in Discrete Mathematics and Theoretical Computer Science, 107--118. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Holme, P., Edling, C., and Liljeros, F. 2004. Structure and time-evolution of an internet dating community. Soc. Netw. 26, 155.Google ScholarGoogle ScholarCross RefCross Ref
  30. Indyk, P. 1999. A small approximately min-wise independent family of hash functions. In Proceedings of the 10th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 454--456. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Itai, A. and Rodeh, M. 1978. Finding a minimum circuit in a graph. SIAM J. Comput. 7, 4, 413--423.Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Jeh, G. and Widom, J. 2002. Simrank: A measure of structural-context similarity. In KDD '02: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, New York, 538--543. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Kumar, R., Raghavan, P., Rajagopalan, S., and Tomkins, A. 1999. Trawling the Web for emerging cyber-communities. Comput. Netwo. 31, 11--16, 1481--1493. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Latapy, M. 2008. Main-memory triangle computations for very large (sparse (power-law)) graphs. Theoret. Comput. Sci. 407, 1-3, 458--473. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Latapy, M. and Magnien, C. 2006. Measuring fundamental properties of real-world complex networks. arXiv:cs/0609115v2.Google ScholarGoogle Scholar
  36. Leskovec, J., Kleinberg, J., and Faloutsos, C. 2005. Graphs over time: densification laws, shrinking diameters and possible explanations. In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. 177--187. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Mitzenmacher, M. and Upfal, E. 2005. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Newman, M. E. J. 2003. The structure and function of complex networks. SIAM Review 45, 2, 167--256.Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Schank, T. and Wagner, D. 2005. Finding, counting and listing all triangles in large graphs, an experimental study. In Proceedings of the 4th International Workshop on Experimental and Efficient Algorithms (WEA). Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Tsourakakis, C. E. 2008. Fast counting of triangles in large real networks without counting: Algorithms and laws. In ICDM '08: Proceedings of the 2008 8th IEEE International Conference on Data Mining. IEEE Computer Society, Press, Los Alamitos, CA, 608--617. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Vitter, J. S. 2001. External memory algorithms and data structures. ACM Comput. Surveys 33, 2, 209--271. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Welser, H. T., Gleave, E., Fisher, D., and Smith, M. 2007. Visualizing the signatures of social roles in online discussion groups. J. Soc. Struct. 8, 2.Google ScholarGoogle Scholar

Index Terms

  1. Efficient algorithms for large-scale local triangle counting

            Recommendations

            Reviews

            Chenyi Hu

            Becchetti et al. present efficient approximation algorithms that count the number of local triangles in large graphs (both undirected and directed). They suggest possible applications of the counting algorithms for Web spam detection and local clustering of social networks. Counting local triangles in a small graph can be done with a brute-force approach. However, when counting local triangles in a large graph with millions of vertices and edges that model the Web, one must consider the spatial and time complexities of the counting algorithms. In this paper, the authors report that their algorithms use O (| V |) space in main memory and perform O ( log | V |) sequential scans over the edges of the graph. The basic building block of their algorithmic approach is "to estimate the size of the intersection of two sets" by using fast pseudorandom numbers to label the vertices of a graph. After presenting the theoretical foundations, the authors propose their counting algorithms for both undirected graphs and digraphs. Then, they report their experimental results on large graphs, mostly with millions of nodes and billions of edges. The experiments not only demonstrate the efficiency of their algorithms, but also show improved accuracy in some cases. This paper is theoretical. That being said, practitioners could reasonably implement the algorithms in pseudocode, particularly since readers can follow the links provided at the end of the paper to download the Java code and test applications. Online Computing Reviews Service

            Access critical reviews of Computing literature here

            Become a reviewer for Computing Reviews.

            Comments

            Login options

            Check if you have access through your login credentials or your institution to get full access on this article.

            Sign in

            Full Access

            • Published in

              cover image ACM Transactions on Knowledge Discovery from Data
              ACM Transactions on Knowledge Discovery from Data  Volume 4, Issue 3
              October 2010
              191 pages
              ISSN:1556-4681
              EISSN:1556-472X
              DOI:10.1145/1839490
              Issue’s Table of Contents

              Copyright © 2010 ACM

              Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Publication History

              • Published: 22 October 2010
              • Revised: 1 February 2010
              • Accepted: 1 February 2010
              • Received: 1 April 2009
              Published in tkdd Volume 4, Issue 3

              Permissions

              Request permissions about this article.

              Request Permissions

              Check for updates

              Qualifiers

              • research-article
              • Research
              • Refereed

            PDF Format

            View or Download as a PDF file.

            PDF

            eReader

            View online with eReader.

            eReader