research-article

Efficient algorithms for large-scale local triangle counting

Authors:
Luca Becchetti

“Sapienza” Università di Roma, Rome, Italy

“Sapienza” Università di Roma, Rome, Italy
View Profile

,
Paolo Boldi

Università degli Studi di Milano, Milan, Italy

Università degli Studi di Milano, Milan, Italy
View Profile

,
Carlos Castillo

Yahoo! Research, Spain, Barcelona Spain

Yahoo! Research, Spain, Barcelona Spain
View Profile

,
Aristides Gionis

Yahoo! Research, Spain, Barcelona Spain

Yahoo! Research, Spain, Barcelona Spain
View Profile

ACM Transactions on Knowledge Discovery from Data Volume 4 Issue 3Article No.: 13pp 1–28https://doi.org/10.1145/1839490.1839494

Published:22 October 2010Publication History

ACM Transactions on Knowledge Discovery from Data

Abstract

In this article, we study the problem of approximate local triangle counting in large graphs. Namely, given a large graph G=(V,E) we want to estimate as accurately as possible the number of triangles incident to every node v∈ V in the graph. We consider the question both for undirected and directed graphs. The problem of computing the global number of triangles in a graph has been considered before, but to our knowledge this is the first contribution that addresses the problem of approximate local triangle counting with a focus on the efficiency issues arising in massive graphs and that also considers the directed case. The distribution of the local number of triangles and the related local clustering coefficient can be used in many interesting applications. For example, we show that the measures we compute can help detect the presence of spamming activity in large-scale Web graphs, as well as to provide useful features for content quality assessment in social networks.

For computing the local number of triangles (undirected and directed), we propose two approximation algorithms, which are based on the idea of min-wise independent permutations [Broder et al. 1998]. Our algorithms operate in a semi-streaming fashion, using O(|V|) space in main memory and performing O(log |V|) sequential scans over the edges of the graph. The first algorithm we describe in this article also uses O(|E|) space of external memory during computation, while the second algorithm uses only main memory. We present the theoretical analysis as well as experimental results on large graphs, demonstrating the practical efficiency of our approach.

References

Alon, N., Yuster, R., and Zwick, U. 1997. Finding and counting given length cycles. Algorithmica 17, 3, 209--223.Google ScholarCross Ref
Bar-Yossef, Z., Kumar, R., and Sivakumar, D. 2002. Reductions in streaming algorithms, with an application to counting triangles in graphs. In Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Mathematics (SODA). 623--632. Google ScholarDigital Library
Batagelj, V. and Mrvar, A. 2001. A subquadratic triad census algorithm for large sparse networks with small maximum degree. Social Netw. 23, 237--243.Google ScholarCross Ref
Becchetti, L., Boldi, P., Castillo, C., and Gionis, A. 2008. Efficient semi-streaming algorithms for local triangle counting in massive graphs. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 16--24. Google ScholarDigital Library
Bohman, T., Cooper, C., and Frieze, A. M. 2000. Min-wise independent linear permutations. Electronic J. Combinat. 7.Google Scholar
Boldi, P. and Vigna, S. 2004. The webgraph framework I: compression techniques. In Proceedings of the 13th International Conference on World Wide Web. 595--602. Google ScholarDigital Library
Bordino, I., Donato, D., Gionis, A., and Leonardi, S. 2008. Mining large networks with subgraph counting. In Proceedings of the IEEE International Conference on Data Mining (ICDM). 737--742. Google ScholarDigital Library
Boykin, P. O. and Roychowdhury, V. P. 2005. Leveraging social networks to fight spam. Computer 38, 4, 61--68. Google ScholarDigital Library
Broder, A. Z. 1998. On the resemblance and containment of documents. In Proceedings of the Compression and Complexity of Sequences, IEEE Computer Society. 21--29. Google ScholarDigital Library
Broder, A. Z. 2000. Identifying and filtering near-duplicate documents. In Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching (CPM). 1--10. Google ScholarDigital Library
Broder, A. Z., Charikar, M., Frieze, A. M., and Mitzenmacher, M. 1998. Min-wise independent permutations (extended abstract). In Proceedings of the 30th Annual ACM Symposium on Theory of Computing (STOC). 327--336. Google ScholarDigital Library
Broder, A. Z., Glassman, S. C., Manasse, M. S., and Zweig, G. 1997. Syntactic clustering of the web. In Proceedings of the 6th International Conference on World Wide Web. Elsevier Science Publishers Ltd., 1157--1166. Google ScholarDigital Library
Buriol, L. S., Frahling, G., Leonardi, S., Marchetti-Spaccamela, A., and Sohler, C. 2006. Counting triangles in data streams. In Proceedings of the 25th ACM Symposium on Principles of Database Systems (PODS). 253--262. Google ScholarDigital Library
Buriol, L. S., Frahling, G., Leonardi, S., and Sohler, C. 2007. Estimating clustering indexes in data streams. In Proceedings of the 15th Annual European Symposium on Algorithms. 618--632. Google ScholarDigital Library
Castillo, C., Donato, D., Becchetti, L., Boldi, P., Leonardi, S., Santini, M., and Vigna, S. 2006. A reference collection for web spam. SIGIR Forum 40, 2, 11--24. Google ScholarDigital Library
Castillo, C., Donato, D., Gionis, A., Murdock, V., and Silvestri, F. 2007. Know your neighbors: Web spam detection using the web topology. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 423--430. Google ScholarDigital Library
Coppersmith, D. and Kumar, R. 2004. An improved data stream algorithm for frequency moments. In Proceedings of the 15th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 151--156. Google ScholarDigital Library
Coppersmith, D. and Winograd, S. 1990. Matrix multiplication via arithmetic progressions. J. Symb. Computat. 9, 3, 251--280. Google ScholarDigital Library
de Graaf, M., Schrijver, A., and Seymour, P. D. 1992. Directed triangles in directed graphs. Discr. Math. 110, 1-3, 279--282. Google ScholarDigital Library
Demetrescu, C., Finocchi, I., and Ribichini, A. 2006. Trading off space for passes in graph streaming problems. In Proceedings of the 7th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 714--723. Google ScholarDigital Library
Eckmann, J.-P. and Moses, E. 2002. Curvature of co-links uncovers hidden thematic layers in the world wide web. In Proceedings of the National Academy of Sciences (PNAS) 99, 9, 5825--5829.Google ScholarCross Ref
Feigenbaum, J., Kannan, S., Gregor, M. A., Suri, S., and Zhang, J. 2004. On graph problems in a semi-streaming model. In Proceedings of the 31st International Colloquium on Automata, Languages and Programming (ICALP). 207--216.Google Scholar
Fetterly, D., Manasse, M., and Najork, M. 2004. Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In Proceedings of the 7th Workshop on the Web and Databases (WebDB). 1--6. Google ScholarDigital Library
Fogaras, D. and Rácz, B. 2005. Scaling link-based similarity search. In Proceedings of the 14th International Conference on World Wide Web. 641--650. Google ScholarDigital Library
Gibson, D., Kumar, R., and Tomkins, A. 2005. Discovering large dense subgraphs in massive graphs. In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB). 721--732. Google ScholarDigital Library
Gulli, A. and Signorini, A. 2005. The indexable Web is more than 11.5 billion pages. In Poster Proceedings of the 14th International Conference on World Wide Web. 902--903. Google ScholarDigital Library
Haveliwala, T. 1999. Efficient computation of pagerank. Tech. Rep., Stanford University.Google Scholar
Henzinger, M. R., Raghavan, P., and Rajagopalan, S. 1999. Computing on data streams. Dimacs Series in Discrete Mathematics and Theoretical Computer Science, 107--118. Google ScholarDigital Library
Holme, P., Edling, C., and Liljeros, F. 2004. Structure and time-evolution of an internet dating community. Soc. Netw. 26, 155.Google ScholarCross Ref
Indyk, P. 1999. A small approximately min-wise independent family of hash functions. In Proceedings of the 10th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 454--456. Google ScholarDigital Library
Itai, A. and Rodeh, M. 1978. Finding a minimum circuit in a graph. SIAM J. Comput. 7, 4, 413--423.Google ScholarDigital Library
Jeh, G. and Widom, J. 2002. Simrank: A measure of structural-context similarity. In KDD '02: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, New York, 538--543. Google ScholarDigital Library
Kumar, R., Raghavan, P., Rajagopalan, S., and Tomkins, A. 1999. Trawling the Web for emerging cyber-communities. Comput. Netwo. 31, 11--16, 1481--1493. Google ScholarDigital Library
Latapy, M. 2008. Main-memory triangle computations for very large (sparse (power-law)) graphs. Theoret. Comput. Sci. 407, 1-3, 458--473. Google ScholarDigital Library
Latapy, M. and Magnien, C. 2006. Measuring fundamental properties of real-world complex networks. arXiv:cs/0609115v2.Google Scholar
Leskovec, J., Kleinberg, J., and Faloutsos, C. 2005. Graphs over time: densification laws, shrinking diameters and possible explanations. In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. 177--187. Google ScholarDigital Library
Mitzenmacher, M. and Upfal, E. 2005. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press. Google ScholarDigital Library
Newman, M. E. J. 2003. The structure and function of complex networks. SIAM Review 45, 2, 167--256.Google ScholarDigital Library
Schank, T. and Wagner, D. 2005. Finding, counting and listing all triangles in large graphs, an experimental study. In Proceedings of the 4th International Workshop on Experimental and Efficient Algorithms (WEA). Google ScholarDigital Library
Tsourakakis, C. E. 2008. Fast counting of triangles in large real networks without counting: Algorithms and laws. In ICDM '08: Proceedings of the 2008 8th IEEE International Conference on Data Mining. IEEE Computer Society, Press, Los Alamitos, CA, 608--617. Google ScholarDigital Library
Vitter, J. S. 2001. External memory algorithms and data structures. ACM Comput. Surveys 33, 2, 209--271. Google ScholarDigital Library
Welser, H. T., Gleave, E., Fisher, D., and Smith, M. 2007. Visualizing the signatures of social roles in online discussion groups. J. Soc. Struct. 8, 2.Google Scholar

Index Terms

Efficient algorithms for large-scale local triangle counting

Recommendations

Efficient semi-streaming algorithms for local triangle counting in massive graphs
KDD '08: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining

In this paper we study the problem of local triangle counting in large graphs. Namely, given a large graph G = (V;E) we want to estimate as accurately as possible the number of triangles incident to every node υ ∈ V in the graph. The problem of ...
Read More
Large Induced Forests in Triangle-Free Planar Graphs
Abstract
Given a planar graph G, what is the largest subset of vertices of G that induces a forest? Albertson and Berman [2] conjectured that every planar graph has an induced subgraph on at least half of the vertices that is a forest. For bipartite planar ...
Read More
Dense graphs with a large triangle cover have a large triangle packing

It is well known that a graph with m edges can be made triangle-free by removing (slightly less than) m /2 edges. On the other hand, there are many classes of graphs which are hard to make triangle-free, in the sense that it is necessary to remove ...
Read More

Reviews

Reviewer: Chenyi Hu

Becchetti et al. present efficient approximation algorithms that count the number of local triangles in large graphs (both undirected and directed). They suggest possible applications of the counting algorithms for Web spam detection and local clustering of social networks. Counting local triangles in a small graph can be done with a brute-force approach. However, when counting local triangles in a large graph with millions of vertices and edges that model the Web, one must consider the spatial and time complexities of the counting algorithms. In this paper, the authors report that their algorithms use O (| V |) space in main memory and perform O ( log | V |) sequential scans over the edges of the graph. The basic building block of their algorithmic approach is "to estimate the size of the intersection of two sets" by using fast pseudorandom numbers to label the vertices of a graph. After presenting the theoretical foundations, the authors propose their counting algorithms for both undirected graphs and digraphs. Then, they report their experimental results on large graphs, mostly with millions of nodes and billions of edges. The experiments not only demonstrate the efficiency of their algorithms, but also show improved accuracy in some cases. This paper is theoretical. That being said, practitioners could reasonably implement the algorithms in pseudocode, particularly since readers can follow the links provided at the end of the paper to download the Java code and test applications. Online Computing Reviews Service

Access critical reviews of Computing literature here

Become a reviewer for Computing Reviews.

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

ACM Transactions on Knowledge Discovery from Data Volume 4, Issue 3
October 2010
191 pages
ISSN:1556-4681
EISSN:1556-472X
DOI:10.1145/1839490
Issue’s Table of Contents

Copyright © 2010 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 22 October 2010
- Revised: 1 February 2010
- Accepted: 1 February 2010
- Received: 1 April 2009
Published in tkdd Volume 4, Issue 3

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Clustering coefficient
Web computing
massive-graph computing
social networks
Qualifiers
- research-article
- Research
- Refereed
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 59
  Total Citations
  View Citations
- 845
  Total Downloads
- Downloads (Last 12 months)37
- Downloads (Last 6 weeks)4
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Efficient algorithms for large-scale local triangle counting

ACM Transactions on Knowledge Discovery from Data

Abstract

References

Cited By

Index Terms

Recommendations

Efficient semi-streaming algorithms for local triangle counting in massive graphs

Large Induced Forests in Triangle-Free Planar Graphs

Dense graphs with a large triangle cover have a large triangle packing

Reviews

Access critical reviews of Computing literature here

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Efficient algorithms for large-scale local triangle counting

ACM Transactions on Knowledge Discovery from Data

Abstract

References

Cited By

Index Terms

Recommendations

Efficient semi-streaming algorithms for local triangle counting in massive graphs

Large Induced Forests in Triangle-Free Planar Graphs

Dense graphs with a large triangle cover have a large triangle packing

Reviews

Access critical reviews of Computing literature here

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media