ABSTRACT
The clustering coefficient of a node in a social network is a fundamental measure that quantifies how tightly-knit the community is around the node. Its computation can be reduced to counting the number of triangles incident on the particular node in the network. In case the graph is too big to fit into memory, this is a non-trivial task, and previous researchers showed how to estimate the clustering coefficient in this scenario. A different avenue of research is to to perform the computation in parallel, spreading it across many machines. In recent years MapReduce has emerged as a de facto programming paradigm for parallel computation on massive data sets. The main focus of this work is to give MapReduce algorithms for counting triangles which we use to compute clustering coefficients.
Our contributions are twofold. First, we describe a sequential triangle counting algorithm and show how to adapt it to the MapReduce setting. This algorithm achieves a factor of 10-100 speed up over the naive approach. Second, we present a new algorithm designed specifically for the MapReduce framework. A key feature of this approach is that it allows for a smooth tradeoff between the memory available on each individual machine and the total memory available to the algorithm, while keeping the total work done constant. Moreover, this algorithm can use any triangle counting algorithm as a black box and distribute the computation across many machines. We validate our algorithms on real world datasets comprising of millions of nodes and over a billion edges. Our results show both algorithms effectively deal with skew in the degree distribution and lead to dramatic speed ups over the naive implementation.
- Albert-László Barabási and Réka Albert. Emergence of scaling in random networks. Science, 286(5439):509--512, October 1999.Google ScholarCross Ref
- Luca Becchetti, Paolo Boldi, Carlos Castillo, and Aristides Gionis. Efficient semi-streaming algorithms for local triangle counting in massive graphs. In Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '08, pages 16--24, New York, NY, USA, 2008. ACM. Google ScholarDigital Library
- Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. Syntactic clustering of the web. Computer Networks, 29(8-13):1157--1166, 1997. Google ScholarDigital Library
- Luciana S. Buriol, Gereon Frahling, Stefano Leonardi, Alberto Marchetti-Spaccamela, and Christian Sohler. Counting triangles in data streams. In Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, PODS '06, pages 253--262, New York, NY, USA, 2006. ACM. Google ScholarDigital Library
- Ronald S. Burt. Structural holes and good ideas. American Journal of Sociology, 110(2):349--99, September 2004.Google ScholarCross Ref
- Ronald S. Burt. Second-hand brokerage: Evidence on the importance of local structure for managers, bankers, and analysts. Academy of Management Journal, 50:119--148, 2007.Google ScholarCross Ref
- Norishige Chiba and Takao Nishizeki. Arboricity and subgraph listing algorithms. SIAM J. Comput., 14(1):210--223, 1985. Google ScholarDigital Library
- James S. Coleman. Social capital in the creation of human capital. American Journal of Sociology, 94:S95--S120, 1988.Google ScholarCross Ref
- Don Coppersmith and Ravi Kumar. An improved data stream algorithm for frequency moments. In Proceedings of the fifteenth annual ACM-SIAM symposium on Discrete algorithms, SODA '04, pages 151--156, Philadelphia, PA, USA, 2004. Society for Industrial and Applied Mathematics. Google ScholarDigital Library
- Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In Proceedings of OSDI, pages 137--150, 2004. Google ScholarDigital Library
- Howard Karloff, Siddharth Suri, and Sergei Vassilvitskii. A model of computation for MapReduce. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 938--948, 2010. Google ScholarCross Ref
- Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. What is twitter, a social network or news media. In Proc. 19th International World Wide Web Conference, 2010. Google ScholarDigital Library
- Jure Leskovec and Eric Horvitz. Planetary-scale views on a large instant-messaging network. In Proc. 17th International World Wide Web Conference, 2008. Google ScholarDigital Library
- Jimmy Lin and Chris Dyer. Data-Intensive Text Processing with MapReduce. Number 7 in Synthesis Lectures on Human Language Technologies. Morgan and Claypool Publishers, April 2010. Google ScholarDigital Library
- Alejandro Portes. Social capital: Its origins and applications in modern sociology. Annual Review of Sociology, 24:1--24, 1998.Google ScholarCross Ref
- Thomas Schank. Algorithmic Aspects of Triangle-Based Network Analysis. PhD thesis, Universität Karlsruhe (TH), 2007.Google Scholar
- Charalampos E. Tsourakakis, U. Kang, Gary L. Miller, and Christos Faloutsos. Doulion: Counting triangles in massive graphs with a coin. In Knowledge Discovery and Data Mining (KDD), 2009. Google ScholarDigital Library
- Duncan Watts and Steve Strogatz. Collective dynamics of 'small-world' networks. Nature, 383:440--442, 1998.Google ScholarCross Ref
- Tom White. Hadoop: The Definitive Guide. O'Reilly Media, 2009. Google ScholarDigital Library
Index Terms
- Counting triangles and the curse of the last reducer
Recommendations
Efficient semi-streaming algorithms for local triangle counting in massive graphs
KDD '08: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data miningIn this paper we study the problem of local triangle counting in large graphs. Namely, given a large graph G = (V;E) we want to estimate as accurately as possible the number of triangles incident to every node υ ∈ V in the graph. The problem of ...
DOULION: counting triangles in massive graphs with a coin
KDD '09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data miningCounting the number of triangles in a graph is a beautiful algorithmic problem which has gained importance over the last years due to its significant role in complex network analysis. Metrics frequently computed such as the clustering coefficient and ...
An efficient MapReduce algorithm for counting triangles in a very large graph
CIKM '13: Proceedings of the 22nd ACM international conference on Information & Knowledge ManagementTriangle counting problem is one of the fundamental problem in various domains. The problem can be utilized for computation of clustering coefficient, transitivity, trianglular connectivity, trusses, etc. The problem have been extensively studied in ...
Comments