Top

The Journal of Supercomputing

Published in:

12-04-2017

CCFinder: using Spark to find clustering coefficient in big graphs

Authors: Mehdi Alemi, Hassan Haghighi, Saeed Shahrivari

Published in: The Journal of Supercomputing | Issue 11/2017

Activate our intelligent search to find suitable subject content or patents.

search-config

AI-assisted search

Off

Abstract

Networks with billions of vertices introduce new challenges to perform graph analysis in a reasonable time. Clustering coefficient is an important analytical measure of networks such as social networks and biological networks. To compute clustering coefficient in big graphs, existing distributed algorithms suffer from low efficiency such that they may fail due to demanding lots of memory, or even, if they complete successfully, their execution time is not acceptable for real-world applications. We present a distributed MapReduce-based algorithm, called CCFinder, to efficiently compute clustering coefficient in very big graphs. CCFinder is executed on Apache Spark, a scalable data processing platform. It efficiently detects existing triangles through using our proposed data structure, called FONL, which is cached in the distributed memory provided by Spark and reused multiple times. As data items in the FONL are fine-grained and contain the minimum required information, CCFinder requires less storage space and has better parallelism in comparison with its competitors. To find clustering coefficient, our solution to triangle counting is extended to have degree information of the vertices in the appropriate places. We performed several experiments on a Spark cluster with 60 processors. The results show that CCFinder achieves acceptable scalability and outperforms six existing competitor methods. Four competitors are those methods proposed based on graph processing systems, i.e., GraphX, NScale, NScaleSpark, and Pregel frameworks, and two others are the Cohen’s method and NodeIterator++, introduced based on MapReduce.

previous article Prioritizing the solution of cloud service selection using integrated MCDM methods under Fuzzy environment

next article Reconstructing permutation table to improve the Tabu Search for the PFSP on GPU

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

inform now

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

inform now

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

inform now

Watts DJ, Strogatz SH (1998) Collective dynamics of ‘small-world’networks. Nature 393(6684):440–442CrossRefMATH

Newman ME (2003) The structure and function of complex networks. SIAM Rev 45(2):167–256CrossRefMATHMathSciNet

Kim BJ (2004) Performance of networks of artificial neurons: the role of clustering. Phys Rev E 69(4):045101CrossRef

Centola D (2010) The spread of behavior in an online social network experiment. Science 329(5996):1194–1197CrossRef

Huang Z (2006) Link prediction based on graph topology: the predictive value of generalized clustering coefficient. Paper presented at the Workshop on Link Analysis: Dynamics and Static of Large Networks (LinkKDD2006)

Goldstein R, Vitevitch MS (2013) The influence of clustering coefficient on word-learning: how groups of similar sounding words facilitate acquisition. Front Psychol 5:1307–1307

Newman ME (2009) Random graphs with clustering. Phys Rev Lett 103(5):058701CrossRef

Saramäki J, Kaski K (2004) Scale-free networks generated by random walkers. Phys A Stat Mech Appl 341:80–86CrossRefMathSciNet

Dorogovtsev SN, Goltsev AV, Mendes JFF (2002) Pseudofractal scale-free web. Phys Rev E 65(6):066122CrossRef

10.

Suri S, Vassilvitskii S (2011) Counting triangles and the curse of the last reducer. In: Proceedings of the 20th International Conference on World Wide Web, 2011. ACM, pp 607–614

11.

Chung FR, Lu L (2006) Complex graphs and networks, vol 107. American Mathematical Society, ProvidenceMATH

12.

Milo R, Shen-Orr S, Itzkovitz S, Kashtan N, Chklovskii D, Alon U (2002) Network motifs: simple building blocks of complex networks. Science 298(5594):824–827CrossRef

13.

Kwak H, Lee C, Park H, Moon S (2010) What is Twitter, a social network or a news media? In: Proceedings of the 19th International Conference on World Wide Web, 2010. ACM, pp 591–600

14.

Ye P, Peyser BD, Spencer FA, Bader JS (2005) Commensurate distances and similar motifs in genetic congruence and protein interaction networks in yeast. BMC Bioinform 6(1):270CrossRef

15.

White T (2012) Hadoop: the definitive guide. O’Reilly Media, Newton

16.

Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I (2010) Spark: cluster computing with working sets. HotCloud 10(10–10):95

17.

Meng X, Bradley J, Yavuz B, Sparks E, Venkataraman S, Liu D, Freeman J, Tsai D, Amde M, Owen S (2016) Mllib: machine learning in apache spark. J Mach Learn Res 17(34):1–7MATHMathSciNet

18.

Chen J, Li K, Tang Z, Bilal K, Yu S, Weng C, Li K (2017) A parallel random forest algorithm for big data in a Spark cloud computing environment. IEEE Trans Parallel Distrib Syst 28(4):919–933

19.

Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113CrossRef

20.

Malewicz G, Austern MH, Bik AJ, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, 2010. ACM, pp 135–146

21.

Quamar A, Deshpande A, Lin J (2016) NScale: neighborhood-centric large-scale graph analytics in the cloud. VLDB J 25(2):125–150CrossRef

22.

Low Y, Bickson D, Gonzalez J, Guestrin C, Kyrola A, Hellerstein JM (2012) Distributed GraphLab: a framework for machine learning and data mining in the cloud. Proc VLDB Endow 5(8):716–727CrossRef

23.

Gonzalez JE, Xin RS, Dave A, Crankshaw D, Franklin MJ, Stoica I (2014) GraphX: graph processing in a distributed dataflow framework. In: OSDI, 2014, pp 599–613

24.

Quamar A, Deshpande A (2016) NScaleSpark: subgraph-centric graph analytics on Apache Spark. In: Proceedings of the 1st ACM SIGMOD Workshop on Network Data Analytics, 2016. ACM, p 5

25.

Soffer SN, Vazquez A (2005) Network clustering coefficient without degree-correlation biases. Phys Rev E 71(5):057101CrossRef

26.

Spark: Lightning-fast cluster computing, http://spark.apache.org/docs/latest/programming-guide.html. Accessed 1 Oct 2016

27.

Ortmann M, Brandes U (2014) Triangle listing algorithms: back from the diversion. In: 2014 Proceedings of the Sixteenth Workshop on Algorithm Engineering and Experiments (ALENEX), 2014. SIAM, pp 1–8

28.

Schank T (2007) Algorithmic aspects of triangle-based network analysis. Dissertation, University Karlsruhe

29.

Schank T, Wagner D (2005) counting and listing all triangles in large graphs, an experimental study. In: International Workshop on Experimental and Efficient Algorithms, 2005. Springer, pp 606–609

30.

Latapy M (2008) Main-memory triangle computations for very large (sparse (power-law)) graphs. Theor Comput Sci 407(1–3):458–473CrossRefMATHMathSciNet

31.

Itai A, Rodeh M (1978) Finding a minimum circuit in a graph. SIAM J Comput 7(4):413–423CrossRefMATHMathSciNet

32.

Arifuzzaman S, Khan M, Marathe M (2013) PATRIC: a parallel algorithm for counting triangles in massive networks. In: Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, 2013. ACM, pp 529–538

33.

Cohen J (2009) Graph twiddling in a mapreduce world. Comput Sci Eng 11(4):29–41CrossRef

34.

Park H-M, Silvestri F, Kang U, Pagh R (2014) Mapreduce triangle enumeration with guarantees. In: Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, 2014. ACM, pp 1739–1748

35.

Park H-M, Chung C-W (2013) An efficient MapReduce algorithm for counting triangles in a very large graph. In: Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, 2013. ACM, pp 539–548

36.

Apache Giraph, http://giraph.apache.org/. Accessed 1 Oct 2016

37.

Gonzalez JE, Low Y, Gu H, Bickson D, Guestrin C (2012) PowerGraph: distributed graph-parallel computation on natural graphs. In: OSDI, 2012, vol 1, p 2

38.

Quick L, Wilkinson P, Hardcastle D (2012) Using pregel-like large scale graph processing frameworks for social network analysis. In: Proceedings of the 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012), 2012. IEEE Computer Society, pp 457–463

39.

SNAP: Stanford Network Analysis Project. http://snap.stanford.edu. Accessed 1 Oct 2016

40.

Yang J, Leskovec J (2015) Defining and evaluating network communities based on ground-truth. Knowl Inf Syst 42(1):181–213CrossRef

41.

Backstrom L, Huttenlocher D, Kleinberg J, Lan X (2006) Group formation in large social networks: membership, growth, and evolution. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006. ACM, pp 44–54

42.

Cha M, Haddadi H, Benevenuto F, Gummadi PK (2010) Measuring user influence in twitter: the million follower fallacy. ICWSM 10(10–17):30

Title: CCFinder: using Spark to find clustering coefficient in big graphs
Authors: Mehdi Alemi
Hassan Haghighi
Saeed Shahrivari
Publication date: 12-04-2017
Publisher: Springer US
Published in: The Journal of Supercomputing / Issue 11/2017
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI: https://doi.org/10.1007/s11227-017-2040-8

Springer Professional

Abstract

Please log in to get access to your license.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft"

Springer Professional "Technik"

Springer Professional "Wirtschaft+Technik"

Other articles of this Issue 11/2017

Soft error resilience of Big Data kernels through algorithmic approaches

Barrier coverage of WSNs with the imperialist competitive algorithm

Erratum to: Exploring correlation for fast skyline computation

Efficient exploitation of the Xeon Phi architecture for the Ant Colony Optimization (ACO) metaheuristic

Security analysis of a publicly verifiable data possession scheme for remote storage

Novel parity-preserving reversible logic array multipliers

Premium Partner