Skip to main content
Erschienen in: The Journal of Supercomputing 11/2017

12.04.2017

CCFinder: using Spark to find clustering coefficient in big graphs

verfasst von: Mehdi Alemi, Hassan Haghighi, Saeed Shahrivari

Erschienen in: The Journal of Supercomputing | Ausgabe 11/2017

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Networks with billions of vertices introduce new challenges to perform graph analysis in a reasonable time. Clustering coefficient is an important analytical measure of networks such as social networks and biological networks. To compute clustering coefficient in big graphs, existing distributed algorithms suffer from low efficiency such that they may fail due to demanding lots of memory, or even, if they complete successfully, their execution time is not acceptable for real-world applications. We present a distributed MapReduce-based algorithm, called CCFinder, to efficiently compute clustering coefficient in very big graphs. CCFinder is executed on Apache Spark, a scalable data processing platform. It efficiently detects existing triangles through using our proposed data structure, called FONL, which is cached in the distributed memory provided by Spark and reused multiple times. As data items in the FONL are fine-grained and contain the minimum required information, CCFinder requires less storage space and has better parallelism in comparison with its competitors. To find clustering coefficient, our solution to triangle counting is extended to have degree information of the vertices in the appropriate places. We performed several experiments on a Spark cluster with 60 processors. The results show that CCFinder achieves acceptable scalability and outperforms six existing competitor methods. Four competitors are those methods proposed based on graph processing systems, i.e., GraphX, NScale, NScaleSpark, and Pregel frameworks, and two others are the Cohen’s method and NodeIterator++, introduced based on MapReduce.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Literatur
1.
Zurück zum Zitat Watts DJ, Strogatz SH (1998) Collective dynamics of ‘small-world’networks. Nature 393(6684):440–442CrossRefMATH Watts DJ, Strogatz SH (1998) Collective dynamics of ‘small-world’networks. Nature 393(6684):440–442CrossRefMATH
3.
Zurück zum Zitat Kim BJ (2004) Performance of networks of artificial neurons: the role of clustering. Phys Rev E 69(4):045101CrossRef Kim BJ (2004) Performance of networks of artificial neurons: the role of clustering. Phys Rev E 69(4):045101CrossRef
4.
Zurück zum Zitat Centola D (2010) The spread of behavior in an online social network experiment. Science 329(5996):1194–1197CrossRef Centola D (2010) The spread of behavior in an online social network experiment. Science 329(5996):1194–1197CrossRef
5.
Zurück zum Zitat Huang Z (2006) Link prediction based on graph topology: the predictive value of generalized clustering coefficient. Paper presented at the Workshop on Link Analysis: Dynamics and Static of Large Networks (LinkKDD2006) Huang Z (2006) Link prediction based on graph topology: the predictive value of generalized clustering coefficient. Paper presented at the Workshop on Link Analysis: Dynamics and Static of Large Networks (LinkKDD2006)
6.
Zurück zum Zitat Goldstein R, Vitevitch MS (2013) The influence of clustering coefficient on word-learning: how groups of similar sounding words facilitate acquisition. Front Psychol 5:1307–1307 Goldstein R, Vitevitch MS (2013) The influence of clustering coefficient on word-learning: how groups of similar sounding words facilitate acquisition. Front Psychol 5:1307–1307
7.
Zurück zum Zitat Newman ME (2009) Random graphs with clustering. Phys Rev Lett 103(5):058701CrossRef Newman ME (2009) Random graphs with clustering. Phys Rev Lett 103(5):058701CrossRef
8.
Zurück zum Zitat Saramäki J, Kaski K (2004) Scale-free networks generated by random walkers. Phys A Stat Mech Appl 341:80–86CrossRefMathSciNet Saramäki J, Kaski K (2004) Scale-free networks generated by random walkers. Phys A Stat Mech Appl 341:80–86CrossRefMathSciNet
9.
Zurück zum Zitat Dorogovtsev SN, Goltsev AV, Mendes JFF (2002) Pseudofractal scale-free web. Phys Rev E 65(6):066122CrossRef Dorogovtsev SN, Goltsev AV, Mendes JFF (2002) Pseudofractal scale-free web. Phys Rev E 65(6):066122CrossRef
10.
Zurück zum Zitat Suri S, Vassilvitskii S (2011) Counting triangles and the curse of the last reducer. In: Proceedings of the 20th International Conference on World Wide Web, 2011. ACM, pp 607–614 Suri S, Vassilvitskii S (2011) Counting triangles and the curse of the last reducer. In: Proceedings of the 20th International Conference on World Wide Web, 2011. ACM, pp 607–614
11.
Zurück zum Zitat Chung FR, Lu L (2006) Complex graphs and networks, vol 107. American Mathematical Society, ProvidenceMATH Chung FR, Lu L (2006) Complex graphs and networks, vol 107. American Mathematical Society, ProvidenceMATH
12.
Zurück zum Zitat Milo R, Shen-Orr S, Itzkovitz S, Kashtan N, Chklovskii D, Alon U (2002) Network motifs: simple building blocks of complex networks. Science 298(5594):824–827CrossRef Milo R, Shen-Orr S, Itzkovitz S, Kashtan N, Chklovskii D, Alon U (2002) Network motifs: simple building blocks of complex networks. Science 298(5594):824–827CrossRef
13.
Zurück zum Zitat Kwak H, Lee C, Park H, Moon S (2010) What is Twitter, a social network or a news media? In: Proceedings of the 19th International Conference on World Wide Web, 2010. ACM, pp 591–600 Kwak H, Lee C, Park H, Moon S (2010) What is Twitter, a social network or a news media? In: Proceedings of the 19th International Conference on World Wide Web, 2010. ACM, pp 591–600
14.
Zurück zum Zitat Ye P, Peyser BD, Spencer FA, Bader JS (2005) Commensurate distances and similar motifs in genetic congruence and protein interaction networks in yeast. BMC Bioinform 6(1):270CrossRef Ye P, Peyser BD, Spencer FA, Bader JS (2005) Commensurate distances and similar motifs in genetic congruence and protein interaction networks in yeast. BMC Bioinform 6(1):270CrossRef
15.
Zurück zum Zitat White T (2012) Hadoop: the definitive guide. O’Reilly Media, Newton White T (2012) Hadoop: the definitive guide. O’Reilly Media, Newton
16.
Zurück zum Zitat Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I (2010) Spark: cluster computing with working sets. HotCloud 10(10–10):95 Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I (2010) Spark: cluster computing with working sets. HotCloud 10(10–10):95
17.
Zurück zum Zitat Meng X, Bradley J, Yavuz B, Sparks E, Venkataraman S, Liu D, Freeman J, Tsai D, Amde M, Owen S (2016) Mllib: machine learning in apache spark. J Mach Learn Res 17(34):1–7MATHMathSciNet Meng X, Bradley J, Yavuz B, Sparks E, Venkataraman S, Liu D, Freeman J, Tsai D, Amde M, Owen S (2016) Mllib: machine learning in apache spark. J Mach Learn Res 17(34):1–7MATHMathSciNet
18.
Zurück zum Zitat Chen J, Li K, Tang Z, Bilal K, Yu S, Weng C, Li K (2017) A parallel random forest algorithm for big data in a Spark cloud computing environment. IEEE Trans Parallel Distrib Syst 28(4):919–933 Chen J, Li K, Tang Z, Bilal K, Yu S, Weng C, Li K (2017) A parallel random forest algorithm for big data in a Spark cloud computing environment. IEEE Trans Parallel Distrib Syst 28(4):919–933
19.
Zurück zum Zitat Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113CrossRef Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113CrossRef
20.
Zurück zum Zitat Malewicz G, Austern MH, Bik AJ, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, 2010. ACM, pp 135–146 Malewicz G, Austern MH, Bik AJ, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, 2010. ACM, pp 135–146
21.
Zurück zum Zitat Quamar A, Deshpande A, Lin J (2016) NScale: neighborhood-centric large-scale graph analytics in the cloud. VLDB J 25(2):125–150CrossRef Quamar A, Deshpande A, Lin J (2016) NScale: neighborhood-centric large-scale graph analytics in the cloud. VLDB J 25(2):125–150CrossRef
22.
Zurück zum Zitat Low Y, Bickson D, Gonzalez J, Guestrin C, Kyrola A, Hellerstein JM (2012) Distributed GraphLab: a framework for machine learning and data mining in the cloud. Proc VLDB Endow 5(8):716–727CrossRef Low Y, Bickson D, Gonzalez J, Guestrin C, Kyrola A, Hellerstein JM (2012) Distributed GraphLab: a framework for machine learning and data mining in the cloud. Proc VLDB Endow 5(8):716–727CrossRef
23.
Zurück zum Zitat Gonzalez JE, Xin RS, Dave A, Crankshaw D, Franklin MJ, Stoica I (2014) GraphX: graph processing in a distributed dataflow framework. In: OSDI, 2014, pp 599–613 Gonzalez JE, Xin RS, Dave A, Crankshaw D, Franklin MJ, Stoica I (2014) GraphX: graph processing in a distributed dataflow framework. In: OSDI, 2014, pp 599–613
24.
Zurück zum Zitat Quamar A, Deshpande A (2016) NScaleSpark: subgraph-centric graph analytics on Apache Spark. In: Proceedings of the 1st ACM SIGMOD Workshop on Network Data Analytics, 2016. ACM, p 5 Quamar A, Deshpande A (2016) NScaleSpark: subgraph-centric graph analytics on Apache Spark. In: Proceedings of the 1st ACM SIGMOD Workshop on Network Data Analytics, 2016. ACM, p 5
25.
Zurück zum Zitat Soffer SN, Vazquez A (2005) Network clustering coefficient without degree-correlation biases. Phys Rev E 71(5):057101CrossRef Soffer SN, Vazquez A (2005) Network clustering coefficient without degree-correlation biases. Phys Rev E 71(5):057101CrossRef
27.
Zurück zum Zitat Ortmann M, Brandes U (2014) Triangle listing algorithms: back from the diversion. In: 2014 Proceedings of the Sixteenth Workshop on Algorithm Engineering and Experiments (ALENEX), 2014. SIAM, pp 1–8 Ortmann M, Brandes U (2014) Triangle listing algorithms: back from the diversion. In: 2014 Proceedings of the Sixteenth Workshop on Algorithm Engineering and Experiments (ALENEX), 2014. SIAM, pp 1–8
28.
Zurück zum Zitat Schank T (2007) Algorithmic aspects of triangle-based network analysis. Dissertation, University Karlsruhe Schank T (2007) Algorithmic aspects of triangle-based network analysis. Dissertation, University Karlsruhe
29.
Zurück zum Zitat Schank T, Wagner D (2005) counting and listing all triangles in large graphs, an experimental study. In: International Workshop on Experimental and Efficient Algorithms, 2005. Springer, pp 606–609 Schank T, Wagner D (2005) counting and listing all triangles in large graphs, an experimental study. In: International Workshop on Experimental and Efficient Algorithms, 2005. Springer, pp 606–609
30.
Zurück zum Zitat Latapy M (2008) Main-memory triangle computations for very large (sparse (power-law)) graphs. Theor Comput Sci 407(1–3):458–473CrossRefMATHMathSciNet Latapy M (2008) Main-memory triangle computations for very large (sparse (power-law)) graphs. Theor Comput Sci 407(1–3):458–473CrossRefMATHMathSciNet
32.
Zurück zum Zitat Arifuzzaman S, Khan M, Marathe M (2013) PATRIC: a parallel algorithm for counting triangles in massive networks. In: Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, 2013. ACM, pp 529–538 Arifuzzaman S, Khan M, Marathe M (2013) PATRIC: a parallel algorithm for counting triangles in massive networks. In: Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, 2013. ACM, pp 529–538
33.
Zurück zum Zitat Cohen J (2009) Graph twiddling in a mapreduce world. Comput Sci Eng 11(4):29–41CrossRef Cohen J (2009) Graph twiddling in a mapreduce world. Comput Sci Eng 11(4):29–41CrossRef
34.
Zurück zum Zitat Park H-M, Silvestri F, Kang U, Pagh R (2014) Mapreduce triangle enumeration with guarantees. In: Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, 2014. ACM, pp 1739–1748 Park H-M, Silvestri F, Kang U, Pagh R (2014) Mapreduce triangle enumeration with guarantees. In: Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, 2014. ACM, pp 1739–1748
35.
Zurück zum Zitat Park H-M, Chung C-W (2013) An efficient MapReduce algorithm for counting triangles in a very large graph. In: Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, 2013. ACM, pp 539–548 Park H-M, Chung C-W (2013) An efficient MapReduce algorithm for counting triangles in a very large graph. In: Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, 2013. ACM, pp 539–548
37.
Zurück zum Zitat Gonzalez JE, Low Y, Gu H, Bickson D, Guestrin C (2012) PowerGraph: distributed graph-parallel computation on natural graphs. In: OSDI, 2012, vol 1, p 2 Gonzalez JE, Low Y, Gu H, Bickson D, Guestrin C (2012) PowerGraph: distributed graph-parallel computation on natural graphs. In: OSDI, 2012, vol 1, p 2
38.
Zurück zum Zitat Quick L, Wilkinson P, Hardcastle D (2012) Using pregel-like large scale graph processing frameworks for social network analysis. In: Proceedings of the 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012), 2012. IEEE Computer Society, pp 457–463 Quick L, Wilkinson P, Hardcastle D (2012) Using pregel-like large scale graph processing frameworks for social network analysis. In: Proceedings of the 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012), 2012. IEEE Computer Society, pp 457–463
40.
Zurück zum Zitat Yang J, Leskovec J (2015) Defining and evaluating network communities based on ground-truth. Knowl Inf Syst 42(1):181–213CrossRef Yang J, Leskovec J (2015) Defining and evaluating network communities based on ground-truth. Knowl Inf Syst 42(1):181–213CrossRef
41.
Zurück zum Zitat Backstrom L, Huttenlocher D, Kleinberg J, Lan X (2006) Group formation in large social networks: membership, growth, and evolution. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006. ACM, pp 44–54 Backstrom L, Huttenlocher D, Kleinberg J, Lan X (2006) Group formation in large social networks: membership, growth, and evolution. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006. ACM, pp 44–54
42.
Zurück zum Zitat Cha M, Haddadi H, Benevenuto F, Gummadi PK (2010) Measuring user influence in twitter: the million follower fallacy. ICWSM 10(10–17):30 Cha M, Haddadi H, Benevenuto F, Gummadi PK (2010) Measuring user influence in twitter: the million follower fallacy. ICWSM 10(10–17):30
Metadaten
Titel
CCFinder: using Spark to find clustering coefficient in big graphs
verfasst von
Mehdi Alemi
Hassan Haghighi
Saeed Shahrivari
Publikationsdatum
12.04.2017
Verlag
Springer US
Erschienen in
The Journal of Supercomputing / Ausgabe 11/2017
Print ISSN: 0920-8542
Elektronische ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-017-2040-8

Weitere Artikel der Ausgabe 11/2017

The Journal of Supercomputing 11/2017 Zur Ausgabe