Skip to main content
Erschienen in: The Journal of Supercomputing 3/2014

01.09.2014

Exploiting fine-grained parallelism in graph traversal algorithms via lock virtualization on multi-core architecture

verfasst von: Jie Yan, Guangming Tan, Ninghui Sun

Erschienen in: The Journal of Supercomputing | Ausgabe 3/2014

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Traversal is a fundamental procedure in most parallel graph algorithms. To explore the massive fine-grained parallelism in graph traversal, the fine-grained data synchronization is critical. On commodity multi-core processors, the widely adopted solution is fine-grained locks (i.e., one lock per vertex). However, in emerging graph analytics of massive irregular graphs (e.g., social network and web graph), it suffers huge memory cost and poor locality due to the large-scale vertex set and inherent random vertex access. In this paper, we propose a novel fine-grained lock mechanism—lock virtualization (vLock). The key idea is to map the huge logical lock space to a small fixed physical lock space that can reside in cache during runtime. The virtualization mechanism effectively reduces lock incurred extra memory cost and cache misses with only a slight increase of lock conflict rate, while it preserves high portability for legacy codes by providing Pthreads-like application programming interface. Our further analysis reveals that given the random access pattern, the lock conflict rate is no longer related to the size of vertex set but only the numbers of both physical locks and parallel threads, thus vLock is independent from graph topologies. This paper presents a complete description of the vLock method as well as its theoretic foundation. We implemented an vLock library and evaluated its performance in four classic graph traversal algorithms (BFS, SSSP, CC, PageRank). Experiments on the Intel Xeon E5 eight-core processor show that, compared to Pthreads fine-grained locks, vLock significantly reduces lock’s cache misses and achieves 4–20 % performance improvement.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Anhänge
Nur mit Berechtigung zugänglich
Fußnoten
1
Similarly, \(\mathbb {O}\) can be associated with edges and uniquely indexed by edge \(id\). In this paper, we do not concern this case.
 
2
Note that the side effect of primitive \(trylock\) is different from that in traditional fine-grained locks. In traditional fine locks, returning failure implies some other thread is operating on \(v\) while in vLock it does not.
 
3
This is sometimes decided by compiling techniques. Intel C Compiler can automatically reorder instructions to hide the latency of \(DIV\) well, while GNU C Complier cannot.
 
4
In analysis of Sects. 4.24.4, we assume that the random lock access is further uniform. In practice, this assumption may not hold for virtual locks although it typically holds for physical locks, which should be considered in cases requiring more accurate analysis.
 
5
For BFS and SSSP, normalized performance of each run is first calculated and then used to compute their harmonic mean and deviation. For CC and PageRank, however, the mean and standard deviation of runtime in all 16 runs are first calculated and then handled with normalization.
 
6
The size of physical lock space should be a prime number for hash by address and be power of 2 for hash by vertex.
 
Literatur
1.
Zurück zum Zitat Asanovic K, Bodik R, Demmel J, Keaveny T, Keutzer K, Kubiatowicz J et al (2009) A view of the parallel computing landscape. In: Proceedings of communications of the ACM, p 52 Asanovic K, Bodik R, Demmel J, Keaveny T, Keutzer K, Kubiatowicz J et al (2009) A view of the parallel computing landscape. In: Proceedings of communications of the ACM, p 52
2.
Zurück zum Zitat National Research Council (2013) Frontiers in massive data analysis. The National Academies Press, Washington National Research Council (2013) Frontiers in massive data analysis. The National Academies Press, Washington
3.
Zurück zum Zitat Lumsdaine A, Gregor D, Hendrickson B, Berry JW (2007) Challenges in parallel graph processing. Parallel Process Lett 7(1):5–20MathSciNetCrossRef Lumsdaine A, Gregor D, Hendrickson B, Berry JW (2007) Challenges in parallel graph processing. Parallel Process Lett 7(1):5–20MathSciNetCrossRef
4.
Zurück zum Zitat Malewicz G, Austern M, Bik A, Dehnert J, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceeding of SIGMOD’10, Indianapolis, USA, 2010 Malewicz G, Austern M, Bik A, Dehnert J, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceeding of SIGMOD’10, Indianapolis, USA, 2010
5.
Zurück zum Zitat Gonzalez JE, Low Y, Gu H, Bickson D, Guestrin C (2012) PowerGraph: distributed graph-parallel computation on natural graphs. In: Proceedings of the 10th OSDI, Hollywood, Oct 2012 Gonzalez JE, Low Y, Gu H, Bickson D, Guestrin C (2012) PowerGraph: distributed graph-parallel computation on natural graphs. In: Proceedings of the 10th OSDI, Hollywood, Oct 2012
6.
Zurück zum Zitat Shun J, Belloch G (2013) Ligra: a lightweight graph processing framework for shared memory. In: Proceeding of PPoPP’13, Shenzhen, China, Feb 2013 Shun J, Belloch G (2013) Ligra: a lightweight graph processing framework for shared memory. In: Proceeding of PPoPP’13, Shenzhen, China, Feb 2013
7.
Zurück zum Zitat Gregor D, Lumsdaine A (2005) The parallel BGL: a generic library for distributed graph computations. In: Proceeding of POOSC’05, Bloomington, July 2005 Gregor D, Lumsdaine A (2005) The parallel BGL: a generic library for distributed graph computations. In: Proceeding of POOSC’05, Bloomington, July 2005
9.
Zurück zum Zitat Bader D, Feo J, Gilbert J, Kepner J, Koester D, Loh E, Madduri K, Mann W, Meuse T (2007) HPCS scalable synthetic compact applications #2 graph analysis (ssca#2 v2.2 specification), Sep 2007 Bader D, Feo J, Gilbert J, Kepner J, Koester D, Loh E, Madduri K, Mann W, Meuse T (2007) HPCS scalable synthetic compact applications #2 graph analysis (ssca#2 v2.2 specification), Sep 2007
10.
Zurück zum Zitat Leskovec J, Lang K, Dasgupta A, Mahoney W (2008) Community structure in large networks: natural cluster sizes and the absence of large well-defined clusters. Internet Math 6(1):29–123MathSciNetCrossRef Leskovec J, Lang K, Dasgupta A, Mahoney W (2008) Community structure in large networks: natural cluster sizes and the absence of large well-defined clusters. Internet Math 6(1):29–123MathSciNetCrossRef
11.
Zurück zum Zitat Bader DA, Madduri K (2008) SNAP, small-world network analysis and partitioning: an open-source parallel graph framework for the exploration of large-scale networks. In: Proceedings of IPDPS, pp 1–12 Bader DA, Madduri K (2008) SNAP, small-world network analysis and partitioning: an open-source parallel graph framework for the exploration of large-scale networks. In: Proceedings of IPDPS, pp 1–12
12.
Zurück zum Zitat Tu D, Tan G (2009) Characterizing betweenness centrality algorithm on multi-core architectures. In: Proceedings of international symposium on parallel and distributed processing with applications, pp 182–189 Tu D, Tan G (2009) Characterizing betweenness centrality algorithm on multi-core architectures. In: Proceedings of international symposium on parallel and distributed processing with applications, pp 182–189
14.
Zurück zum Zitat Zhu W, Sreedhar VC, Hu Z, Gao GR (2007) Synchronization state buffer: supporting efficient fine-grain synchronization on many-core architectures. In: Proceedings of ISCA ’07 San Diego, USA, 2007 Zhu W, Sreedhar VC, Hu Z, Gao GR (2007) Synchronization state buffer: supporting efficient fine-grain synchronization on many-core architectures. In: Proceedings of ISCA ’07 San Diego, USA, 2007
15.
Zurück zum Zitat Fraser K, Harris T (2007) Concurrent programming without locks. In: Proceedings of ACM transactions on computer systems, vol 25, issue 2, May 2007 Fraser K, Harris T (2007) Concurrent programming without locks. In: Proceedings of ACM transactions on computer systems, vol 25, issue 2, May 2007
16.
Zurück zum Zitat Michael MM, Scott ML (1996) Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In: Proceedings of the 15th annual ACM symposium on principles of distributed computing, PODC ’96 Michael MM, Scott ML (1996) Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In: Proceedings of the 15th annual ACM symposium on principles of distributed computing, PODC ’96
17.
Zurück zum Zitat Herlihy M, Moss JEB (1993) Transactional memory: architectural support for lock-free data structures. In: Proceedings of the 20th annual international symposium on computer architecture, ISCA ’93 Herlihy M, Moss JEB (1993) Transactional memory: architectural support for lock-free data structures. In: Proceedings of the 20th annual international symposium on computer architecture, ISCA ’93
18.
Zurück zum Zitat Harris T, Larus J, Rajwar R (2010) Transactional memory. In: Proceedings of synthesis lectures on computer architecture. Morgan & Claypool Publisher, San Rafael, USA, June 2010 Harris T, Larus J, Rajwar R (2010) Transactional memory. In: Proceedings of synthesis lectures on computer architecture. Morgan & Claypool Publisher, San Rafael, USA, June 2010
19.
Zurück zum Zitat Kulkarni M, Pingali K, Walter B, Ramanarayanan G, Bala K, Chew LP (2007) Optimistic parallelism requires abstractions. In: Proceedings of PLDI ’07, vol 7, San Diego, USA, June 2007 Kulkarni M, Pingali K, Walter B, Ramanarayanan G, Bala K, Chew LP (2007) Optimistic parallelism requires abstractions. In: Proceedings of PLDI ’07, vol 7, San Diego, USA, June 2007
20.
Zurück zum Zitat Hammond L, Wong V, Chen M, Carlstrom BD, Davis JD, Hertzberg B, Prabhu MK, Wijaya H, Kozyrakis C, Olukotun K (2004) Transactional memory coherence and consistency. In: Proceedings of the 31st annual international symposium on computer architecture, ISCA ’04 Hammond L, Wong V, Chen M, Carlstrom BD, Davis JD, Hertzberg B, Prabhu MK, Wijaya H, Kozyrakis C, Olukotun K (2004) Transactional memory coherence and consistency. In: Proceedings of the 31st annual international symposium on computer architecture, ISCA ’04
21.
Zurück zum Zitat Yan J, Tan G, Zhang X, Yao E, Sun N (2013) vLock: lock virtualization mechanism for exploiting fine-grained parallelism in graph traversal algorithms. In: Proceedings of IEEE/ACM symposium on code generation and optimization (CGO ’13), pp 141–150 Yan J, Tan G, Zhang X, Yao E, Sun N (2013) vLock: lock virtualization mechanism for exploiting fine-grained parallelism in graph traversal algorithms. In: Proceedings of IEEE/ACM symposium on code generation and optimization (CGO ’13), pp 141–150
22.
Zurück zum Zitat Intel Corporation (2014) Intel 64 and IA-32 architectures optimization reference manual. pp c1–c26 Intel Corporation (2014) Intel 64 and IA-32 architectures optimization reference manual. pp c1–c26
23.
Zurück zum Zitat Bertsekas DP, Guerriero F, Musmanno R (1996) Parallel asynchronous label correcting methods for shortest paths. J Optim Theory Appl 88(2):297–320 Bertsekas DP, Guerriero F, Musmanno R (1996) Parallel asynchronous label correcting methods for shortest paths. J Optim Theory Appl 88(2):297–320
24.
Zurück zum Zitat Pearce R, Gokhale M, Amato NM (2010) Multithreaded asynchronous graph traversal for in-memory and semi-external memory. In: Proceeding of SC’10, New Orleans, Louisiana, USA, Nov 2010 Pearce R, Gokhale M, Amato NM (2010) Multithreaded asynchronous graph traversal for in-memory and semi-external memory. In: Proceeding of SC’10, New Orleans, Louisiana, USA, Nov 2010
26.
Zurück zum Zitat Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. In: Proceedings of computer networks and ISDN systems. Elsevier Science Publishers, Oxford, UK pp 107–117 Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. In: Proceedings of computer networks and ISDN systems. Elsevier Science Publishers, Oxford, UK pp 107–117
27.
Zurück zum Zitat Chakrabarti D, Zhan Y, Faloutsos C (2004) R-MAT: a recursive model for graph mining. In: Proceedings of SDM’04, Toronto, Canada, August 2004 Chakrabarti D, Zhan Y, Faloutsos C (2004) R-MAT: a recursive model for graph mining. In: Proceedings of SDM’04, Toronto, Canada, August 2004
28.
Zurück zum Zitat Kwak H, Lee C, Park H, Moon S (2010) What is Twitter, a social network or a news media?. In: Proceedings of www’10, Raleigh, NC, USA pp 591–600 Kwak H, Lee C, Park H, Moon S (2010) What is Twitter, a social network or a news media?. In: Proceedings of www’10, Raleigh, NC, USA pp 591–600
29.
Zurück zum Zitat Boldi P, Codenotti B, Santini M, Vigna S (2008) A large time-aware graph. SIGIR Forum 42(2):33–38CrossRef Boldi P, Codenotti B, Santini M, Vigna S (2008) A large time-aware graph. SIGIR Forum 42(2):33–38CrossRef
31.
Zurück zum Zitat Yan J, Tan G, Sun N (2013) Graphine: programming graph-parallel computation of large natural graphs on multicore cluster, technical report, ICT-HPC-2013-2 Yan J, Tan G, Sun N (2013) Graphine: programming graph-parallel computation of large natural graphs on multicore cluster, technical report, ICT-HPC-2013-2
32.
Zurück zum Zitat UPC Consortium (2013) UPC language and library specifications v1.3, Lawrence Berkeley National Lab, technical report LBNL-6623E, Nov 2013 UPC Consortium (2013) UPC language and library specifications v1.3, Lawrence Berkeley National Lab, technical report LBNL-6623E, Nov 2013
33.
Zurück zum Zitat Akgul BE, Mooney VJ (2002) The system-on-a-chip Lock Cache. Int J Des Autom Embed Syst 7:139–174CrossRefMATH Akgul BE, Mooney VJ (2002) The system-on-a-chip Lock Cache. Int J Des Autom Embed Syst 7:139–174CrossRefMATH
34.
Zurück zum Zitat Feo J, Harper D, Kahan S, Konecny P (2005) Eldorado. In: Proceedings of the 2nd conference on computing frontiers, CF ’05, New York, NY, USA, pp 28–34 Feo J, Harper D, Kahan S, Konecny P (2005) Eldorado. In: Proceedings of the 2nd conference on computing frontiers, CF ’05, New York, NY, USA, pp 28–34
35.
Zurück zum Zitat Steffan JG, Colohan CB, Zhai A, Mowry TC (2000) A scalable approach to thread-level speculation. In: Proceedings of the 27th annual international symposium on computer architecture, ISCA ’00 Steffan JG, Colohan CB, Zhai A, Mowry TC (2000) A scalable approach to thread-level speculation. In: Proceedings of the 27th annual international symposium on computer architecture, ISCA ’00
36.
Zurück zum Zitat Afek Y, Dauber D, Touitou D (1995) Wait-free made fast. In: Proceedings of the twenty-seventh annual ACM symposium on theory of computing (STOC ’95) Afek Y, Dauber D, Touitou D (1995) Wait-free made fast. In: Proceedings of the twenty-seventh annual ACM symposium on theory of computing (STOC ’95)
37.
Zurück zum Zitat Valois JD (1995) Lock-free linked lists using compare-and-swap. In: Proceedings of the 14th annual ACM symposium on principles of distributed computing, PODC ’95 Valois JD (1995) Lock-free linked lists using compare-and-swap. In: Proceedings of the 14th annual ACM symposium on principles of distributed computing, PODC ’95
38.
Zurück zum Zitat Michael MM (2002) High performance dynamic lock-free hash tables and list-based sets. In: Proceedings of the 14th annual ACM symposium on parallel algorithms and architectures, SPAA ’02 Michael MM (2002) High performance dynamic lock-free hash tables and list-based sets. In: Proceedings of the 14th annual ACM symposium on parallel algorithms and architectures, SPAA ’02
Metadaten
Titel
Exploiting fine-grained parallelism in graph traversal algorithms via lock virtualization on multi-core architecture
verfasst von
Jie Yan
Guangming Tan
Ninghui Sun
Publikationsdatum
01.09.2014
Verlag
Springer US
Erschienen in
The Journal of Supercomputing / Ausgabe 3/2014
Print ISSN: 0920-8542
Elektronische ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-014-1239-1

Weitere Artikel der Ausgabe 3/2014

The Journal of Supercomputing 3/2014 Zur Ausgabe