nach oben

The Journal of Supercomputing

Erschienen in:

01.09.2014

Exploiting fine-grained parallelism in graph traversal algorithms via lock virtualization on multi-core architecture

verfasst von: Jie Yan, Guangming Tan, Ninghui Sun

Erschienen in: The Journal of Supercomputing | Ausgabe 3/2014

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config

KI-gestützte Suche

Aus

Abstract

Traversal is a fundamental procedure in most parallel graph algorithms. To explore the massive fine-grained parallelism in graph traversal, the fine-grained data synchronization is critical. On commodity multi-core processors, the widely adopted solution is fine-grained locks (i.e., one lock per vertex). However, in emerging graph analytics of massive irregular graphs (e.g., social network and web graph), it suffers huge memory cost and poor locality due to the large-scale vertex set and inherent random vertex access. In this paper, we propose a novel fine-grained lock mechanism—lock virtualization (vLock). The key idea is to map the huge logical lock space to a small fixed physical lock space that can reside in cache during runtime. The virtualization mechanism effectively reduces lock incurred extra memory cost and cache misses with only a slight increase of lock conflict rate, while it preserves high portability for legacy codes by providing Pthreads-like application programming interface. Our further analysis reveals that given the random access pattern, the lock conflict rate is no longer related to the size of vertex set but only the numbers of both physical locks and parallel threads, thus vLock is independent from graph topologies. This paper presents a complete description of the vLock method as well as its theoretic foundation. We implemented an vLock library and evaluated its performance in four classic graph traversal algorithms (BFS, SSSP, CC, PageRank). Experiments on the Intel Xeon E5 eight-core processor show that, compared to Pthreads fine-grained locks, vLock significantly reduces lock’s cache misses and achieves 4–20 % performance improvement.

Vorheriger Artikel Novel resource allocation algorithms to performance and energy efficiency in cloud computing

Nächster Artikel Optimizing memory access traffic via runtime thread migration for on-chip distributed memory systems

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Nur mit Berechtigung zugänglich

Similarly, \(\mathbb {O}\) can be associated with edges and uniquely indexed by edge \(id\). In this paper, we do not concern this case.

Note that the side effect of primitive \(trylock\) is different from that in traditional fine-grained locks. In traditional fine locks, returning failure implies some other thread is operating on \(v\) while in vLock it does not.

This is sometimes decided by compiling techniques. Intel C Compiler can automatically reorder instructions to hide the latency of \(DIV\) well, while GNU C Complier cannot.

In analysis of Sects. 4.2–4.4, we assume that the random lock access is further uniform. In practice, this assumption may not hold for virtual locks although it typically holds for physical locks, which should be considered in cases requiring more accurate analysis.

For BFS and SSSP, normalized performance of each run is first calculated and then used to compute their harmonic mean and deviation. For CC and PageRank, however, the mean and standard deviation of runtime in all 16 runs are first calculated and then handled with normalization.

The size of physical lock space should be a prime number for hash by address and be power of 2 for hash by vertex.

Asanovic K, Bodik R, Demmel J, Keaveny T, Keutzer K, Kubiatowicz J et al (2009) A view of the parallel computing landscape. In: Proceedings of communications of the ACM, p 52

National Research Council (2013) Frontiers in massive data analysis. The National Academies Press, Washington

Lumsdaine A, Gregor D, Hendrickson B, Berry JW (2007) Challenges in parallel graph processing. Parallel Process Lett 7(1):5–20MathSciNetCrossRef

Malewicz G, Austern M, Bik A, Dehnert J, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceeding of SIGMOD’10, Indianapolis, USA, 2010

Gonzalez JE, Low Y, Gu H, Bickson D, Guestrin C (2012) PowerGraph: distributed graph-parallel computation on natural graphs. In: Proceedings of the 10th OSDI, Hollywood, Oct 2012

Shun J, Belloch G (2013) Ligra: a lightweight graph processing framework for shared memory. In: Proceeding of PPoPP’13, Shenzhen, China, Feb 2013

Gregor D, Lumsdaine A (2005) The parallel BGL: a generic library for distributed graph computations. In: Proceeding of POOSC’05, Bloomington, July 2005

Graph500 benchmark (2010). http://www.graph500.org. Accessed 23 June 2014

Bader D, Feo J, Gilbert J, Kepner J, Koester D, Loh E, Madduri K, Mann W, Meuse T (2007) HPCS scalable synthetic compact applications #2 graph analysis (ssca#2 v2.2 specification), Sep 2007

10.

Leskovec J, Lang K, Dasgupta A, Mahoney W (2008) Community structure in large networks: natural cluster sizes and the absence of large well-defined clusters. Internet Math 6(1):29–123MathSciNetCrossRef

11.

Bader DA, Madduri K (2008) SNAP, small-world network analysis and partitioning: an open-source parallel graph framework for the exploration of large-scale networks. In: Proceedings of IPDPS, pp 1–12

12.

Tu D, Tan G (2009) Characterizing betweenness centrality algorithm on multi-core architectures. In: Proceedings of international symposium on parallel and distributed processing with applications, pp 182–189

13.

Cray XMT (2014). http://www.cray.com/Assets/PDF/products/xmt/CrayXMTBrochure.pdf. Accessed 23 June 2014

14.

Zhu W, Sreedhar VC, Hu Z, Gao GR (2007) Synchronization state buffer: supporting efficient fine-grain synchronization on many-core architectures. In: Proceedings of ISCA ’07 San Diego, USA, 2007

15.

Fraser K, Harris T (2007) Concurrent programming without locks. In: Proceedings of ACM transactions on computer systems, vol 25, issue 2, May 2007

16.

Michael MM, Scott ML (1996) Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In: Proceedings of the 15th annual ACM symposium on principles of distributed computing, PODC ’96

17.

Herlihy M, Moss JEB (1993) Transactional memory: architectural support for lock-free data structures. In: Proceedings of the 20th annual international symposium on computer architecture, ISCA ’93

18.

Harris T, Larus J, Rajwar R (2010) Transactional memory. In: Proceedings of synthesis lectures on computer architecture. Morgan & Claypool Publisher, San Rafael, USA, June 2010

19.

Kulkarni M, Pingali K, Walter B, Ramanarayanan G, Bala K, Chew LP (2007) Optimistic parallelism requires abstractions. In: Proceedings of PLDI ’07, vol 7, San Diego, USA, June 2007

20.

Hammond L, Wong V, Chen M, Carlstrom BD, Davis JD, Hertzberg B, Prabhu MK, Wijaya H, Kozyrakis C, Olukotun K (2004) Transactional memory coherence and consistency. In: Proceedings of the 31st annual international symposium on computer architecture, ISCA ’04

21.

Yan J, Tan G, Zhang X, Yao E, Sun N (2013) vLock: lock virtualization mechanism for exploiting fine-grained parallelism in graph traversal algorithms. In: Proceedings of IEEE/ACM symposium on code generation and optimization (CGO ’13), pp 141–150

22.

Intel Corporation (2014) Intel 64 and IA-32 architectures optimization reference manual. pp c1–c26

23.

Bertsekas DP, Guerriero F, Musmanno R (1996) Parallel asynchronous label correcting methods for shortest paths. J Optim Theory Appl 88(2):297–320

24.

Pearce R, Gokhale M, Amato NM (2010) Multithreaded asynchronous graph traversal for in-memory and semi-external memory. In: Proceeding of SC’10, New Orleans, Louisiana, USA, Nov 2010

25.

Shiloach Y, Vishkin U (1982) An o(log n) parallel connectivity algorithm. J Algorithms 3:57–67MathSciNetCrossRefMATH

26.

Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. In: Proceedings of computer networks and ISDN systems. Elsevier Science Publishers, Oxford, UK pp 107–117

27.

Chakrabarti D, Zhan Y, Faloutsos C (2004) R-MAT: a recursive model for graph mining. In: Proceedings of SDM’04, Toronto, Canada, August 2004

28.

Kwak H, Lee C, Park H, Moon S (2010) What is Twitter, a social network or a news media?. In: Proceedings of www’10, Raleigh, NC, USA pp 591–600

29.

Boldi P, Codenotti B, Santini M, Vigna S (2008) A large time-aware graph. SIGIR Forum 42(2):33–38CrossRef

30.

LAWS datasets (2014). http://law.di.unimi.it/datasets.php. Accessed 23 June 2014

31.

Yan J, Tan G, Sun N (2013) Graphine: programming graph-parallel computation of large natural graphs on multicore cluster, technical report, ICT-HPC-2013-2

32.

UPC Consortium (2013) UPC language and library specifications v1.3, Lawrence Berkeley National Lab, technical report LBNL-6623E, Nov 2013

33.

Akgul BE, Mooney VJ (2002) The system-on-a-chip Lock Cache. Int J Des Autom Embed Syst 7:139–174CrossRefMATH

34.

Feo J, Harper D, Kahan S, Konecny P (2005) Eldorado. In: Proceedings of the 2nd conference on computing frontiers, CF ’05, New York, NY, USA, pp 28–34

35.

Steffan JG, Colohan CB, Zhai A, Mowry TC (2000) A scalable approach to thread-level speculation. In: Proceedings of the 27th annual international symposium on computer architecture, ISCA ’00

36.

Afek Y, Dauber D, Touitou D (1995) Wait-free made fast. In: Proceedings of the twenty-seventh annual ACM symposium on theory of computing (STOC ’95)

37.

Valois JD (1995) Lock-free linked lists using compare-and-swap. In: Proceedings of the 14th annual ACM symposium on principles of distributed computing, PODC ’95

38.

Michael MM (2002) High performance dynamic lock-free hash tables and list-based sets. In: Proceedings of the 14th annual ACM symposium on parallel algorithms and architectures, SPAA ’02

Titel: Exploiting fine-grained parallelism in graph traversal algorithms via lock virtualization on multi-core architecture
verfasst von: Jie Yan
Guangming Tan
Ninghui Sun
Publikationsdatum: 01.09.2014
Verlag: Springer US
Erschienen in: The Journal of Supercomputing / Ausgabe 3/2014
Print ISSN: 0920-8542
Elektronische ISSN: 1573-0484
DOI: https://doi.org/10.1007/s11227-014-1239-1

Springer Professional

Abstract

Bitte loggen Sie sich ein, um Zugang zu Ihrer Lizenz zu erhalten.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft"

Springer Professional "Technik"

Springer Professional "Wirtschaft+Technik"

Weitere Artikel der Ausgabe 3/2014

On private Hamming distance computation

Performance analysis based resource allocation for green cloud computing

Novel resource allocation algorithms to performance and energy efficiency in cloud computing

An energy conservation replica placement strategy for Dynamo

Achieving spilling-friendly register file assignment for highly distributed register files

Optimizing memory access traffic via runtime thread migration for on-chip distributed memory systems