ABSTRACT
Vertex-centric graph processing is employed by many popular algorithms (e.g., PageRank) due to its simplicity and efficient use of asynchronous parallelism. The high compute power of SIMT architectures presents an opportunity to accelerate these algorithms on GPUs. Prior work on GPU graph processing employs the Compressed Sparse Row (CSR) format for its space efficiency; however, CSR suffers from irregular memory accesses and GPU underutilization that limit its performance. In this paper, we present CuSha, a CUDA-based graph processing framework that overcomes these obstacles through two novel graph representations: G-Shards and Concatenated Windows (CW). G-Shards adapts a concept recently introduced for non-GPU systems that organizes a graph into autonomous sets of ordered edges called shards. CuSha's mapping of GPU hardware resources onto shards enables fully coalesced memory accesses. CW is a novel representation that builds on shards to achieve higher GPU utilization when processing sparse graphs. Finally, CuSha fully utilizes the GPU by processing multiple shards in parallel on the GPU's streaming multiprocessors. For ease of programming, CuSha lets the user define a vertex-centric computation and plug it into the framework for parallel processing of large graphs. Our experiments show that CuSha provides significant speedups over the state-of-the-art CSR-based virtual warp-centric method for processing graphs on GPUs.