ABSTRACT
Vertex-centric graph processing is employed by many popular algorithms (e.g., PageRank) due to its simplicity and efficient use of asynchronous parallelism. The high compute power of SIMT architectures presents an opportunity to accelerate these algorithms on GPUs. Prior work on GPU graph processing employs the Compressed Sparse Row (CSR) format for its space efficiency; however, CSR suffers from irregular memory accesses and GPU underutilization that limit its performance. In this paper, we present CuSha, a CUDA-based graph processing framework that overcomes these obstacles through two novel graph representations: G-Shards and Concatenated Windows (CW). G-Shards adapts a concept recently introduced for non-GPU systems that organizes a graph into autonomous sets of ordered edges called shards. CuSha's mapping of GPU hardware resources onto shards enables fully coalesced memory accesses. CW is a novel representation that builds on shards to achieve higher GPU utilization when processing sparse graphs. Finally, CuSha fully utilizes the GPU by processing multiple shards in parallel on the GPU's streaming multiprocessors. For ease of programming, CuSha lets the user define a vertex-centric computation and plug it into the framework for parallel processing of large graphs. Our experiments show that CuSha provides significant speedups over the state-of-the-art CSR-based virtual warp-centric method for processing graphs on GPUs.