ABSTRACT
Widely used GPGPU programming models and compiler techniques make it possible to optimize data-parallel programs on commodity GPUs. However, mapping GPGPU applications written for discrete GPUs onto emerging integrated heterogeneous architectures, such as the AMD Fusion APU and Intel Sandy Bridge/Ivy Bridge, which place the CPU and the GPU on the same die, has not been well studied.
Classic time-step simulation applications, as represented by agent-based models, have an intrinsically parallel structure that fits GPGPU architectures well. However, when these applications are mapped directly to integrated GPUs, performance may degrade because integrated GPUs have fewer compute units and lower clock speeds.
This paper proposes an optimization to the GPGPU implementation of an agent-based model and illustrates it with a traffic simulation example. The optimization adapts the algorithm by moving part of the workload to the CPU, leveraging the integrated architecture and its on-chip memory bus, which is faster than the PCIe bus that connects a discrete GPU to the host. Experiments on a discrete AMD Radeon GPU and an AMD Fusion APU demonstrate that the optimization achieves a 1.08--2.71x speedup on the integrated architecture over the discrete platform.
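The general idea of splitting a time-step agent update between two processors can be sketched as follows. This is a minimal illustration and not the paper's algorithm: the ring-road car-following rule, the constants, and the names `update_agent`, `step`, `simulate`, and `cpu_fraction` are all hypothetical, and both shares run as ordinary Python loops here, whereas on a real APU one share would be a GPU kernel. The sketch only shows that, because every agent reads the previous step's state and writes to a separate buffer, the agents can be partitioned between devices without changing the result.

```python
from typing import List, Tuple

ROAD_LENGTH = 100.0  # hypothetical ring-road length
DT = 0.5             # hypothetical time step
V_MAX = 10.0         # hypothetical speed limit


def update_agent(i: int, pos: List[float]) -> Tuple[float, float]:
    """Next (position, velocity) of car i, read only from the old state."""
    n = len(pos)
    # Distance to the car ahead on the ring (cars are indexed in road order).
    gap = (pos[(i + 1) % n] - pos[i]) % ROAD_LENGTH
    # Toy rule (an assumption): cover at most half the gap per step.
    new_vel = min(V_MAX, 0.5 * gap / DT)
    new_pos = (pos[i] + new_vel * DT) % ROAD_LENGTH
    return new_pos, new_vel


def step(pos: List[float], cpu_fraction: float) -> Tuple[List[float], List[float]]:
    """One time step; agents [0, split) stand in for the CPU share,
    agents [split, n) for the GPU share."""
    n = len(pos)
    split = int(n * cpu_fraction)
    new_pos = [0.0] * n
    new_vel = [0.0] * n
    for i in range(split):        # "CPU" share
        new_pos[i], new_vel[i] = update_agent(i, pos)
    for i in range(split, n):     # "GPU" share (would be a kernel launch)
        new_pos[i], new_vel[i] = update_agent(i, pos)
    return new_pos, new_vel


def simulate(pos: List[float], steps: int, cpu_fraction: float):
    """Run several steps, swapping old and new buffers after each one."""
    vel = [0.0] * len(pos)
    for _ in range(steps):
        pos, vel = step(pos, cpu_fraction)
    return pos, vel


if __name__ == "__main__":
    start = [0.0, 10.0, 25.0, 60.0]
    print(simulate(start, steps=5, cpu_fraction=0.5))
```

In this sketch the split point is a free parameter; choosing it well would depend on the relative throughput of the two processors and on the cost of sharing data between them, which is where a shared on-chip memory bus would matter.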
Index Terms
- Accelerating simulation of agent-based models on heterogeneous architectures