Twin Peaks: A Software Platform for Heterogeneous Computing on General-Purpose and Graphics Processors

ABSTRACT
Modern processors are evolving into hybrid, heterogeneous processors with both CPU and GPU cores used for general-purpose computation. Several languages, such as Brook, CUDA, and more recently OpenCL, are being developed to fully harness the potential of these processors. In these languages, control code typically runs on the CPU while the performance-critical, data-parallel kernel code runs on the GPU.
In this paper, we present Twin Peaks, a software platform for heterogeneous computing that efficiently executes code originally targeted for GPUs on CPUs as well. This permits a more balanced execution between the CPU and GPU, and enables portability of code both between these architectures and to CPU-only environments. We propose several techniques in the runtime system to efficiently utilize the caches and functional units present in CPUs. Using OpenCL as a canonical language for heterogeneous computing, and running several experiments on real hardware, we show that our techniques enable GPGPU-style code to execute efficiently on multicore CPUs with minimal runtime overhead. These results also show that, for maximum performance, it is beneficial for applications to utilize both CPUs and GPUs as accelerator targets.