ABSTRACT
Since 2012, the U.S. Department of Energy's X-Stack program has been developing solutions including runtime systems, programming models, languages, compilers, and tools for the Exascale system software to address crucial performance and power requirements. Fine-grain programming models and runtime systems show great potential to efficiently utilize the underlying hardware. Thus, they are essential to many X-Stack efforts. An abundance of small tasks can better utilize the vast parallelism available on current and future machines. Moreover, finer tasks can recover faster and adapt better because each carries less state and control.
Nevertheless, current applications have been written to exploit older paradigms (such as Communicating Sequential Processes and Bulk Synchronous Parallel processing). To fully utilize the advantages of these new systems, applications need to be adapted to the new paradigms. As part of the porting process, in-depth characterization studies, focused on both application characteristics and runtime features, need to take place to fully understand an application's performance bottlenecks and how to resolve them.
This paper presents a characterization study for a novel high performance runtime system, called the Open Community Runtime (OCR), using key HPC kernels as its vehicle. This study makes the following contributions: one of the first high performance, fine-grain, distributed memory runtime systems implementing the OCR standard (version 0.99a); and a characterization study of key HPC kernels in terms of runtime primitives running in both intra-node and inter-node environments. Running on a general purpose cluster, we have found up to a 1635x relative speed-up for a parallel tiled Cholesky kernel on 128 nodes with 16 cores each and a 1864x relative speed-up for a parallel tiled Smith-Waterman kernel on 128 nodes with 30 cores.
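To put the reported speed-ups in context, the short sketch below converts them into parallel efficiency (speed-up divided by total core count). It rests on two assumptions that the abstract does not state: that the "relative" speed-up is measured against a single-core run, and that "30 cores" means 30 cores per node. The function name `parallel_efficiency` is hypothetical and used only for illustration.

```python
def parallel_efficiency(speedup, nodes, cores_per_node):
    """Return speed-up divided by the total number of cores.

    Assumption: the speed-up baseline is a single core. If the paper's
    baseline is a single node instead, these figures do not apply.
    """
    total_cores = nodes * cores_per_node
    return speedup / total_cores


# Figures taken from the abstract; per-node core counts as assumed above.
cholesky = parallel_efficiency(1635, nodes=128, cores_per_node=16)
smith_waterman = parallel_efficiency(1864, nodes=128, cores_per_node=30)

print(f"Cholesky:       {cholesky:.1%} of ideal on {128 * 16} cores")
print(f"Smith-Waterman: {smith_waterman:.1%} of ideal on {128 * 30} cores")
```

Under these assumptions the Cholesky kernel reaches roughly 80% of ideal scaling on 2048 cores, and the Smith-Waterman kernel roughly 49% on 3840 cores.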
- Application characterization at scale: lessons learned from developing a distributed open community runtime system for high performance computing