research-article

Application characterization at scale: lessons learned from developing a distributed open community runtime system for high performance computing

Authors:
Joshua Landwehr, Pacific Northwest National Lab, Richland, WA
Joshua Suetterlein, University of Delaware, Newark, DE
Andrés Márquez, Pacific Northwest National Lab, Richland, WA
Joseph Manzano, Pacific Northwest National Lab, Richland, WA
Guang R. Gao, University of Delaware, Newark, DE

CF '16: Proceedings of the ACM International Conference on Computing Frontiers, May 2016, Pages 164–171. https://doi.org/10.1145/2903150.2903166

Published: 16 May 2016

ABSTRACT

Since 2012, the U.S. Department of Energy's X-Stack program has been developing solutions, including runtime systems, programming models, languages, compilers, and tools, for Exascale system software to address crucial performance and power requirements. Fine-grain programming models and runtime systems show great potential to utilize the underlying hardware efficiently, and are therefore essential to many X-Stack efforts. An abundance of small tasks can better exploit the vast parallelism available on current and future machines. Moreover, finer tasks can recover faster and adapt better because each carries less state and control.

Nevertheless, current applications have been written to exploit older paradigms (such as Communicating Sequential Processes and Bulk Synchronous Parallel processing). To take full advantage of these new systems, applications need to be adapted to the new paradigms. As part of the porting process, in-depth characterization studies, focused on both application characteristics and runtime features, are needed to fully understand application performance bottlenecks and how to resolve them.

This paper presents a characterization study of a novel high performance runtime system, called the Open Community Runtime (OCR), using key HPC kernels as its vehicle. The study makes the following contributions: one of the first high performance, fine-grain, distributed-memory runtime systems implementing the OCR standard (version 0.99a); and a characterization study of key HPC kernels in terms of runtime primitives in both intra-node and inter-node environments. Running on a general purpose cluster, we found up to a 1635x relative speed-up for a parallel tiled Cholesky kernel on 128 nodes with 16 cores each and a 1864x relative speed-up for a parallel tiled Smith-Waterman kernel on 128 nodes with 30 cores each.
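To put the two speed-up figures in context, the short Python sketch below converts each into a parallel efficiency (speed-up divided by total core count). It rests on assumptions the abstract does not confirm: that "relative speed-up" is measured against a single-core run, and that the Smith-Waterman runs used 30 cores on each of the 128 nodes. The function name `parallel_efficiency` is illustrative only.

```python
def parallel_efficiency(speedup: float, nodes: int, cores_per_node: int) -> float:
    """Return speed-up divided by the total number of cores used.

    Assumes the speed-up baseline is one core, which the abstract does not state.
    """
    total_cores = nodes * cores_per_node
    return speedup / total_cores


# Speed-up figures and machine sizes quoted in the abstract.
cholesky = parallel_efficiency(1635, nodes=128, cores_per_node=16)
smith_waterman = parallel_efficiency(1864, nodes=128, cores_per_node=30)

print(f"Cholesky:       {cholesky:.1%} of {128 * 16} cores")
print(f"Smith-Waterman: {smith_waterman:.1%} of {128 * 30} cores")
```

Under these assumptions the Cholesky run reaches roughly 80% efficiency on 2048 cores and the Smith-Waterman run roughly 49% on 3840 cores. A different baseline would change both numbers.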


Published in
CF '16: Proceedings of the ACM International Conference on Computing Frontiers
May 2016
487 pages
ISBN: 9781450341288
DOI: 10.1145/2903150
General Chairs: Gianluca Palermo (Politecnico di Milano, IT), John Feo (Pacific Northwest National Laboratory and Northwest Institute for Advanced Computing)
Program Chairs: Antonino Tumeo (Pacific Northwest National Laboratory, USA), Hubertus Franke (New York University and IBM Research, USA)
Copyright © 2016 ACM
© 2016 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.
Publisher: Association for Computing Machinery, New York, NY, United States
Acceptance Rates
CF '16 paper acceptance rate: 30 of 94 submissions, 32%. Overall acceptance rate: 240 of 680 submissions, 35%.