DOI: 10.1145/2903150.2903166 · Research article

Application characterization at scale: lessons learned from developing a distributed open community runtime system for high performance computing

Published: 16 May 2016

ABSTRACT

Since 2012, the U.S. Department of Energy's X-Stack program has been developing solutions, including runtime systems, programming models, languages, compilers, and tools, for exascale system software that address crucial performance and power requirements. Fine-grain programming models and runtime systems show great potential to utilize the underlying hardware efficiently, and they are therefore central to many X-Stack efforts. An abundance of small tasks can better exploit the vast parallelism available on current and future machines. Moreover, finer tasks can recover faster and adapt better because each carries less state and control.

Nevertheless, current applications have been written to exploit older paradigms, such as Communicating Sequential Processes (CSP) and Bulk Synchronous Parallel (BSP) processing. To fully exploit the advantages of the new systems, applications need to be adapted to the new paradigms. As part of the porting process, in-depth characterization studies, focused on both application characteristics and runtime features, are needed to understand an application's performance bottlenecks and how to resolve them.
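
To make the target paradigm concrete, the sketch below shows the shape of a minimal event-driven program under the public OCR v1.0 C API: computation is expressed as event-driven tasks (EDTs) created from templates, and the runtime schedules an EDT once all of its dependences are satisfied. This is an illustrative sketch based on the published specification, not code from the paper.

    #include "ocr.h"

    /* An event-driven task (EDT): the runtime invokes it once all of its
       dependences (none here) have been satisfied. */
    ocrGuid_t helloEdt(u32 paramc, u64 *paramv, u32 depc, ocrEdtDep_t depv[]) {
        PRINTF("Hello from an EDT\n");
        ocrShutdown();      /* signal the runtime that the program is done */
        return NULL_GUID;
    }

    /* Under OCR, mainEdt replaces main(); the runtime creates it at startup. */
    ocrGuid_t mainEdt(u32 paramc, u64 *paramv, u32 depc, ocrEdtDep_t depv[]) {
        ocrGuid_t tmpl, edt;
        /* A template binds a task function to its parameter/dependence counts. */
        ocrEdtTemplateCreate(&tmpl, helloEdt, 0 /* paramc */, 0 /* depc */);
        /* With zero dependences, the new EDT is immediately runnable. */
        ocrEdtCreate(&edt, tmpl, EDT_PARAM_DEF, NULL, EDT_PARAM_DEF, NULL,
                     EDT_PROP_NONE, NULL_GUID, NULL);
        return NULL_GUID;
    }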

This paper presents a characterization study of a novel high-performance runtime system, the Open Community Runtime (OCR), using key HPC kernels as its vehicle. The study makes the following contributions: one of the first high-performance, fine-grain, distributed-memory runtime systems implementing the OCR standard (version 0.99a); and a characterization of key HPC kernels in terms of runtime primitives in both intra- and inter-node environments. Running on a general-purpose cluster, we found up to a 1635x relative speed-up for a parallel tiled Cholesky kernel on 128 nodes with 16 cores each, and a 1864x relative speed-up for a parallel tiled Smith-Waterman kernel on 128 nodes with 30 cores.
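
To put these figures in perspective, the speed-ups can be converted into parallel efficiencies. The abstract does not state the baseline of the relative speed-up, so assume, for illustration only, a single-core baseline and ideal linear scaling across all P cores:

\[ E = \frac{S}{P}, \qquad E_{\mathrm{Cholesky}} = \frac{1635}{128 \times 16} \approx 0.80, \qquad E_{\mathrm{SW}} = \frac{1864}{128 \times 30} \approx 0.49 \]

Under that assumption, the Cholesky kernel retains roughly 80% of ideal scaling at 2048 cores; if the baseline is instead a full node, the efficiencies would be correspondingly higher.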


Published in

CF '16: Proceedings of the ACM International Conference on Computing Frontiers
May 2016, 487 pages
ISBN: 978-1-4503-4128-8
DOI: 10.1145/2903150
General Chairs: Gianluca Palermo, John Feo
Program Chairs: Antonino Tumeo, Hubertus Franke

Copyright © 2016 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

CF '16 paper acceptance rate: 30 of 94 submissions, 32%. Overall acceptance rate: 240 of 680 submissions, 35%.
