ABSTRACT
Modern HPC systems are growing in complexity as they move toward deeper memory hierarchies and increasing computational heterogeneity via GPUs and other accelerators. When developing applications for these platforms, programmers face two unappealing choices. On one hand, they can explicitly manage all machine resources, writing programs decorated with low-level primitives from multiple APIs (e.g., hybrid MPI/OpenMP applications). Though seemingly necessary for efficient execution, this is an inherently non-scalable way to write software: without a separation of concerns, only small programs written by expert developers actually achieve this efficiency, and the resulting implementations are rigid, difficult to extend, and not portable. Alternatively, users can adopt higher-level programming environments that abstract away these concerns. Extensibility and portability, however, often come at the cost of performance, because the mapping of the user's application onto the system now occurs without the contextual information that was immediately available in the more tightly coupled approach.
In this paper, we describe a framework for transferring high-level, application-semantic knowledge into lower levels of the software stack at an appropriate level of abstraction. Using the STAPL library, we demonstrate how this information guides important decisions in the runtime system (STAPL-RTS), such as multi-protocol communication coordination and request aggregation. Through examples, we show how generic programming idioms already familiar to C++ programmers can be used to annotate calls and increase performance.
STAPL-RTS: An Application Driven Runtime System