ABSTRACT
A low-diameter, fast interconnection network is going to be a prerequisite for building exascale machines. A two-level direct network has been proposed by several groups as a scalable design for future machines. IBM's PERCS topology and the dragonfly network discussed in the DARPA exascale hardware study are examples of this design. The presence of multiple levels in this design leads to hot-spots on a few links when processes are grouped together at the lowest level to minimize total communication volume. This is especially true for communication graphs with a small number of neighbors per task. Routing and mapping choices can impact the communication performance of parallel applications running on a machine with a two-level direct topology. This paper explores intelligent topology aware mappings of different communication patterns to the physical topology to identify cases that minimize link utilization. We also analyze the trade-offs between using direct and indirect routing with different mappings. We use simulations to study communication and overall performance of applications since there are no installations of two-level direct networks yet. This study raises interesting issues regarding the choice of job scheduling, routing and mapping for future machines.
- T. Agarwal, A. Sharma, and L. V. Kalé. Topology-aware task mapping for reducing communication contention on large parallel machines. In Proceedings of IEEE International Parallel and Distributed Processing Symposium 2006, April 2006. Google ScholarDigital Library
- B. Arimilli, R. Arimilli, V. Chung, S. Clark, W. Denzel, B. Drerup, T. Hoefler, J. Joyner, J. Lewis, J. Li, N. Ni, and R. Rajamony. The PERCS High-Performance Interconnect. In 2010 IEEE 18th Annual Symposium on High Performance Interconnects (HOTI), pages 75--82, August 2010. Google ScholarDigital Library
- C. Bernard, T. Burch, T. A. DeGrand, C. DeTar, S. Gottlieb, U. M. Heller, J. E. Hetrick, K. Orginos, B. Sugar, and D. Toussaint. Scaling tests of the improved Kogut-Susskind quark action. Physical Review D, (61), 2000.Google ScholarCross Ref
- G. Bhanot, A. Gara, P. Heidelberger, E. Lawless, J. C. Sexton, and R. Walkup. Optimizing task layout on the Blue Gene/L supercomputer. IBM Journal of Research and Development, 49(2/3):489--500, 2005. Google ScholarDigital Library
- A. Bhatelé, E. Bohm, and L. V. Kalé. Optimizing communication for Charm++ applications by reducing network contention. Concurrency and Computation: Practice & Experience, 23(2):211--222, February 2011. Google ScholarDigital Library
- A. Bhatele, S. Kumar, C. Mei, J. C. Phillips, G. Zheng, and L. V. Kale. Overcoming scaling challenges in biomolecular simulations across multiple platforms. In Proceedings of IEEE International Parallel and Distributed Processing Symposium 2008, pages 1--12, April 2008.Google ScholarCross Ref
- F. Ercal and J. Ramanujam and P. Sadayappan. Task allocation onto a hypercube by recursive mincut bipartitioning. In Proceedings of the 3rd conference on Hypercube concurrent computers and applications, pages 210--221. ACM Press, 1988. Google ScholarDigital Library
- B. G. Fitch, A. Rayshubskiy, M. Eleftheriou, T. J. C. Ward, M. Giampapa, and M. C. Pitman. Blue matter: Approaching the limits of concurrency for classical molecular dynamics. In SC '06: Proceedings of the 2006 ACM/IEEE conference on Supercomputing, New York, NY, USA, 2006. ACM Press. Google ScholarDigital Library
- Francine Berman and Lawrence Snyder. On mapping parallel algorithms into parallel architectures. Journal of Parallel and Distributed Computing, 4(5):439--458, 1987. Google ScholarDigital Library
- F. Gygi, E. W. Draeger, M. Schulz, B. R. D. Supinski, J. A. Gunnels, V. Austel, J. C. Sexton, F. Franchetti, S. Kral, C. Ueberhuber, and J. Lorenz. Large-Scale Electronic Structure Calculations of High-Z Metals on the Blue Gene/L Platform. In Proceedings of the International Conference in Supercomputing. ACM Press, 2006. Google ScholarDigital Library
- T. Hoefler and M. Snir. Generic topology mapping strategies for large-scale parallel architectures. In Proceedings of the international conference on Supercomputing, ICS '11, pages 75--84, New York, NY, USA, 2011. ACM. Google ScholarDigital Library
- C. Huang, G. Zheng, S. Kumar, and L. V. Kalé. Performance Evaluation of Adaptive MPI. In Proceedings of ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming 2006, March 2006. Google ScholarDigital Library
- L. Kalé and S. Krishnan. CHARM++: A Portable Concurrent Object Oriented System Based on C++. In A. Paepcke, editor, Proceedings of OOPSLA'93, pages 91--108. ACM Press, September 1993. Google ScholarDigital Library
- L. V. Kale, G. Zheng, C. W. Lee, and S. Kumar. Scaling applications to massively parallel machines using projections performance analysis tool. In Future Generation Computer Systems Special Issue on: Large-Scale System Performance Modeling and Analysis, volume 22, pages 347--358, February 2006. Google ScholarDigital Library
- J. Kim, W. J. Dally, S. Scott, and D. Abts. Technology-driven, highly-scalable dragonfly topology. SIGARCH Comput. Archit. News, 36:77--88, June 2008. Google ScholarDigital Library
- P. Kogge, K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, J. Hiller, S. Karp, S. Keckler, D. Klein, R. Lucas, M. Richards, A. Scarpelli, S. Scott, A. Snavely, T. Sterling, R. S. Williams, and K. Yelick. Exascale computing study: Technology challenges in achieving exascale systems, 2008.Google Scholar
- Michalakes, J., J. Dudhia, D. Gill, T. Henderson, J. Klemp, W. Skamarock, and W. Wang. The Weather Research and Forecast Model: Software Architecture and Performance. In Proceedings of the 11th ECMWF Workshop on the Use of High Performance Computing In Meteorology, October 2004.Google Scholar
- S. Wayne Bollinger and Scott F. Midkiff. Processor and Link Assignment in Multicomputers Using Simulated Annealing. In ICPP (1), pages 1--7, 1988.Google Scholar
- Soo-Young Lee and J. K. Aggarwal. A Mapping Strategy for Parallel Processing. IEEE Trans. Computers, 36(4):433--442, 1987. Google ScholarDigital Library
- H. Yu, I.-H. Chung, and J. Moreira. Topology mapping for Blue Gene/L supercomputer. In SC '06: Proceedings of the 2006 ACM/IEEE conference on Supercomputing, page 116, New York, NY, USA, 2006. ACM. Google ScholarDigital Library
- G. Zheng, T. Wilmarth, P. Jagadishprasad, and L. V. Kalé. Simulation-based performance prediction for large parallel machines. In International Journal of Parallel Programming, volume 33, pages 183--207, 2005. Google ScholarDigital Library
Index Terms
- Avoiding hot-spots on two-level direct networks
Recommendations
Triplet-based topology for on-chip networks
Most CMPs use on-chip network to connect cores and tend to integrate more simple cores on a single die. As the number of cores increases, on-chip network will play an important role in the performance of future CMPs. Due to the tradeoff between the ...
Topology aware task mapping techniques: an api and case study
PPoPP '09: Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programmingOptimal network performance is critical to efficient parallel scaling for communication-bound applications on large machines. With wormhole routing, no-load latencies do not increase significantly with number of hops traveled. Yet, we, and others have ...
Topology aware task mapping techniques: an api and case study
PPoPP '09Optimal network performance is critical to efficient parallel scaling for communication-bound applications on large machines. With wormhole routing, no-load latencies do not increase significantly with number of hops traveled. Yet, we, and others have ...
Comments