research-article

Avoiding hot-spots on two-level direct networks

Authors:
Abhinav Bhatele

Lawrence Livermore National Laboratory, Livermore, CA

Lawrence Livermore National Laboratory, Livermore, CA
View Profile

,
Nikhil Jain

University of Illinois at Urbana-Champaign, Urbana, IL

University of Illinois at Urbana-Champaign, Urbana, IL
View Profile

,
William D. Gropp

University of Illinois at Urbana-Champaign, Urbana, IL

University of Illinois at Urbana-Champaign, Urbana, IL
View Profile

,
Laxmikant V. Kale

University of Illinois at Urbana-Champaign, Urbana, IL

University of Illinois at Urbana-Champaign, Urbana, IL
View Profile

SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and AnalysisNovember 2011Article No.: 76Pages 1–11https://doi.org/10.1145/2063384.2063486

Published:12 November 2011Publication History

SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis

Pages 1–11

ABSTRACT

A low-diameter, fast interconnection network is going to be a prerequisite for building exascale machines. A two-level direct network has been proposed by several groups as a scalable design for future machines. IBM's PERCS topology and the dragonfly network discussed in the DARPA exascale hardware study are examples of this design. The presence of multiple levels in this design leads to hot-spots on a few links when processes are grouped together at the lowest level to minimize total communication volume. This is especially true for communication graphs with a small number of neighbors per task. Routing and mapping choices can impact the communication performance of parallel applications running on a machine with a two-level direct topology. This paper explores intelligent topology aware mappings of different communication patterns to the physical topology to identify cases that minimize link utilization. We also analyze the trade-offs between using direct and indirect routing with different mappings. We use simulations to study communication and overall performance of applications since there are no installations of two-level direct networks yet. This study raises interesting issues regarding the choice of job scheduling, routing and mapping for future machines.

References

T. Agarwal, A. Sharma, and L. V. Kalé. Topology-aware task mapping for reducing communication contention on large parallel machines. In Proceedings of IEEE International Parallel and Distributed Processing Symposium 2006, April 2006. Google ScholarDigital Library
B. Arimilli, R. Arimilli, V. Chung, S. Clark, W. Denzel, B. Drerup, T. Hoefler, J. Joyner, J. Lewis, J. Li, N. Ni, and R. Rajamony. The PERCS High-Performance Interconnect. In 2010 IEEE 18th Annual Symposium on High Performance Interconnects (HOTI), pages 75--82, August 2010. Google ScholarDigital Library
C. Bernard, T. Burch, T. A. DeGrand, C. DeTar, S. Gottlieb, U. M. Heller, J. E. Hetrick, K. Orginos, B. Sugar, and D. Toussaint. Scaling tests of the improved Kogut-Susskind quark action. Physical Review D, (61), 2000.Google ScholarCross Ref
G. Bhanot, A. Gara, P. Heidelberger, E. Lawless, J. C. Sexton, and R. Walkup. Optimizing task layout on the Blue Gene/L supercomputer. IBM Journal of Research and Development, 49(2/3):489--500, 2005. Google ScholarDigital Library
A. Bhatelé, E. Bohm, and L. V. Kalé. Optimizing communication for Charm++ applications by reducing network contention. Concurrency and Computation: Practice & Experience, 23(2):211--222, February 2011. Google ScholarDigital Library
A. Bhatele, S. Kumar, C. Mei, J. C. Phillips, G. Zheng, and L. V. Kale. Overcoming scaling challenges in biomolecular simulations across multiple platforms. In Proceedings of IEEE International Parallel and Distributed Processing Symposium 2008, pages 1--12, April 2008.Google ScholarCross Ref
F. Ercal and J. Ramanujam and P. Sadayappan. Task allocation onto a hypercube by recursive mincut bipartitioning. In Proceedings of the 3rd conference on Hypercube concurrent computers and applications, pages 210--221. ACM Press, 1988. Google ScholarDigital Library
B. G. Fitch, A. Rayshubskiy, M. Eleftheriou, T. J. C. Ward, M. Giampapa, and M. C. Pitman. Blue matter: Approaching the limits of concurrency for classical molecular dynamics. In SC '06: Proceedings of the 2006 ACM/IEEE conference on Supercomputing, New York, NY, USA, 2006. ACM Press. Google ScholarDigital Library
Francine Berman and Lawrence Snyder. On mapping parallel algorithms into parallel architectures. Journal of Parallel and Distributed Computing, 4(5):439--458, 1987. Google ScholarDigital Library
F. Gygi, E. W. Draeger, M. Schulz, B. R. D. Supinski, J. A. Gunnels, V. Austel, J. C. Sexton, F. Franchetti, S. Kral, C. Ueberhuber, and J. Lorenz. Large-Scale Electronic Structure Calculations of High-Z Metals on the Blue Gene/L Platform. In Proceedings of the International Conference in Supercomputing. ACM Press, 2006. Google ScholarDigital Library
T. Hoefler and M. Snir. Generic topology mapping strategies for large-scale parallel architectures. In Proceedings of the international conference on Supercomputing, ICS '11, pages 75--84, New York, NY, USA, 2011. ACM. Google ScholarDigital Library
C. Huang, G. Zheng, S. Kumar, and L. V. Kalé. Performance Evaluation of Adaptive MPI. In Proceedings of ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming 2006, March 2006. Google ScholarDigital Library
L. Kalé and S. Krishnan. CHARM++: A Portable Concurrent Object Oriented System Based on C++. In A. Paepcke, editor, Proceedings of OOPSLA'93, pages 91--108. ACM Press, September 1993. Google ScholarDigital Library
L. V. Kale, G. Zheng, C. W. Lee, and S. Kumar. Scaling applications to massively parallel machines using projections performance analysis tool. In Future Generation Computer Systems Special Issue on: Large-Scale System Performance Modeling and Analysis, volume 22, pages 347--358, February 2006. Google ScholarDigital Library
J. Kim, W. J. Dally, S. Scott, and D. Abts. Technology-driven, highly-scalable dragonfly topology. SIGARCH Comput. Archit. News, 36:77--88, June 2008. Google ScholarDigital Library
P. Kogge, K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, J. Hiller, S. Karp, S. Keckler, D. Klein, R. Lucas, M. Richards, A. Scarpelli, S. Scott, A. Snavely, T. Sterling, R. S. Williams, and K. Yelick. Exascale computing study: Technology challenges in achieving exascale systems, 2008.Google Scholar
Michalakes, J., J. Dudhia, D. Gill, T. Henderson, J. Klemp, W. Skamarock, and W. Wang. The Weather Research and Forecast Model: Software Architecture and Performance. In Proceedings of the 11th ECMWF Workshop on the Use of High Performance Computing In Meteorology, October 2004.Google Scholar
S. Wayne Bollinger and Scott F. Midkiff. Processor and Link Assignment in Multicomputers Using Simulated Annealing. In ICPP (1), pages 1--7, 1988.Google Scholar
Soo-Young Lee and J. K. Aggarwal. A Mapping Strategy for Parallel Processing. IEEE Trans. Computers, 36(4):433--442, 1987. Google ScholarDigital Library
H. Yu, I.-H. Chung, and J. Moreira. Topology mapping for Blue Gene/L supercomputer. In SC '06: Proceedings of the 2006 ACM/IEEE conference on Supercomputing, page 116, New York, NY, USA, 2006. ACM. Google ScholarDigital Library
G. Zheng, T. Wilmarth, P. Jagadishprasad, and L. V. Kalé. Simulation-based performance prediction for large parallel machines. In International Journal of Parallel Programming, volume 33, pages 183--207, 2005. Google ScholarDigital Library

Index Terms

Avoiding hot-spots on two-level direct networks

Recommendations

Triplet-based topology for on-chip networks

Most CMPs use on-chip network to connect cores and tend to integrate more simple cores on a single die. As the number of cores increases, on-chip network will play an important role in the performance of future CMPs. Due to the tradeoff between the ...
Read More
Topology aware task mapping techniques: an api and case study
PPoPP '09: Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming

Optimal network performance is critical to efficient parallel scaling for communication-bound applications on large machines. With wormhole routing, no-load latencies do not increase significantly with number of hops traveled. Yet, we, and others have ...
Read More
Topology aware task mapping techniques: an api and case study
PPoPP '09

Optimal network performance is critical to efficient parallel scaling for communication-bound applications on large machines. With wormhole routing, no-load latencies do not increase significantly with number of hops traveled. Yet, we, and others have ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
November 2011
866 pages
ISBN:9781450307710
DOI:10.1145/2063384
Conference Chair:
Scott Lathrop
University of Chicago
,
Program Chairs:
Jim Costa
Sandia National Laboratories
,
William Kramer
National Center for Supercomputing Applications
Copyright © 2011 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 12 November 2011
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
communication
dragonfly network
exascale
mapping
performance
Qualifiers
- research-article
Conference

Acceptance Rates
SC '11 Paper Acceptance Rate74of352submissions,21%Overall Acceptance Rate1,516of6,373submissions,24%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 46
  Total Citations
  View Citations
- 249
  Total Downloads
- Downloads (Last 12 months)11
- Downloads (Last 6 weeks)1
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Avoiding hot-spots on two-level direct networks

SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis

ABSTRACT

References

Cited By

Index Terms

Recommendations

Triplet-based topology for on-chip networks

Topology aware task mapping techniques: an api and case study

Topology aware task mapping techniques: an api and case study

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Avoiding hot-spots on two-level direct networks

SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis

ABSTRACT

References

Cited By

Index Terms

Recommendations

Triplet-based topology for on-chip networks

Topology aware task mapping techniques: an api and case study

Topology aware task mapping techniques: an api and case study

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media