Abstract
Task-based programming offers an elegant way to express units of computation and the dependencies among them, making it easier to distribute the computational load evenly across multiple cores. However, this separation of problem decomposition and parallelism requires a sufficiently large input problem to achieve satisfactory efficiency on a given number of cores. Unfortunately, finding a good match between input size and core count usually requires significant experimentation, which is expensive and sometimes even impractical. In this paper, we propose an automated empirical method for finding the isoefficiency function of a task-based program, binding efficiency, core count, and input size in one analytical expression. This allows the latter two to be adjusted according to given (realistic) efficiency objectives. Moreover, we find not only (i) the actual isoefficiency function but also (ii) the function one would obtain if the program execution were free of resource contention and (iii) an upper bound that could only be reached if the program were able to maintain its average parallelism throughout its execution. The differences between these three functions help explain low efficiency, and in particular, they help differentiate between resource contention and structural conflicts related to task dependencies or scheduling. The insights gained can be used to co-design programs and shared system resources.
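For context, the notion of an isoefficiency function can be stated compactly. The following is a minimal sketch of the classical formulation introduced by Grama, Gupta, and Kumar, under the assumption that the problem size W is measured in units of sequential work (so that T_1 = W), p denotes the core count, and T_o(W, p) the total parallel overhead:

\[
  E(W,p) \;=\; \frac{T_1}{p\,T_p} \;=\; \frac{W}{W + T_o(W,p)} \;=\; \frac{1}{1 + T_o(W,p)/W}
\]

Fixing E at a target value and solving for W yields the isoefficiency function:

\[
  W \;=\; \frac{E}{1-E}\,T_o(W,p)
\]

A program whose overhead T_o grows slowly with p needs only a modest increase in W to sustain a given efficiency. Whereas this classical formulation is derived analytically, the method proposed in the paper determines such expressions empirically from measurements of the task-based program.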