Abstract
Task-based programming offers an elegant way to express units of computation and the dependencies among them, making it easier to distribute the computational load evenly across multiple cores. However, this separation of problem decomposition and parallelism requires a sufficiently large input problem to achieve satisfactory efficiency on a given number of cores. Unfortunately, finding a good match between input size and core count usually requires significant experimentation, which is expensive and sometimes even impractical. In this paper, we propose an automated empirical method for finding the isoefficiency function of a task-based program, binding efficiency, core count, and input size in one analytical expression. This allows the latter two to be adjusted according to given (realistic) efficiency objectives. Moreover, we find not only (i) the actual isoefficiency function but also (ii) the function one would obtain if the program execution were free of resource contention and (iii) an upper bound that could only be reached if the program were able to maintain its average parallelism throughout its execution. The differences between these three functions help explain low efficiency, and in particular, they help differentiate between resource contention and structural conflicts related to task dependencies or scheduling. The insights gained can be used to co-design programs and shared system resources.
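For context, the notion of an isoefficiency function can be stated compactly. The following is a minimal sketch of the classical formulation introduced by Grama, Gupta, and Kumar, under the assumption that the problem size W is measured in units of sequential work (so that T_1 = W), p denotes the core count, and T_o(W, p) the total parallel overhead:

\[
  E(W,p) \;=\; \frac{T_1}{p\,T_p} \;=\; \frac{W}{W + T_o(W,p)} \;=\; \frac{1}{1 + T_o(W,p)/W}
\]

Fixing E at a target value and solving for W yields the isoefficiency function:

\[
  W \;=\; \frac{E}{1-E}\,T_o(W,p)
\]

A program whose overhead T_o grows slowly with p needs only a modest increase in W to sustain a given efficiency. Whereas this classical formulation is derived analytically, the method proposed in the paper determines such expressions empirically from measurements of the task-based program.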