Research Article · Public Access

Isoefficiency in Practice: Configuring and Understanding the Performance of Task-based Applications

Published: 26 January 2017

Abstract

Task-based programming offers an elegant way to express units of computation and the dependencies among them, making it easier to distribute the computational load evenly across multiple cores. However, this separation of problem decomposition and parallelism requires a sufficiently large input problem to achieve satisfactory efficiency on a given number of cores. Unfortunately, finding a good match between input size and core count usually requires significant experimentation, which is expensive and sometimes even impractical. In this paper, we propose an automated empirical method for finding the isoefficiency function of a task-based program, binding efficiency, core count, and input size in one analytical expression. This allows the latter two to be adjusted according to given (realistic) efficiency objectives. Moreover, we find not only (i) the actual isoefficiency function but also (ii) the function one would obtain if the program execution were free of resource contention and (iii) an upper bound that could only be reached if the program were able to maintain its average parallelism throughout its execution. The differences among the three help to explain low efficiency and, in particular, to differentiate between resource contention and structural conflicts related to task dependencies or scheduling. The insights gained can be used to co-design programs and shared system resources.
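To make the central notion concrete, the sketch below illustrates (outside the paper's own tooling) how an isoefficiency function can be read off empirically. It uses a purely hypothetical timing model T(p, n) = n/p + c·p·log2(n), an illustrative assumption rather than a model from the paper, and numerically inverts the standard efficiency definition E(p, n) = T(1, n) / (p · T(p, n)) to find, for each core count p, the smallest input size n that sustains a target efficiency.

```python
import math

# Hypothetical timing model for a task-based program (illustrative
# assumption, not taken from the paper): perfect work division n/p
# plus a contention/scheduling overhead term that grows with p.
def runtime(p, n):
    """Synthetic execution time for input size n on p cores."""
    return n / p + 0.5 * p * math.log2(n)

def efficiency(p, n):
    """Parallel efficiency E(p, n) = T(1, n) / (p * T(p, n))."""
    return runtime(1, n) / (p * runtime(p, n))

def isoefficiency_point(p, target_e, n_max=10**12):
    """Smallest input size n sustaining efficiency >= target_e on p cores.

    In this model efficiency grows with n (for n >= 3), so a binary
    search over n suffices.
    """
    lo, hi = 2, n_max
    if efficiency(p, hi) < target_e:
        return None  # target unreachable within the search range
    while lo < hi:
        mid = (lo + hi) // 2
        if efficiency(p, mid) >= target_e:
            hi = mid
        else:
            lo = mid + 1
    return lo

if __name__ == "__main__":
    # Tabulating n(p) at a fixed efficiency target traces out the
    # isoefficiency function of this synthetic workload.
    for p in (2, 4, 8, 16, 32, 64):
        n = isoefficiency_point(p, target_e=0.8)
        print(f"p = {p:3d}: need n >= {n}")
```

In the paper's terminology, repeating this procedure with measured runtimes yields the actual isoefficiency function; a contention-free variant and the classical bound E ≤ π/p, where π = T(1, n)/T(∞, n) is the program's average parallelism, then delimit how much of the efficiency loss is due to contention and how much is structural.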



• Published in

  ACM SIGPLAN Notices, Volume 52, Issue 8 (PPoPP '17), August 2017, 442 pages
  ISSN: 0362-1340, EISSN: 1558-1160, DOI: 10.1145/3155284

  PPoPP '17: Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, January 2017, 476 pages
  ISBN: 978-1-4503-4493-7, DOI: 10.1145/3018743

  Copyright © 2017 ACM

  Publisher: Association for Computing Machinery, New York, NY, United States
