ABSTRACT
There has been a proliferation of task-parallel programming systems to address the needs of multicore programmers. Production task-parallel systems include Cilk++, Intel Threading Building Blocks, Java Concurrency, .NET Task Parallel Library, and OpenMP 3.0, and research task-parallel languages include Cilk, Chapel, Fortress, X10, and Habanero-Java (HJ). For forward scalability and portability, it is desirable for programmers to express all the parallelism intrinsic to their algorithms, but the overhead incurred by doing so can be prohibitively large in today's systems. In this paper, we address the problem of reducing the total overhead a program incurs due to excessive task creation and termination. We introduce a transformation framework to optimize task-parallel programs with finish, forall, and next statements. Our approach includes the elimination of redundant task creation and termination operations as well as the strength reduction of termination operations (finish) to lighter-weight synchronizations (next). Experimental results were obtained on three platforms: a dual-socket 128-thread (16-core) Niagara T2 system, a quad-socket 16-way Intel Xeon SMP, and a quad-socket 32-way Power7 SMP. The results show maximum speedups of 66.7x, 11.25x, and 23.1x on the three platforms, respectively, and geometric-mean performance improvements of 4.6x, 2.1x, and 6.4x relative to the unoptimized parallel codes. The benchmarks in this study were written with medium-grained parallelism; a larger relative improvement can be expected for programs written with finer-grained parallelism. Even for these medium-grained benchmarks, however, the significant improvements obtained underscore the importance of the compiler optimizations introduced in this paper.
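The strength-reduction idea described in the abstract can be sketched in plain Java: a loop that performs a full fork/join (a finish) on every iteration is rewritten so that worker tasks are created once and the per-iteration join is replaced by a lighter-weight barrier (a next), modeled here with java.util.concurrent.Phaser. This is an illustrative sketch of the transformation, not the paper's HJ implementation; the class and method names below are hypothetical.

```java
import java.util.concurrent.Phaser;

public class StrengthReductionSketch {
    static final int TASKS = 4, ITERS = 100;
    static final int[] data = new int[TASKS];

    // Before: a fork/join ("finish") on every outer iteration.
    // Each iteration pays full task creation and termination costs.
    static void finishPerIteration() throws InterruptedException {
        for (int t = 0; t < ITERS; t++) {
            Thread[] workers = new Thread[TASKS];
            for (int i = 0; i < TASKS; i++) {
                final int id = i;
                workers[id] = new Thread(() -> data[id]++);
                workers[id].start();
            }
            for (Thread w : workers) w.join();  // per-iteration "finish"
        }
    }

    // After: tasks are created once; the per-iteration "finish" is
    // strength-reduced to a barrier ("next") among the live tasks.
    static void barrierPerIteration() throws InterruptedException {
        Phaser barrier = new Phaser(TASKS);
        Thread[] workers = new Thread[TASKS];
        for (int i = 0; i < TASKS; i++) {
            final int id = i;
            workers[id] = new Thread(() -> {
                for (int t = 0; t < ITERS; t++) {
                    data[id]++;
                    barrier.arriveAndAwaitAdvance();  // "next": no task teardown
                }
            });
            workers[id].start();
        }
        for (Thread w : workers) w.join();  // single outer "finish"
    }

    public static void main(String[] args) throws InterruptedException {
        finishPerIteration();
        barrierPerIteration();
        // Both schemes do the same work: each slot incremented 2*ITERS times.
        for (int v : data) {
            if (v != 2 * ITERS) throw new AssertionError("got " + v);
        }
        System.out.println("ok");
    }
}
```

The transformed version preserves the per-iteration all-tasks-complete ordering that the original finish enforced, while amortizing task creation and termination over the whole loop rather than paying it once per iteration.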
Index Terms
- Reducing task creation and termination overhead in explicitly parallel programs