research-article

Efficient execution of time-step computations with pipelined parallelism and inter-thread data locality optimizaitions

Author:
Apan Qasem

Texas State University, San Marcos, TX

Texas State University, San Marcos, TX
View Profile

PMAM '12: Proceedings of the 2012 International Workshop on Programming Models and Applications for Multicores and ManycoresFebruary 2012Pages 27–35https://doi.org/10.1145/2141702.2141706

Published:26 February 2012Publication History

PMAM '12: Proceedings of the 2012 International Workshop on Programming Models and Applications for Multicores and Manycores

Pages 27–35

ABSTRACT

This paper presents a strategy that integrates a set of compiler optimizations and analysis techniques that enable the detection and transformation of time step loops for efficient execution on multicore platforms. Time-step computations, which appear frequently in scientific applications, are amenable to pipelined parallelism and exhibit a high degree of temporal locality. However, striking the right balance between data locality and parallelism often proves difficult, particularly for current multicore architectures where one or more levels of the memory hierarchy is shared among multiple processing units. Our proposed strategy addresses performance issues related to both data locality and parallelism. By carefully orchestrating a set of source-to-source transformations, our technique exposes fine-grain parallelism within a time-step loop, while improving its cache utilization and reducing its bandwidth requirements. Preliminary experiments with two time-step applications on three multicore platforms show that that the code variants generated by our strategy have significantly fewer misses in the shared caches and also achieve better execution times through reduced synchronization costs.

References

Stencilprobe: A microbenchmark for stencil applications.Google Scholar
R. Allen and K. Kennedy. Optimizing Compilers for Modern Architectures. Morgan Kaufmann, 2002. Google ScholarDigital Library
S. Coleman and K. S. McKinley. Tile size selection using cache organization. In Proceedings of the SIGPLAN '95 Conference on Programming Language Design and Implementation, La Jolla, CA, June 1995. Google ScholarDigital Library
C. Ding and K. Kennedy. Improving effective bandwidth through compiler enhancement of global cache reuse. In International Parallel and Distributed Processing Symposium, San Francisco, CA, Apr. 2001. (Best Paper Award.). Google ScholarDigital Library
G. Gao, R. Olsen, V. Sarkar, and R. Thekkath. Collective loop fusion for array contraction. In Proceedings of the Fifth Workshop on Languages and Compilers for Parallel Computing, New Haven, CT, Aug. 1992. Google ScholarDigital Library
G. Jin, J. Mellor-Crummey, and R. Fowler. Increasing temporal locality with skewing and recursive blocking. In Proceedings of SC2001, Denver, CO, Nov 2001. Google ScholarDigital Library
S. Krishnamoorthy, M. Baskaran, U. Bondhugula, J. Ramanujam, A. Rountev, and P. Sadayappan. Effective automatic parallelization of stencil computations. In PLDI '07: Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation, pages 235--244, New York, NY, USA, 2007. ACM. Google ScholarDigital Library
M. Kudlur and S. Mahlke. Orchestrating the execution of stream programs on multicore platforms. SIGPLAN Not., 43(6):114--124, 2008. Google ScholarDigital Library
J. Mccalpin and D. Wonnacott. Time skewing: A value-based approach to optimizing for memory locality. Technical report, In http://www.haverford.edu/cmsc/davew/cache-opt/cache-opt.html, 1999.Google Scholar
J. Mccalpin and D. Wonnacott. Time skewing: A value-based approach to optimizing for memory locality. Technical report, In http://www.haverford.edu/cmsc/davew/cache-opt/cache-opt.html, 1999.Google Scholar
J. Michalake, J. Dudhia, D. Gill, T. Henderson, J. Klemp, W. Skamarock, and W. Wang. The weather reseach and forecast model: Software architecture and performance. In Proceedings of the 11th ECMWF Workshop on the Use of High Performance Computing In Meteorology, 2004.Google Scholar
A. Qasem, G. Jin, and J. Mellor-Crummey. Improving performance with integrated program transformations. Technical Report CS-TR03-419, Dept. of Computer Science, Rice University, Oct. 2003.Google Scholar
Y. Song, R. Xu, C. Wang, and Z. Li. Data locality enhancement by memory reduction. In Proceedings of the 15th ACM International Conference on Supercomputing, Sorrento, Italy, June 2001. Google ScholarDigital Library
W. Thies, V. Chandrasekhar, and S. Amarasinghe. A practical approach to exploiting coarse-grained pipeline parallelism in c programs. In International Symposium on Microarchitecture, 2007. Google ScholarDigital Library
W. Thies, M. Karczmarek, and S. P. Amarasinghe. Streamit: A language for streaming applications. In Computational Complexity, pages 179--196, 2002. Google ScholarDigital Library
J. Treibig, G. Wellein, and G. Hager. Efficient multicore-aware parallelization strategies for iterative stencil computations. CoRR, abs/1004.1741, 2010.Google Scholar
N. Vachharajani, R. Rangan, E. Raman, M. J. Bridges, G. Ottoni, and D. I. August. Speculative decoupled software pipelining. In PACT '07: Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques, pages 49--59, Washington, DC, USA, 2007. IEEE Computer Society. Google ScholarDigital Library
S. N. Vadlamani and S. F. Jenks. The synchronized pipelined parallelism model. In The 16th IASTED International Conference on Parallel and Distributed Computing and Systems, 2004.Google Scholar
L. J. Wicker. NSSL collaborative model for atmospheric simulation (NCOMMAS). http://www.nssl.noaa.gov/~wicker/commas.html.Google Scholar
M. E. Wolf and M. Lam. A data locality optimizing algorithm. In Proceedings of the SIGPLAN '91 Conference on Programming Language Design and Implementation, Toronto, Canada, June 1991. Google ScholarDigital Library
M. J. Wolfe. Optimizing Supercompilers for Supercomputers. PhD thesis, Dept. of Computer Science, University of Illinois at Urbana-Champaign, Oct. 1982. Google ScholarDigital Library
D. Wonnacott. Time skewing for parallel computers. In LCPC '99: Proceedings of the 12th International Workshop on Languages and Compilers for Parallel Computing, pages 477--480, London, UK, 2000. Springer-Verlag. Google ScholarDigital Library
D. Wonnacott. Using time skewing to eliminate idle time due to memory bandwidth and network limitations. In IPDPS '00: Proceedings of the 14th International Symposium on Parallel and Distributed Processing, page 171, Washington, DC, USA, 2000. IEEE Computer Society. Google ScholarDigital Library

Index Terms

Efficient execution of time-step computations with pipelined parallelism and inter-thread data locality optimizaitions
1. Software and its engineering
  1. Software notations and tools
    1. Compilers
  2. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        Memory management
        Garbage collection

Recommendations

Thread-Aware Adaptive Prefetcher on Multicore Systems: Improving the Performance for Multithreaded Workloads

Most processors employ hardware data prefetching techniques to hide memory access latencies. However, the prefetching requests from different threads on a multicore processor can cause severe interference with prefetching and/or demand requests of ...
Read More
Optimizing stencil application on multi-thread GPU architecture using stream programming model
ARCS'10: Proceedings of the 23rd international conference on Architecture of Computing Systems

With fast development of GPU hardware and software, using GPUs to accelerate non-graphics CPU applications is becoming inevitable trend. GPUs are good at performing ALU-intensive computation and feature high peak performance; however, how to harness ...
Read More
Adjusting Thread Parallelism Dynamically to Accelerate Dynamic Programming with Irregular Workload Distribution on GPGPUs

Dynamic Programming (DP) is an important and popular method for solving a wide variety of discrete optimization problems such as scheduling, string-editing, packaging, and inventory management. DP breaks problems into simpler subproblems and combines ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
PMAM '12: Proceedings of the 2012 International Workshop on Programming Models and Applications for Multicores and Manycores
February 2012
180 pages
ISBN:9781450312110
DOI:10.1145/2141702
Conference Chairs:
Minyi Guo
Shanghai Jiao Tong University, China
,
Zhiyi Huang
University of Otago, New Zealand
Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 26 February 2012
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
multicore
optimization
pipelined parallelism
shared-cache
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate53of97submissions,55%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 0
  Total Citations
  View Citations
- 101
  Total Downloads
- Downloads (Last 12 months)0
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
This publication has not been cited yet

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Efficient execution of time-step computations with pipelined parallelism and inter-thread data locality optimizaitions

PMAM '12: Proceedings of the 2012 International Workshop on Programming Models and Applications for Multicores and Manycores

ABSTRACT

References

Cited By

Index Terms

Recommendations

Thread-Aware Adaptive Prefetcher on Multicore Systems: Improving the Performance for Multithreaded Workloads

Optimizing stencil application on multi-thread GPU architecture using stream programming model

Adjusting Thread Parallelism Dynamically to Accelerate Dynamic Programming with Irregular Workload Distribution on GPGPUs