ABSTRACT
This paper presents a strategy that integrates a set of compiler optimizations and analysis techniques that enable the detection and transformation of time step loops for efficient execution on multicore platforms. Time-step computations, which appear frequently in scientific applications, are amenable to pipelined parallelism and exhibit a high degree of temporal locality. However, striking the right balance between data locality and parallelism often proves difficult, particularly for current multicore architectures where one or more levels of the memory hierarchy is shared among multiple processing units. Our proposed strategy addresses performance issues related to both data locality and parallelism. By carefully orchestrating a set of source-to-source transformations, our technique exposes fine-grain parallelism within a time-step loop, while improving its cache utilization and reducing its bandwidth requirements. Preliminary experiments with two time-step applications on three multicore platforms show that that the code variants generated by our strategy have significantly fewer misses in the shared caches and also achieve better execution times through reduced synchronization costs.
- Stencilprobe: A microbenchmark for stencil applications.Google Scholar
- R. Allen and K. Kennedy. Optimizing Compilers for Modern Architectures. Morgan Kaufmann, 2002. Google ScholarDigital Library
- S. Coleman and K. S. McKinley. Tile size selection using cache organization. In Proceedings of the SIGPLAN '95 Conference on Programming Language Design and Implementation, La Jolla, CA, June 1995. Google ScholarDigital Library
- C. Ding and K. Kennedy. Improving effective bandwidth through compiler enhancement of global cache reuse. In International Parallel and Distributed Processing Symposium, San Francisco, CA, Apr. 2001. (Best Paper Award.). Google ScholarDigital Library
- G. Gao, R. Olsen, V. Sarkar, and R. Thekkath. Collective loop fusion for array contraction. In Proceedings of the Fifth Workshop on Languages and Compilers for Parallel Computing, New Haven, CT, Aug. 1992. Google ScholarDigital Library
- G. Jin, J. Mellor-Crummey, and R. Fowler. Increasing temporal locality with skewing and recursive blocking. In Proceedings of SC2001, Denver, CO, Nov 2001. Google ScholarDigital Library
- S. Krishnamoorthy, M. Baskaran, U. Bondhugula, J. Ramanujam, A. Rountev, and P. Sadayappan. Effective automatic parallelization of stencil computations. In PLDI '07: Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation, pages 235--244, New York, NY, USA, 2007. ACM. Google ScholarDigital Library
- M. Kudlur and S. Mahlke. Orchestrating the execution of stream programs on multicore platforms. SIGPLAN Not., 43(6):114--124, 2008. Google ScholarDigital Library
- J. Mccalpin and D. Wonnacott. Time skewing: A value-based approach to optimizing for memory locality. Technical report, In http://www.haverford.edu/cmsc/davew/cache-opt/cache-opt.html, 1999.Google Scholar
- J. Mccalpin and D. Wonnacott. Time skewing: A value-based approach to optimizing for memory locality. Technical report, In http://www.haverford.edu/cmsc/davew/cache-opt/cache-opt.html, 1999.Google Scholar
- J. Michalake, J. Dudhia, D. Gill, T. Henderson, J. Klemp, W. Skamarock, and W. Wang. The weather reseach and forecast model: Software architecture and performance. In Proceedings of the 11th ECMWF Workshop on the Use of High Performance Computing In Meteorology, 2004.Google Scholar
- A. Qasem, G. Jin, and J. Mellor-Crummey. Improving performance with integrated program transformations. Technical Report CS-TR03-419, Dept. of Computer Science, Rice University, Oct. 2003.Google Scholar
- Y. Song, R. Xu, C. Wang, and Z. Li. Data locality enhancement by memory reduction. In Proceedings of the 15th ACM International Conference on Supercomputing, Sorrento, Italy, June 2001. Google ScholarDigital Library
- W. Thies, V. Chandrasekhar, and S. Amarasinghe. A practical approach to exploiting coarse-grained pipeline parallelism in c programs. In International Symposium on Microarchitecture, 2007. Google ScholarDigital Library
- W. Thies, M. Karczmarek, and S. P. Amarasinghe. Streamit: A language for streaming applications. In Computational Complexity, pages 179--196, 2002. Google ScholarDigital Library
- J. Treibig, G. Wellein, and G. Hager. Efficient multicore-aware parallelization strategies for iterative stencil computations. CoRR, abs/1004.1741, 2010.Google Scholar
- N. Vachharajani, R. Rangan, E. Raman, M. J. Bridges, G. Ottoni, and D. I. August. Speculative decoupled software pipelining. In PACT '07: Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques, pages 49--59, Washington, DC, USA, 2007. IEEE Computer Society. Google ScholarDigital Library
- S. N. Vadlamani and S. F. Jenks. The synchronized pipelined parallelism model. In The 16th IASTED International Conference on Parallel and Distributed Computing and Systems, 2004.Google Scholar
- L. J. Wicker. NSSL collaborative model for atmospheric simulation (NCOMMAS). http://www.nssl.noaa.gov/~wicker/commas.html.Google Scholar
- M. E. Wolf and M. Lam. A data locality optimizing algorithm. In Proceedings of the SIGPLAN '91 Conference on Programming Language Design and Implementation, Toronto, Canada, June 1991. Google ScholarDigital Library
- M. J. Wolfe. Optimizing Supercompilers for Supercomputers. PhD thesis, Dept. of Computer Science, University of Illinois at Urbana-Champaign, Oct. 1982. Google ScholarDigital Library
- D. Wonnacott. Time skewing for parallel computers. In LCPC '99: Proceedings of the 12th International Workshop on Languages and Compilers for Parallel Computing, pages 477--480, London, UK, 2000. Springer-Verlag. Google ScholarDigital Library
- D. Wonnacott. Using time skewing to eliminate idle time due to memory bandwidth and network limitations. In IPDPS '00: Proceedings of the 14th International Symposium on Parallel and Distributed Processing, page 171, Washington, DC, USA, 2000. IEEE Computer Society. Google ScholarDigital Library
Index Terms
- Efficient execution of time-step computations with pipelined parallelism and inter-thread data locality optimizaitions
Recommendations
Thread-Aware Adaptive Prefetcher on Multicore Systems: Improving the Performance for Multithreaded Workloads
Most processors employ hardware data prefetching techniques to hide memory access latencies. However, the prefetching requests from different threads on a multicore processor can cause severe interference with prefetching and/or demand requests of ...
Optimizing stencil application on multi-thread GPU architecture using stream programming model
ARCS'10: Proceedings of the 23rd international conference on Architecture of Computing SystemsWith fast development of GPU hardware and software, using GPUs to accelerate non-graphics CPU applications is becoming inevitable trend. GPUs are good at performing ALU-intensive computation and feature high peak performance; however, how to harness ...
Adjusting Thread Parallelism Dynamically to Accelerate Dynamic Programming with Irregular Workload Distribution on GPGPUs
Dynamic Programming (DP) is an important and popular method for solving a wide variety of discrete optimization problems such as scheduling, string-editing, packaging, and inventory management. DP breaks problems into simpler subproblems and combines ...
Comments