ABSTRACT
We present automatic data layout transformation as an effective compiler performance optimization for memory-bound structured grid applications. Structured grid applications include stencil codes and other codes that use a dense, regular grid as their primary data structure. Fluid dynamics and heat distribution simulations, which solve partial differential equations on a discretized representation of space, are representative of many important structured grid applications.
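As a concrete illustration of the class of codes described above, the following is a minimal 2D Jacobi relaxation sweep for the heat equation, written with C99 variable-length array parameters so that the full multidimensional index expression is visible to the compiler even for dynamically sized grids. This is an illustrative sketch, not a kernel from the paper; `jacobi_step` and its signature are our own naming.

```c
/* One Jacobi relaxation sweep for the 2D heat equation on an n x n grid.
 * C99 variable-length array parameters (double in[n][n]) let the compiler
 * see the index expression in[i][j] directly, even though the grid size n
 * is only known at run time. Illustrative sketch, not code from the paper. */
void jacobi_step(int n, double in[n][n], double out[n][n])
{
    for (int i = 1; i < n - 1; i++)
        for (int j = 1; j < n - 1; j++)
            /* Each interior cell becomes the average of its four
             * von Neumann neighbors -- the classic 5-point stencil. */
            out[i][j] = 0.25 * (in[i - 1][j] + in[i + 1][j] +
                                in[i][j - 1] + in[i][j + 1]);
}
```

A caller can allocate such a grid dynamically with `double (*grid)[n] = malloc(n * sizeof *grid);` and pass it directly, which is exactly the dynamically allocated case the transformation targets.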
Using the information available through variable-length array syntax, standardized in C99 and adopted by other modern languages, we enable automatic data layout transformations for structured grid codes with dynamically allocated arrays. We also show how a tool can guide these transformations, statically choosing a good layout given a model of the memory system, using a modern GPU as an example. A transformed layout that distributes concurrent memory requests among parallel memory system components substantially speeds up structured grid applications by improving their achieved memory-level parallelism. Even with the overhead of more complex address calculations, we observe up to a 560% performance increase over the language-defined layout, and a 7% gain in the worst case, in which the language-defined layout and access pattern are already well suited to vectorization by the underlying hardware.
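The kind of layout transformation described above can be sketched as a change of index function: the language-defined row-major index is replaced by a bijective remapping that spreads concurrently issued accesses across memory partitions. The tile width `TILE` below stands in for an assumed partition stride; neither the constant nor the specific mapping is taken from the paper.

```c
#include <stddef.h>

/* Language-defined row-major index for an ny x nx grid. */
static inline size_t idx_row_major(size_t y, size_t x, size_t nx)
{
    return y * nx + x;
}

/* Illustrative transformed layout (our sketch, not the paper's exact
 * mapping): split x into (tile, lane) with the tile width matched to an
 * assumed memory-partition stride, then make the tile index the
 * slowest-varying dimension. Tiles from neighboring rows that are issued
 * concurrently then land in different partitions, improving memory-level
 * parallelism. The mapping is a bijection whenever nx is a multiple of
 * TILE. */
#define TILE 64  /* assumed partition stride, in elements */
static inline size_t idx_tiled(size_t y, size_t x, size_t ny, size_t nx)
{
    (void)nx;                 /* needed only to validate x < nx */
    size_t tx = x / TILE;     /* which tile along the row  */
    size_t lx = x % TILE;     /* lane within the tile      */
    return (tx * ny + y) * TILE + lx;
}
```

Because the remapping is a pure index function, the compiler can substitute it at every array access site; the extra divide and modulo are the "more complex address calculations" whose overhead the measured speedups already include.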
Index Terms: Data layout transformation exploiting memory-level parallelism in structured grid many-core applications