DOI: 10.1145/2141702.2141706
research-article

Efficient execution of time-step computations with pipelined parallelism and inter-thread data locality optimizations

Published: 26 February 2012

ABSTRACT

This paper presents a strategy that integrates a set of compiler optimizations and analysis techniques to detect time-step loops and transform them for efficient execution on multicore platforms. Time-step computations, which appear frequently in scientific applications, are amenable to pipelined parallelism and exhibit a high degree of temporal locality. However, striking the right balance between data locality and parallelism often proves difficult, particularly on current multicore architectures, where one or more levels of the memory hierarchy are shared among multiple processing units. Our proposed strategy addresses performance issues related to both data locality and parallelism. By carefully orchestrating a set of source-to-source transformations, our technique exposes fine-grain parallelism within a time-step loop while improving its cache utilization and reducing its bandwidth requirements. Preliminary experiments with two time-step applications on three multicore platforms show that the code variants generated by our strategy incur significantly fewer misses in the shared caches and also achieve better execution times through reduced synchronization costs.
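
To make the notion of a time-step loop concrete, the following is a minimal sketch of a one-dimensional three-point stencil executed under two schedules: a plain sweep per time step, and a skewed, blocked schedule in the spirit of time skewing. It is an illustration under stated assumptions, not the strategy described in the abstract: the function names `step_naive` and `step_skewed`, the averaging stencil, and the `block` parameter are all hypothetical, and the skewed version stores the full space-time grid purely to keep the dependence argument easy to follow.

```python
def step_naive(u0, steps):
    """Reference schedule: sweep the whole grid once per time step."""
    n = len(u0)
    prev = list(u0)
    for _ in range(steps):
        cur = list(prev)  # boundary cells keep their initial values
        for i in range(1, n - 1):
            cur[i] = (prev[i - 1] + prev[i] + prev[i + 1]) / 3.0
        prev = cur
    return prev


def step_skewed(u0, steps, block):
    """Skewed, blocked schedule over the coordinate j = i + t.

    Assumption: the full space-time grid is kept in memory so that only
    flow dependences matter; a tuned code would reuse a few buffers.
    """
    n = len(u0)
    # Every row starts as a copy of u0, so boundary cells are already set.
    grid = [list(u0) for _ in range(steps + 1)]
    j_lo, j_hi = 2, n - 2 + steps  # range of i + t over interior updates
    for jb in range(j_lo, j_hi + 1, block):       # blocks in skewed space
        for t in range(1, steps + 1):             # all time steps per block
            for j in range(jb, min(jb + block, j_hi + 1)):
                i = j - t                         # undo the skew
                if 1 <= i <= n - 2:
                    below = grid[t - 1]
                    grid[t][i] = (below[i - 1] + below[i] + below[i + 1]) / 3.0
    return grid[steps]
```

In skewed coordinates each update at `(t, j)` reads `(t - 1, j - 2)`, `(t - 1, j - 1)`, and `(t - 1, j)`, so every value it needs lies either in an earlier block or in the same block at an earlier time step. That is why the blocked order yields the same result as the plain sweep while touching each block for several consecutive time steps. Because a block depends only on blocks before it, successive blocks could in principle be handed to different threads as pipeline stages, though this sketch runs sequentially.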


    • Published in

      PMAM '12: Proceedings of the 2012 International Workshop on Programming Models and Applications for Multicores and Manycores
      February 2012
      180 pages
      ISBN:9781450312110
      DOI:10.1145/2141702
      • Conference Chairs: Minyi Guo, Zhiyi Huang

      Copyright © 2012 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Acceptance Rates

      Overall Acceptance Rate: 53 of 97 submissions, 55%