
Analytical Modeling Is Enough for High-Performance BLIS

Published: 16 August 2016

Abstract

We show how the BLAS-like Library Instantiation Software (BLIS) framework, which provides a more detailed layering of the GotoBLAS (now maintained as OpenBLAS) implementation, allows one to analytically determine tuning parameters for high-end instantiations of matrix-matrix multiplication. This is of both practical and scientific importance, as it greatly reduces the development effort required to implement the level-3 BLAS while also advancing our understanding of how hierarchically layered memories interact with high-performance software. This allows the community to move on from valuable engineering solutions (empirical autotuning) to scientific understanding (analytical insight).



    • Published in

      ACM Transactions on Mathematical Software  Volume 43, Issue 2
      June 2017
      200 pages
      ISSN:0098-3500
      EISSN:1557-7295
      DOI:10.1145/2988256

      Copyright © 2016 ACM


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 16 August 2016
      • Revised: 1 April 2016
      • Accepted: 1 April 2016
      • Received: 1 February 2015


      Qualifiers

      • research-article
      • Research
      • Refereed
