Abstract
We show how the BLAS-like Library Instantiation Software (BLIS) framework, which provides a more detailed layering of the GotoBLAS (now maintained as OpenBLAS) implementation, allows one to analytically determine tuning parameters for high-end instantiations of matrix-matrix multiplication. This is of both practical and scientific importance: it greatly reduces the development effort required to implement the level-3 BLAS, while also advancing our understanding of how hierarchically layered memories interact with high-performance software. It allows the community to move from valuable engineering solutions (empirical autotuning) to scientific understanding (analytical insight).
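To give a concrete flavor of deriving blocking parameters analytically rather than by search, the sketch below computes Goto-style cache blocking sizes from hardware constants. It is a minimal illustration under assumptions, not the paper's actual model: the cache sizes, the micro-tile shape (MR, NR), and the "fill about half the cache" occupancy rule are all hypothetical choices for the example, whereas the paper's analysis additionally accounts for cache associativity, replacement policy, and floating-point instruction latency.

```c
#include <stdio.h>

/* Hypothetical hardware parameters for illustration only:
   a core with 32 KiB L1 data cache and 256 KiB L2, double precision. */
#define S_DATA   8             /* bytes per double                        */
#define L1_SIZE  (32 * 1024)   /* L1 data cache in bytes                  */
#define L2_SIZE  (256 * 1024)  /* L2 cache in bytes                       */
#define MR       8             /* micro-tile rows (register blocking)     */
#define NR       4             /* micro-tile columns                      */

int main(void) {
    /* Goto-style rule of thumb: a k_c x NR micro-panel of B should
       occupy roughly half of L1, leaving room for the MR x k_c
       micro-panel of A that streams past it. */
    int kc = (L1_SIZE / 2) / (NR * S_DATA);

    /* The m_c x k_c packed block of A should fill about half of L2,
       so the micro-panels of B passing through do not evict it. */
    int mc = (L2_SIZE / 2) / (kc * S_DATA);

    /* Round blocking parameters down to multiples of the micro-tile. */
    kc -= kc % 4;   /* keep k_c a multiple of a typical unroll factor */
    mc -= mc % MR;

    printf("m_r x n_r = %d x %d, k_c = %d, m_c = %d\n", MR, NR, kc, mc);
    return 0;
}
```

With these illustrative constants the rule yields k_c = 512 and m_c = 32; the point of the paper is that such estimates can be made rigorous, so that the resulting parameters match or beat empirically tuned ones without any search.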