Published in: The Journal of Supercomputing 3/2018

16.11.2017

Theoretical peak FLOPS per instruction set: a tutorial

Author: Romain Dolbeau

Abstract

Traditionally, evaluating the theoretical peak performance of a CPU in FLOPS (floating-point operations per second) was merely a matter of multiplying the frequency by the number of floating-point instructions per cycle. Today, however, CPUs have features such as vectorization, fused multiply-add, hyperthreading, and “turbo” mode. In this tutorial, we look into this theoretical peak for recent fully featured Intel CPUs and other hardware, taking into account not only the simple absolute peak, but also the relevant instruction sets, encoding and the frequency scaling behaviour of modern hardware.
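As a rough illustration of how vectorization and fused multiply-add enter the calculation, the sketch below multiplies cores, frequency, SIMD lanes, operations per lane (2 when a fused multiply-add is counted as a multiply plus an add), and the number of vector pipelines per core. The function names and the 20-core, 2.0 GHz, two-pipeline example are hypothetical and not taken from the tutorial's tables. The frequency is simply an input here; choosing the frequency actually sustained for a given instruction set is the frequency-scaling subtlety the abstract refers to.

```python
"""Sketch: theoretical peak FLOPS from per-core, per-cycle throughput.

All hardware parameters used below are illustrative assumptions.
"""


def flops_per_cycle_per_core(vector_bits: int, element_bits: int,
                             vector_pipes: int, fma: bool = True) -> int:
    """Floating-point operations one core can complete per cycle.

    vector_bits  -- SIMD register width (128 for SSE, 256 for AVX, 512 for AVX-512)
    element_bits -- 32 for single precision, 64 for double precision
    vector_pipes -- pipelines that can each start one vector instruction per cycle
    fma          -- if True, each lane counts 2 operations (multiply + add)
    """
    lanes = vector_bits // element_bits
    ops_per_lane = 2 if fma else 1
    return lanes * ops_per_lane * vector_pipes


def peak_gflops(cores: int, ghz: float, vector_bits: int, element_bits: int,
                vector_pipes: int, fma: bool = True) -> float:
    """Peak GFLOPS = cores x frequency (GHz) x FLOPs per cycle per core.

    Hyperthreading and turbo behaviour are not modelled: `ghz` must be the
    frequency the caller assumes is sustained for this instruction set.
    """
    return cores * ghz * flops_per_cycle_per_core(
        vector_bits, element_bits, vector_pipes, fma)


if __name__ == "__main__":
    # Hypothetical 20-core part with two 512-bit FMA pipelines per core,
    # assumed to sustain 2.0 GHz while running 512-bit vector code.
    print(flops_per_cycle_per_core(512, 64, 2))  # 32 double-precision FLOPs/cycle/core
    print(peak_gflops(20, 2.0, 512, 64, 2))      # 1280.0
```

Keeping the per-cycle figure separate from the frequency mirrors the abstract's point that the same core can have a different peak depending on which instruction set is used and how far the clock drops while it runs.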

Footnotes
1
Most processors older than Nehalem that support SSE2 would also fall into the same category. Strictly speaking, SSE only supports single-precision floating-point operations, and SSE2 adds double precision. Processors without SSE2 have to rely on x87 for double-precision arithmetic and are not considered. In the remainder of this tutorial, the term SSE will be used to describe the SSE and SSE2 combination, since both are mandatory on all x86-64 processors.
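To make the single- versus double-precision distinction concrete, the short sketch below counts how many elements fit in one vector register. "SSE" is taken to mean the SSE and SSE2 combination as defined in this footnote; the wider AVX (256-bit) and AVX-512 (512-bit) registers are listed only for comparison. The dictionary and function names are illustrative.

```python
# Register width in bits for x86 SIMD instruction sets.
# "SSE" means the SSE + SSE2 combination, so both precisions are available.
VECTOR_BITS = {"SSE": 128, "AVX": 256, "AVX-512": 512}
ELEMENT_BITS = {"single": 32, "double": 64}


def lanes(isa: str, precision: str) -> int:
    """Number of floating-point elements packed in one vector register."""
    return VECTOR_BITS[isa] // ELEMENT_BITS[precision]


for isa in VECTOR_BITS:
    print(isa, lanes(isa, "single"), lanes(isa, "double"))
# SSE 4 2
# AVX 8 4
# AVX-512 16 8
```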
 
2
i.e. -march=native -mtune=native.
 
3
Core counts not shown have the same limits as the next higher count that is shown; e.g., using 18 cores has the same limits as using 20 cores.
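The lookup rule in this footnote amounts to rounding the active core count up to the next listed entry. The sketch below assumes a table of per-core-count frequency limits; the listed core counts and GHz values are hypothetical placeholders, not figures from the tutorial or from any vendor specification.

```python
import bisect

# Hypothetical table: listed active-core counts and their frequency limits (GHz).
LISTED_CORES = [2, 4, 8, 12, 16, 20]
LIMIT_GHZ = [3.0, 2.8, 2.6, 2.4, 2.2, 2.0]


def frequency_limit(active_cores: int) -> float:
    """Return the limit for a core count, falling back to the next higher
    listed count when the exact count is not shown (e.g. 18 -> 20)."""
    if not 1 <= active_cores <= LISTED_CORES[-1]:
        raise ValueError("core count outside the table")
    # bisect_left finds the first listed count >= active_cores.
    i = bisect.bisect_left(LISTED_CORES, active_cores)
    return LIMIT_GHZ[i]


print(frequency_limit(18))  # 2.0, same as for 20 cores
print(frequency_limit(16))  # 2.2, an exact table entry
```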
 
4
Beware that consumer-grade GPUs may have degraded double-precision performance compared to their compute-oriented siblings; this is documented in the footnotes of the aforementioned table.
 
5
The now-obsolete Tesla micro-architecture (Compute Capability 1.x) also supported an extra multiplication-only single-precision pipeline, but we only consider Fermi and newer (Compute Capability 2.x and higher) GPUs here.
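Under the scope set by these two footnotes (Fermi and newer GPUs, with no extra multiply-only pipeline), a single-precision peak can be sketched as FP32 lanes × 2 (one fused multiply-add per cycle) × clock, and a double-precision peak as that value scaled by the product's FP64-to-FP32 throughput ratio. The function name, the 2048-lane count, the 1.5 GHz clock and the two ratios below are hypothetical; real ratios must be read from the vendor's throughput table, which is where the consumer-grade degradation is documented.

```python
from typing import Tuple


def gpu_peak_gflops(fp32_lanes: int, ghz: float,
                    fp64_ratio: float) -> Tuple[float, float]:
    """Return (single-precision, double-precision) peak GFLOPS.

    fp32_lanes -- FP32 units that can each start one FMA per cycle
    fp64_ratio -- FP64 throughput relative to FP32 for this product
    """
    sp = fp32_lanes * 2 * ghz  # one FMA counts as 2 FLOPs per lane per cycle
    dp = sp * fp64_ratio
    return sp, dp


# Hypothetical consumer-grade part with a low FP64 ratio...
print(gpu_peak_gflops(2048, 1.5, 1 / 32))  # (6144.0, 192.0)
# ...and a hypothetical compute-oriented sibling with the same FP32 peak.
print(gpu_peak_gflops(2048, 1.5, 0.5))     # (6144.0, 3072.0)
```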
 
Metadata
Title
Theoretical peak FLOPS per instruction set: a tutorial
Author
Romain Dolbeau
Publication date
16.11.2017
Published in
The Journal of Supercomputing / Issue 3/2018
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-017-2177-5
