The Journal of Supercomputing

Published: 16 November 2017

Theoretical peak FLOPS per instruction set: a tutorial

Author: Romain Dolbeau

Published in: The Journal of Supercomputing | Issue 3/2018

Abstract

Traditionally, evaluating the theoretical peak performance of a CPU in FLOPS (floating-point operations per second) was merely a matter of multiplying the frequency by the number of floating-point instructions per cycle. Today, however, CPUs have features such as vectorization, fused multiply-add, hyperthreading, and “turbo” mode. In this tutorial, we look into this theoretical peak for recent fully featured Intel CPUs and other hardware, taking into account not only the simple absolute peak, but also the relevant instruction sets, their encoding, and the frequency scaling behaviour of modern hardware.
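As a rough illustration of the multiplication described in the abstract, the sketch below extends the traditional "frequency times instructions per cycle" estimate with vector width and fused multiply-add. The function name `peak_flops`, its parameters, and the example CPU are hypothetical and are not taken from the article. The model also assumes a single fixed frequency, so it ignores the turbo and frequency scaling effects that the tutorial discusses.

```python
def peak_flops(cores, freq_hz, vector_bits, element_bits, fma, units_per_core):
    """Theoretical peak in FLOP/s under a simple multiplicative model.

    All parameters are illustrative assumptions, not values from the article.
    """
    # Number of floating-point elements that fit in one vector register.
    lanes = vector_bits // element_bits
    # A fused multiply-add is counted as two operations (one multiply, one add).
    flops_per_instruction = lanes * (2 if fma else 1)
    # One vector instruction per floating-point unit per cycle is assumed.
    return cores * freq_hz * units_per_core * flops_per_instruction


# Hypothetical CPU: 4 cores at 2.0 GHz, 256-bit vectors, double precision,
# FMA support, and 2 floating-point units per core.
example = peak_flops(cores=4, freq_hz=2.0e9, vector_bits=256,
                     element_bits=64, fma=True, units_per_core=2)
print(f"{example / 1e9:.1f} GFLOP/s")  # prints "128.0 GFLOP/s"
```

Setting `vector_bits` equal to `element_bits`, `fma=False`, and `units_per_core=1` reduces the model to the traditional scalar estimate of one operation per cycle per core.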


Most processors older than Nehalem that support SSE2 would fall into the same category. Strictly speaking, SSE only supports single-precision floating-point operations, and SSE2 adds double precision. Processors without SSE2 have to rely on x87 for double-precision arithmetic and are not considered. In the remainder of this tutorial, the term SSE will be used to describe the combination of SSE and SSE2, since both are mandatory on all x86-64 processors.

i.e. -march=native -mtune=native.

Numbers not shown are the same as the next shown number; e.g. using 18 cores has the same limits as using 20 cores.

Beware that consumer-grade GPUs might have degraded double-precision performance compared to their compute-oriented siblings; this is documented in footnotes of the aforementioned table.

The now-obsolete Tesla micro-architecture (Compute Capability 1.x) also supported an extra multiplication-only single-precision pipeline, but we only consider Fermi and newer (Compute Capability 2.x and higher) GPUs here.


Title: Theoretical peak FLOPS per instruction set: a tutorial
Author: Romain Dolbeau
Publication date: 16 November 2017
Publisher: Springer New York
Published in: The Journal of Supercomputing / Issue 3/2018
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI: https://doi.org/10.1007/s11227-017-2177-5
