
2013 | OriginalPaper | Book chapter

3. Efficiency, Energy Efficiency and Programming of Accelerated HPC Servers: Highlights of PRACE Studies

Author: Lennart Johnsson

Published in: GPU Solutions to Multi-scale Problems in Science and Engineering

Publisher: Springer Berlin Heidelberg


Abstract

Over the last few years, the architectural convergence of High-Performance Computing (HPC) systems that had lasted for more than a decade has given way to divergence. The divergence is driven by the quest for performance, cost-performance and, in the last few years, also energy consumption, whose cost over the lifetime of a system has in many cases come to exceed the cost of the HPC system itself. Mass-market specialized processors, such as the Cell Broadband Engine (CBE) and graphics processors, have received particular attention, the latter especially after hardware support for double-precision floating-point arithmetic was introduced about three years ago. The recent support of Error Correcting Code (ECC) for memory and the significantly enhanced double-precision performance in the current generation of Graphics Processing Units (GPUs) have further solidified the interest in GPUs for HPC. To assess the issues involved in potentially deploying clusters whose nodes combine commodity microprocessors with some type of specialized processor for enhanced performance, enhanced energy efficiency, or both, for science and engineering workloads, PRACE, the Partnership for Advanced Computing in Europe, undertook a study that included three types of accelerators (the CBE, GPUs and ClearSpeed) and tools for their programming. The study focused on assessing performance, efficiency, power efficiency for double-precision arithmetic, and programmer productivity. Four kernels (matrix multiplication, sparse matrix-vector multiplication, FFT and random number generation) were used for the assessment, together with High-Performance Linpack (HPL) and a few application codes. We report here on the results from the kernels and HPL for GPU and ClearSpeed accelerated systems. On sparse matrix-vector multiplication, the GPU performed surprisingly well, significantly better than the CPU, whereas the ClearSpeed performed surprisingly poorly. For matrix multiplication, HPL and FFT, the ClearSpeed accelerator was by far the most energy-efficient device.




Title: Efficiency, Energy Efficiency and Programming of Accelerated HPC Servers: Highlights of PRACE Studies
Author: Lennart Johnsson
Publisher: Springer Berlin Heidelberg
Book: GPU Solutions to Multi-scale Problems in Science and Engineering
Print ISBN: 978-3-642-16404-0
Electronic ISBN: 978-3-642-16405-7
Copyright year: 2013
DOI: https://doi.org/10.1007/978-3-642-16405-7_3
