International Journal of Parallel Programming | 01-12-2015

Extending Summation Precision for Network Reduction Operations

Authors: George Michelogiannakis, Xiaoye S. Li, David H. Bailey, John Shalf

Published in: International Journal of Parallel Programming | Issue 6/2015

Abstract

Double-precision summation is at the core of numerous important algorithms, such as Newton–Krylov methods, and of other operations involving inner products, such as matrix multiplication and dot products. However, the effectiveness of summation is limited by the accumulation of rounding errors due to compressed representations, a growing problem as modern HPC systems and data sets scale to summations with millions or billions of operands. To reduce the impact of precision loss, researchers have proposed increased- and arbitrary-precision libraries that provide reproducible error or even bounded error accumulation for large sums. However, such libraries increase computation and communication time significantly and do not always guarantee an exact result. In this article, we propose fixed-point representations of double-precision variables that enable arbitrarily large summations without error and provide exact and reproducible results. We call this format big integer (BigInt). Even though such formats have been studied for local processor computations, we make the case that using a fixed-point representation for distributed computation over a system-wide network is feasible, with performance comparable to that of double-precision floating-point summation. This is made possible by adding simple and inexpensive logic to modern NICs, or by using the programmable logic found in many modern NICs, which accelerates performance on large-scale systems and avoids waking up processors.
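
The idea of summing doubles through a fixed-point image can be illustrated with a minimal Python sketch. It is not the BigInt format or the NIC logic from the article, whose details are not given on this page. It assumes a single wide integer scaled by 2^1074 (enough to hold any finite double exactly) and uses Python's arbitrary-precision `int` as a stand-in for a fixed-width hardware accumulator. The names `SCALE_BITS`, `to_fixed`, `from_fixed`, and `exact_sum` are hypothetical.

```python
import math

# Assumption: 2**-1074 is the smallest positive (subnormal) double, so scaling
# by 2**1074 turns every finite double into an exact integer.
SCALE_BITS = 1074


def to_fixed(x: float) -> int:
    """Return the exact fixed-point image of a finite double."""
    if not math.isfinite(x):
        raise ValueError("only finite doubles have a fixed-point image")
    num, den = x.as_integer_ratio()  # den is a power of two, at most 2**1074
    return num * ((1 << SCALE_BITS) // den)


def from_fixed(v: int) -> float:
    """Round a fixed-point value back to the nearest double (one rounding)."""
    # Integer true division in CPython is correctly rounded; it raises
    # OverflowError if the result is outside the double range.
    return v / (1 << SCALE_BITS)


def exact_sum(values) -> float:
    """Sum doubles with integer additions only, rounding once at the end."""
    total = 0
    for x in values:
        total += to_fixed(x)  # integer addition is exact and associative
    return from_fixed(total)


if __name__ == "__main__":
    data = [1e16, 1.0, -1e16, 1.0]
    print(sum(data))                    # order-dependent floating-point sum
    print(exact_sum(data))              # exact sum, rounded once
    print(exact_sum(reversed(data)))    # same result in any order
    # Partial sums, as separate nodes of a reduction tree might produce,
    # combine without error because they stay in integer form.
    left = sum(to_fixed(x) for x in data[:2])
    right = sum(to_fixed(x) for x in data[2:])
    print(from_fixed(left + right))
```

Because integer addition is associative, the result does not depend on the order in which operands or partial sums are combined, which is the property a network reduction needs for reproducibility. The cost in this sketch is a much wider operand than a 64-bit double.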

Title: Extending Summation Precision for Network Reduction Operations
Authors: George Michelogiannakis, Xiaoye S. Li, David H. Bailey, John Shalf
Publication date: 01-12-2015
Publisher: Springer US
Published in: International Journal of Parallel Programming / Issue 6/2015
Print ISSN: 0885-7458
Electronic ISSN: 1573-7640
DOI: https://doi.org/10.1007/s10766-014-0326-5
