Top

Published in:

2019 | OriginalPaper | Chapter

8. Parallel FFT Algorithms for Distributed-Memory Parallel Computers

Author : Daisuke Takahashi

Published in: Fast Fourier Transform Algorithms for Parallel Computers

Publisher: Springer Singapore

Activate our intelligent search to find suitable subject content or patents.

search-config

AI-assisted search

Off

Abstract

This chapter presents parallel FFT algorithms for distributed-memory parallel computers. First, implementation of parallel FFTs in distributed-memory parallel computers and computation–communication overlap for parallel one-dimensional FFT are explained. Next, parallel three-dimensional FFT using two-dimensional decomposition is described. Then, the optimization of all-to-all communication on multicore cluster systems is explained. Finally, parallel one-dimensional FFT in a GPU cluster is described.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

inform now

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

inform now

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

inform now

previous chapter Parallel FFT Algorithms for Shared-Memory Parallel Computers

2DECOMP & FFT—Library for 2D Pencil Decomposition and Distributed FFTs. http://www.2decomp.org/

AccFFT. http://accfft.org/

FFTE: A Fast Fourier Transform Package. http://www.ffte.jp/

FFTW Home Page. http://www.fftw.org/

MPI: A Message-Passing Interface Standard. http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf

MVAPICH: MPI over InfiniBand and iWARP. http://mvapich.cse.ohio-state.edu/

P3DFFT Home Page: Scalable Spectral transforms in 3D - P3DFFT. https://www.p3dfft.net/

TOP500 Supercomputer Sites. https://www.top500.org/

Intel math kernel library developer reference. https://software.intel.com/sites/default/files/managed/ff/c8/mkl-2017-developer-reference-c_0.pdf (2017)

10.

CUDA Fortran Programming Guide and Reference. https://www.pgroup.com/resources/docs/19.1/pdf/pgi19cudaforug.pdf (2019)

11.

CUFFT Library User’s Guide. https://docs.nvidia.com/cuda/pdf/CUFFT_Library.pdf (2019)

12.

Agarwal, R.C., Gustavson, F.G., Zubair, M.: A high performance parallel algorithm for 1-D FFT. In: Proceedings of Supercomputing ’94, pp. 34–40 (1994)

13.

Almási, G., Archer, C.J., Erway, C.C., Heidelberger, P., Martorell, X., Moreira, J.E., Steinmacher-Burow, B., Zheng, Y.: Optimization of MPI collective communication on BlueGene/L systems. In: Proceedings of 19th ACM International Conference on Supercomputing (ICS’05), pp. 253–262 (2005)

14.

Ayala, O., Wang, L.P.: Parallel implementation and scalability analysis of 3D fast Fourier transform using 2D domain decomposition. Parallel Comput. 39, 58–77 (2013)MathSciNetCrossRef

15.

Bonelli, A., Franchetti, F., Lorenz, J., Püschel, M., Ueberhuber, C.W.: Automatic performance optimization of the discrete Fourier transform on distributed memory computers. In: Proceedings of 4th International Symposium on Parallel and Distributed Processing and Applications (ISPA 2006). Lecture Notes in Computer Science, vol. 4330, pp. 818–832. Springer-Verlag, Berlin (2006)

16.

Brass, A., Pawley, G.S.: Two and three dimensional FFTs on highly parallel computers. Parallel Comput. 3, 167–184 (1986)MathSciNetCrossRef

17.

Calvin, C.: Implementation of parallel FFT algorithms on distributed memory machines with a minimum overhead of communication. Parallel Comput. 22, 1255–1279 (1986)MathSciNetCrossRef

18.

Chen, Y., Cui, X., Mei, H.: Large-scale FFT on GPU clusters. In: Proceedings of 24th International Conference on Supercomputing (ICS’10) , pp. 315–324 (2010)

19.

Doi, J., Negishi, Y.: Overlapping methods of all-to-all communication and FFT algorithms for torus-connected massively parallel supercomputers. In: Proceedings of 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC’10) (2010)

20.

Edelman, A., McCorquodale, P., Toledo, S.: The future fast Fourier transform? SIAM J. Sci. Comput. 20, 1094–1114 (1999)MathSciNetCrossRef

21.

Eleftheriou, M., Fitch, B.G., Rayshubskiy, A., Ward, T.J.C., Germain, R.S.: Scalable framework for 3D FFTs on the Blue Gene/L supercomputer: implementation and early performance measurements. IBM J. Res. Dev. 49, 457–464 (2005)CrossRef

22.

Fang, B., Deng, Y., Martyna, G.: Performance of the 3D FFT on the 6D network torus QCDOC parallel supercomputer. Comput. Phys. Commun. 176, 531–538 (2007)CrossRef

23.

Faraj, A., Kumar, S., Smith, B., Mamidala, A., Gunnels, J.: MPI collective communications on the Blue Gene/P supercomputer: algorithms and optimizations. In: Proceedings of 17th IEEE Symposium on High Performance Interconnects (HOTI’09), pp. 63–72 (2009)

24.

Faraj, A., Yuan, X.: Automatic generation and tuning of MPI collective communication routines. In: Proceedings of 19th ACM International Conference on Supercomputing (ICS’05), pp. 393–402 (2005)

25.

Faraj, A., Yuan, X., Lowenthal, D.: STAR-MPI: self tuned adaptive routines for MPI collective operations. In: Proceedings of 20th ACM International Conference on Supercomputing (ICS’06), pp. 199–208 (2006)

26.

Franchetti, F., Low, T.M., Popovici, D.T., Veras, R.M., Spampinato, D.G., Johnson, J.R., Püschel, M., Hoe, J.C., Moura, J.M.F.: SPIRAL: extreme performance portability. Proc. IEEE 106, 1935–1968 (2018)CrossRef

27.

Frigo, M., Johnson, S.G.: The design and implementation of FFTW3. Proc. IEEE 93, 216–231 (2005)CrossRef

28.

Idomura, Y., Nakata, M., Yamada, S., Machida, M., Imamura, T., Watanabe, T., Nunami, M., Inoue, H., Tsutsumi, S., Miyoshi, I., Shida, N.: Communication-overlap techniques for improved strong scaling of gyrokinetic eulerian code beyond 100k cores on the K-computer. Int. J. High Perform. Comput. Appl. 28, 73–86 (2014)CrossRef

29.

Johnsson, S.L., Krawitz, R.L.: Cooley-Tukey FFT on the Connection Machine. Parallel Comput. 18, 1201–1221 (1992)

30.

Kumar, R., Mamidala, A., Panda, D.K.: Scaling alltoall collective on multi-core systems. In: Proceedings of 2008 IEEE International Parallel and Distributed Processing Symposium (IPDPS 2008) (2008)

31.

Kumar, S., Sabharwal, Y., Garg, R., Heidelberger, P.: Optimization of all-to-all communication on the Blue Gene/L supercomputer. In: Proceedings of 37th International Conference on Parallel Processing (ICPP 2008), pp. 320–329 (2008)

32.

Maeyama, S., Watanabe, T., Idomura, Y., Nakata, M., Nunami, M., Ishizawa, A.: Improved strong scaling of a spectral/finite difference gyrokinetic code for multi-scale plasma turbulence. Parallel Comput. 49, 1–12 (2015)MathSciNetCrossRef

33.

Mirković, D., Johnsson, S.L.: Automatic performance tuning in the UHFFT library. In: Proceedings of 2001 International Conference on Computational Science (ICCS 2001). Lecture Notes in Computer Science, vol. 2073, pp. 71–80. Springer, Berlin (2001)

34.

Nukada, A., Matsuoka, S.: Auto-tuning 3-D FFT library for CUDA GPUs. In: Proceedings of Conference on High Performance Computing Networking, Storage and Analysis (SC’09) (2009)

35.

Nukada, A., Sato, K., Matsuoka, S.: Scalable multi-GPU 3-D FFT for TSUBAME 2.0 supercomputer. In: Proceedings of International Conference on High Performance Computing, Networking, Storage and Analysis (SC’12) (2012)

36.

Park, J., Bikshandi, G., Vaidyanathan, K., Tang, P.T.P., Dubey, P., Kim, D.: Tera-scale 1D FFT with low-communication algorithm and Intel Xeon Phi coprocessors. In: Proceedings of International Conference on High Performance Computing, Networking, Storage and Analysis (SC’13) (2013)

37.

Pekurovsky, D.: P3DFFT: a framework for parallel computations of Fourier transforms in three dimensions. SIAM J. Sci. Comput. 34, C192–C209 (2012)MathSciNetCrossRef

38.

Popovici, D.T., Low, T.M., Franchetti, F.: Large bandwidth-efficient FFTs on multicore and multi-socket systems. In: Proceedings of 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS 2018), pp. 379–388 (2018)

39.

Püschel, M., Moura, J.M.F., Johnson, J.R., Padua, D., Veloso, M.M., Singer, B.W., Xiong, J., Franchetti, F., Gačić, A., Voronenko, Y., Chen, K., Johnson, R.W., Rizzolo, N.: SPIRAL: code generation for DSP transforms. Proc. IEEE 93, 232–275 (2005)CrossRef

40.

Ruetsch, G., Micikevicius, P.: Optimizing matrix transpose in CUDA.http://www.cs.colostate.edu/~,cs675/MatrixTranspose.pdf (2009)

41.

Sanders, P., Träff, J.L.: The hierarchical factor algorithm for all-to-all communication. In: Proceedings of 8th International Euro-Par Conference (Euro-Par 2002). Lecture Notes in Computer Science, vol. 2400, pp. 799–803. Springer, Berlin (2002)

42.

Song, S., Hollingsworth, J.K.: Computation-communication overlap and parameter auto-tuning for scalable parallel 3-D FFT. Parallel Comput. 14, 38–50 (2016)MathSciNet

43.

Takahashi, D.: Efficient implementation of parallel three-dimensional FFT on clusters of PCs. Comput. Phys. Commun. 152, 144–150 (2003)CrossRef

44.

Takahashi, D.: A parallel 1-D FFT algorithm for the Hitachi SR8000. Parallel Comput. 29, 679–690 (2003)MathSciNetCrossRef

45.

Takahashi, D.: Automatic Tuning for Parallel FFTs, pp. 49–67. Springer, New York (2010)

46.

Takahashi, D.: An implementation of parallel 3-D FFT with 2-D decomposition on a massively parallel cluster of multi-core processors. In: Proceedings of 8th International Conference on Parallel Processing and Applied Mathematics (PPAM 2009), Part I, Workshop on Memory Issues on Multi- and Manycore Platforms. Lecture Notes in Computer Science, vol. 6067, pp. 606–614. Springer, Berlin (2010)

47.

Takahashi, D.: Implementation of parallel 1-D FFT on GPU clusters. In: Proceedings of 2013 IEEE 16th International Conference on Computational Science and Engineering (CSE 2013), pp. 174–180 (2013)

48.

Takahashi, D.: Automatic tuning of computation-communication overlap for parallel 1-D FFT. In: Proceedings of 2016 IEEE 19th International Conference on Computational Science and Engineering (CSE 2016), pp. 253–256 (2016)

49.

Takahashi, D., Boku, T., Sato, M.: A blocking algorithm for parallel 1-D FFT on clusters of PCs. In: Proceedings of 8th International Euro-Par Conference (Euro-Par 2002). Lecture Notes in Computer Science, vol. 2400, pp. 691–700. Springer, Berlin (2002)

50.

Takahashi, D., Kanada, Y.: High-performance radix-2, 3 and 5 parallel 1-D complex FFT algorithms for distributed-memory parallel computers. J. Supercomput. 15, 207–228 (2000)CrossRef

51.

Takahashi, D., Uno, A., Yokokawa, M.: An implementation of parallel 1-D FFT on the K computer. In: Proceedings of 2012 IEEE 14th International Conference on High Performance Computing and Communications (HPCC-2012), pp. 344–350 (2012)

52.

Tang, P.T.P., Park, J., Kim, D., Petrov, V.: A framework for low-communication 1-D FFT. In: Proceedings of International Conference on High Performance Computing, Networking, Storage and Analysis (SC’12) (2012)

53.

Vadhiyar, S.S., Fagg, G.E., Dongarra, J.: Automatically tuned collective communications. In: Proceedings of 2000 ACM/IEEE Conference on Supercomputing (SC’00) (2000)

54.

Wang, H., Potluri, S., Luo, M., Singh, A.K., Sur, S., Panda, D.K.: MVAPICH2-GPU: optimized GPU to GPU communication for InfiniBand clusters. Comput. Sci. Res. Dev. 26, 257–266 (2011)CrossRef

Title: Parallel FFT Algorithms for Distributed-Memory Parallel Computers
Author: Daisuke Takahashi
Publisher: Springer Singapore
Book: Fast Fourier Transform Algorithms for Parallel Computers
Print ISBN: 978-981-13-9964-0

Electronic ISBN: 978-981-13-9965-7

Copyright Year: 2019
DOI: https://doi.org/10.1007/978-981-13-9965-7_8

Springer Professional

Abstract

Please log in to get access to your license.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Springer Professional "Technik"

Springer Professional "Wirtschaft"

Premium Partner