Skip to main content
Top

2019 | OriginalPaper | Chapter

8. Parallel FFT Algorithms for Distributed-Memory Parallel Computers

Author : Daisuke Takahashi

Published in: Fast Fourier Transform Algorithms for Parallel Computers

Publisher: Springer Singapore

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

This chapter presents parallel FFT algorithms for distributed-memory parallel computers. First, implementation of parallel FFTs in distributed-memory parallel computers and computation–communication overlap for parallel one-dimensional FFT are explained. Next, parallel three-dimensional FFT using two-dimensional decomposition is described. Then, the optimization of all-to-all communication on multicore cluster systems is explained. Finally, parallel one-dimensional FFT in a GPU cluster is described.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literature
12.
go back to reference Agarwal, R.C., Gustavson, F.G., Zubair, M.: A high performance parallel algorithm for 1-D FFT. In: Proceedings of Supercomputing ’94, pp. 34–40 (1994) Agarwal, R.C., Gustavson, F.G., Zubair, M.: A high performance parallel algorithm for 1-D FFT. In: Proceedings of Supercomputing ’94, pp. 34–40 (1994)
13.
go back to reference Almási, G., Archer, C.J., Erway, C.C., Heidelberger, P., Martorell, X., Moreira, J.E., Steinmacher-Burow, B., Zheng, Y.: Optimization of MPI collective communication on BlueGene/L systems. In: Proceedings of 19th ACM International Conference on Supercomputing (ICS’05), pp. 253–262 (2005) Almási, G., Archer, C.J., Erway, C.C., Heidelberger, P., Martorell, X., Moreira, J.E., Steinmacher-Burow, B., Zheng, Y.: Optimization of MPI collective communication on BlueGene/L systems. In: Proceedings of 19th ACM International Conference on Supercomputing (ICS’05), pp. 253–262 (2005)
14.
go back to reference Ayala, O., Wang, L.P.: Parallel implementation and scalability analysis of 3D fast Fourier transform using 2D domain decomposition. Parallel Comput. 39, 58–77 (2013)MathSciNetCrossRef Ayala, O., Wang, L.P.: Parallel implementation and scalability analysis of 3D fast Fourier transform using 2D domain decomposition. Parallel Comput. 39, 58–77 (2013)MathSciNetCrossRef
15.
go back to reference Bonelli, A., Franchetti, F., Lorenz, J., Püschel, M., Ueberhuber, C.W.: Automatic performance optimization of the discrete Fourier transform on distributed memory computers. In: Proceedings of 4th International Symposium on Parallel and Distributed Processing and Applications (ISPA 2006). Lecture Notes in Computer Science, vol. 4330, pp. 818–832. Springer-Verlag, Berlin (2006) Bonelli, A., Franchetti, F., Lorenz, J., Püschel, M., Ueberhuber, C.W.: Automatic performance optimization of the discrete Fourier transform on distributed memory computers. In: Proceedings of 4th International Symposium on Parallel and Distributed Processing and Applications (ISPA 2006). Lecture Notes in Computer Science, vol. 4330, pp. 818–832. Springer-Verlag, Berlin (2006)
16.
go back to reference Brass, A., Pawley, G.S.: Two and three dimensional FFTs on highly parallel computers. Parallel Comput. 3, 167–184 (1986)MathSciNetCrossRef Brass, A., Pawley, G.S.: Two and three dimensional FFTs on highly parallel computers. Parallel Comput. 3, 167–184 (1986)MathSciNetCrossRef
17.
go back to reference Calvin, C.: Implementation of parallel FFT algorithms on distributed memory machines with a minimum overhead of communication. Parallel Comput. 22, 1255–1279 (1986)MathSciNetCrossRef Calvin, C.: Implementation of parallel FFT algorithms on distributed memory machines with a minimum overhead of communication. Parallel Comput. 22, 1255–1279 (1986)MathSciNetCrossRef
18.
go back to reference Chen, Y., Cui, X., Mei, H.: Large-scale FFT on GPU clusters. In: Proceedings of 24th International Conference on Supercomputing (ICS’10) , pp. 315–324 (2010) Chen, Y., Cui, X., Mei, H.: Large-scale FFT on GPU clusters. In: Proceedings of 24th International Conference on Supercomputing (ICS’10) , pp. 315–324 (2010)
19.
go back to reference Doi, J., Negishi, Y.: Overlapping methods of all-to-all communication and FFT algorithms for torus-connected massively parallel supercomputers. In: Proceedings of 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC’10) (2010) Doi, J., Negishi, Y.: Overlapping methods of all-to-all communication and FFT algorithms for torus-connected massively parallel supercomputers. In: Proceedings of 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC’10) (2010)
20.
go back to reference Edelman, A., McCorquodale, P., Toledo, S.: The future fast Fourier transform? SIAM J. Sci. Comput. 20, 1094–1114 (1999)MathSciNetCrossRef Edelman, A., McCorquodale, P., Toledo, S.: The future fast Fourier transform? SIAM J. Sci. Comput. 20, 1094–1114 (1999)MathSciNetCrossRef
21.
go back to reference Eleftheriou, M., Fitch, B.G., Rayshubskiy, A., Ward, T.J.C., Germain, R.S.: Scalable framework for 3D FFTs on the Blue Gene/L supercomputer: implementation and early performance measurements. IBM J. Res. Dev. 49, 457–464 (2005)CrossRef Eleftheriou, M., Fitch, B.G., Rayshubskiy, A., Ward, T.J.C., Germain, R.S.: Scalable framework for 3D FFTs on the Blue Gene/L supercomputer: implementation and early performance measurements. IBM J. Res. Dev. 49, 457–464 (2005)CrossRef
22.
go back to reference Fang, B., Deng, Y., Martyna, G.: Performance of the 3D FFT on the 6D network torus QCDOC parallel supercomputer. Comput. Phys. Commun. 176, 531–538 (2007)CrossRef Fang, B., Deng, Y., Martyna, G.: Performance of the 3D FFT on the 6D network torus QCDOC parallel supercomputer. Comput. Phys. Commun. 176, 531–538 (2007)CrossRef
23.
go back to reference Faraj, A., Kumar, S., Smith, B., Mamidala, A., Gunnels, J.: MPI collective communications on the Blue Gene/P supercomputer: algorithms and optimizations. In: Proceedings of 17th IEEE Symposium on High Performance Interconnects (HOTI’09), pp. 63–72 (2009) Faraj, A., Kumar, S., Smith, B., Mamidala, A., Gunnels, J.: MPI collective communications on the Blue Gene/P supercomputer: algorithms and optimizations. In: Proceedings of 17th IEEE Symposium on High Performance Interconnects (HOTI’09), pp. 63–72 (2009)
24.
go back to reference Faraj, A., Yuan, X.: Automatic generation and tuning of MPI collective communication routines. In: Proceedings of 19th ACM International Conference on Supercomputing (ICS’05), pp. 393–402 (2005) Faraj, A., Yuan, X.: Automatic generation and tuning of MPI collective communication routines. In: Proceedings of 19th ACM International Conference on Supercomputing (ICS’05), pp. 393–402 (2005)
25.
go back to reference Faraj, A., Yuan, X., Lowenthal, D.: STAR-MPI: self tuned adaptive routines for MPI collective operations. In: Proceedings of 20th ACM International Conference on Supercomputing (ICS’06), pp. 199–208 (2006) Faraj, A., Yuan, X., Lowenthal, D.: STAR-MPI: self tuned adaptive routines for MPI collective operations. In: Proceedings of 20th ACM International Conference on Supercomputing (ICS’06), pp. 199–208 (2006)
26.
go back to reference Franchetti, F., Low, T.M., Popovici, D.T., Veras, R.M., Spampinato, D.G., Johnson, J.R., Püschel, M., Hoe, J.C., Moura, J.M.F.: SPIRAL: extreme performance portability. Proc. IEEE 106, 1935–1968 (2018)CrossRef Franchetti, F., Low, T.M., Popovici, D.T., Veras, R.M., Spampinato, D.G., Johnson, J.R., Püschel, M., Hoe, J.C., Moura, J.M.F.: SPIRAL: extreme performance portability. Proc. IEEE 106, 1935–1968 (2018)CrossRef
27.
go back to reference Frigo, M., Johnson, S.G.: The design and implementation of FFTW3. Proc. IEEE 93, 216–231 (2005)CrossRef Frigo, M., Johnson, S.G.: The design and implementation of FFTW3. Proc. IEEE 93, 216–231 (2005)CrossRef
28.
go back to reference Idomura, Y., Nakata, M., Yamada, S., Machida, M., Imamura, T., Watanabe, T., Nunami, M., Inoue, H., Tsutsumi, S., Miyoshi, I., Shida, N.: Communication-overlap techniques for improved strong scaling of gyrokinetic eulerian code beyond 100k cores on the K-computer. Int. J. High Perform. Comput. Appl. 28, 73–86 (2014)CrossRef Idomura, Y., Nakata, M., Yamada, S., Machida, M., Imamura, T., Watanabe, T., Nunami, M., Inoue, H., Tsutsumi, S., Miyoshi, I., Shida, N.: Communication-overlap techniques for improved strong scaling of gyrokinetic eulerian code beyond 100k cores on the K-computer. Int. J. High Perform. Comput. Appl. 28, 73–86 (2014)CrossRef
29.
go back to reference Johnsson, S.L., Krawitz, R.L.: Cooley-Tukey FFT on the Connection Machine. Parallel Comput. 18, 1201–1221 (1992) Johnsson, S.L., Krawitz, R.L.: Cooley-Tukey FFT on the Connection Machine. Parallel Comput. 18, 1201–1221 (1992)
30.
go back to reference Kumar, R., Mamidala, A., Panda, D.K.: Scaling alltoall collective on multi-core systems. In: Proceedings of 2008 IEEE International Parallel and Distributed Processing Symposium (IPDPS 2008) (2008) Kumar, R., Mamidala, A., Panda, D.K.: Scaling alltoall collective on multi-core systems. In: Proceedings of 2008 IEEE International Parallel and Distributed Processing Symposium (IPDPS 2008) (2008)
31.
go back to reference Kumar, S., Sabharwal, Y., Garg, R., Heidelberger, P.: Optimization of all-to-all communication on the Blue Gene/L supercomputer. In: Proceedings of 37th International Conference on Parallel Processing (ICPP 2008), pp. 320–329 (2008) Kumar, S., Sabharwal, Y., Garg, R., Heidelberger, P.: Optimization of all-to-all communication on the Blue Gene/L supercomputer. In: Proceedings of 37th International Conference on Parallel Processing (ICPP 2008), pp. 320–329 (2008)
32.
go back to reference Maeyama, S., Watanabe, T., Idomura, Y., Nakata, M., Nunami, M., Ishizawa, A.: Improved strong scaling of a spectral/finite difference gyrokinetic code for multi-scale plasma turbulence. Parallel Comput. 49, 1–12 (2015)MathSciNetCrossRef Maeyama, S., Watanabe, T., Idomura, Y., Nakata, M., Nunami, M., Ishizawa, A.: Improved strong scaling of a spectral/finite difference gyrokinetic code for multi-scale plasma turbulence. Parallel Comput. 49, 1–12 (2015)MathSciNetCrossRef
33.
go back to reference Mirković, D., Johnsson, S.L.: Automatic performance tuning in the UHFFT library. In: Proceedings of 2001 International Conference on Computational Science (ICCS 2001). Lecture Notes in Computer Science, vol. 2073, pp. 71–80. Springer, Berlin (2001) Mirković, D., Johnsson, S.L.: Automatic performance tuning in the UHFFT library. In: Proceedings of 2001 International Conference on Computational Science (ICCS 2001). Lecture Notes in Computer Science, vol. 2073, pp. 71–80. Springer, Berlin (2001)
34.
go back to reference Nukada, A., Matsuoka, S.: Auto-tuning 3-D FFT library for CUDA GPUs. In: Proceedings of Conference on High Performance Computing Networking, Storage and Analysis (SC’09) (2009) Nukada, A., Matsuoka, S.: Auto-tuning 3-D FFT library for CUDA GPUs. In: Proceedings of Conference on High Performance Computing Networking, Storage and Analysis (SC’09) (2009)
35.
go back to reference Nukada, A., Sato, K., Matsuoka, S.: Scalable multi-GPU 3-D FFT for TSUBAME 2.0 supercomputer. In: Proceedings of International Conference on High Performance Computing, Networking, Storage and Analysis (SC’12) (2012) Nukada, A., Sato, K., Matsuoka, S.: Scalable multi-GPU 3-D FFT for TSUBAME 2.0 supercomputer. In: Proceedings of International Conference on High Performance Computing, Networking, Storage and Analysis (SC’12) (2012)
36.
go back to reference Park, J., Bikshandi, G., Vaidyanathan, K., Tang, P.T.P., Dubey, P., Kim, D.: Tera-scale 1D FFT with low-communication algorithm and Intel Xeon Phi coprocessors. In: Proceedings of International Conference on High Performance Computing, Networking, Storage and Analysis (SC’13) (2013) Park, J., Bikshandi, G., Vaidyanathan, K., Tang, P.T.P., Dubey, P., Kim, D.: Tera-scale 1D FFT with low-communication algorithm and Intel Xeon Phi coprocessors. In: Proceedings of International Conference on High Performance Computing, Networking, Storage and Analysis (SC’13) (2013)
37.
go back to reference Pekurovsky, D.: P3DFFT: a framework for parallel computations of Fourier transforms in three dimensions. SIAM J. Sci. Comput. 34, C192–C209 (2012)MathSciNetCrossRef Pekurovsky, D.: P3DFFT: a framework for parallel computations of Fourier transforms in three dimensions. SIAM J. Sci. Comput. 34, C192–C209 (2012)MathSciNetCrossRef
38.
go back to reference Popovici, D.T., Low, T.M., Franchetti, F.: Large bandwidth-efficient FFTs on multicore and multi-socket systems. In: Proceedings of 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS 2018), pp. 379–388 (2018) Popovici, D.T., Low, T.M., Franchetti, F.: Large bandwidth-efficient FFTs on multicore and multi-socket systems. In: Proceedings of 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS 2018), pp. 379–388 (2018)
39.
go back to reference Püschel, M., Moura, J.M.F., Johnson, J.R., Padua, D., Veloso, M.M., Singer, B.W., Xiong, J., Franchetti, F., Gačić, A., Voronenko, Y., Chen, K., Johnson, R.W., Rizzolo, N.: SPIRAL: code generation for DSP transforms. Proc. IEEE 93, 232–275 (2005)CrossRef Püschel, M., Moura, J.M.F., Johnson, J.R., Padua, D., Veloso, M.M., Singer, B.W., Xiong, J., Franchetti, F., Gačić, A., Voronenko, Y., Chen, K., Johnson, R.W., Rizzolo, N.: SPIRAL: code generation for DSP transforms. Proc. IEEE 93, 232–275 (2005)CrossRef
41.
go back to reference Sanders, P., Träff, J.L.: The hierarchical factor algorithm for all-to-all communication. In: Proceedings of 8th International Euro-Par Conference (Euro-Par 2002). Lecture Notes in Computer Science, vol. 2400, pp. 799–803. Springer, Berlin (2002) Sanders, P., Träff, J.L.: The hierarchical factor algorithm for all-to-all communication. In: Proceedings of 8th International Euro-Par Conference (Euro-Par 2002). Lecture Notes in Computer Science, vol. 2400, pp. 799–803. Springer, Berlin (2002)
42.
go back to reference Song, S., Hollingsworth, J.K.: Computation-communication overlap and parameter auto-tuning for scalable parallel 3-D FFT. Parallel Comput. 14, 38–50 (2016)MathSciNet Song, S., Hollingsworth, J.K.: Computation-communication overlap and parameter auto-tuning for scalable parallel 3-D FFT. Parallel Comput. 14, 38–50 (2016)MathSciNet
43.
go back to reference Takahashi, D.: Efficient implementation of parallel three-dimensional FFT on clusters of PCs. Comput. Phys. Commun. 152, 144–150 (2003)CrossRef Takahashi, D.: Efficient implementation of parallel three-dimensional FFT on clusters of PCs. Comput. Phys. Commun. 152, 144–150 (2003)CrossRef
44.
45.
go back to reference Takahashi, D.: Automatic Tuning for Parallel FFTs, pp. 49–67. Springer, New York (2010) Takahashi, D.: Automatic Tuning for Parallel FFTs, pp. 49–67. Springer, New York (2010)
46.
go back to reference Takahashi, D.: An implementation of parallel 3-D FFT with 2-D decomposition on a massively parallel cluster of multi-core processors. In: Proceedings of 8th International Conference on Parallel Processing and Applied Mathematics (PPAM 2009), Part I, Workshop on Memory Issues on Multi- and Manycore Platforms. Lecture Notes in Computer Science, vol. 6067, pp. 606–614. Springer, Berlin (2010) Takahashi, D.: An implementation of parallel 3-D FFT with 2-D decomposition on a massively parallel cluster of multi-core processors. In: Proceedings of 8th International Conference on Parallel Processing and Applied Mathematics (PPAM 2009), Part I, Workshop on Memory Issues on Multi- and Manycore Platforms. Lecture Notes in Computer Science, vol. 6067, pp. 606–614. Springer, Berlin (2010)
47.
go back to reference Takahashi, D.: Implementation of parallel 1-D FFT on GPU clusters. In: Proceedings of 2013 IEEE 16th International Conference on Computational Science and Engineering (CSE 2013), pp. 174–180 (2013) Takahashi, D.: Implementation of parallel 1-D FFT on GPU clusters. In: Proceedings of 2013 IEEE 16th International Conference on Computational Science and Engineering (CSE 2013), pp. 174–180 (2013)
48.
go back to reference Takahashi, D.: Automatic tuning of computation-communication overlap for parallel 1-D FFT. In: Proceedings of 2016 IEEE 19th International Conference on Computational Science and Engineering (CSE 2016), pp. 253–256 (2016) Takahashi, D.: Automatic tuning of computation-communication overlap for parallel 1-D FFT. In: Proceedings of 2016 IEEE 19th International Conference on Computational Science and Engineering (CSE 2016), pp. 253–256 (2016)
49.
go back to reference Takahashi, D., Boku, T., Sato, M.: A blocking algorithm for parallel 1-D FFT on clusters of PCs. In: Proceedings of 8th International Euro-Par Conference (Euro-Par 2002). Lecture Notes in Computer Science, vol. 2400, pp. 691–700. Springer, Berlin (2002) Takahashi, D., Boku, T., Sato, M.: A blocking algorithm for parallel 1-D FFT on clusters of PCs. In: Proceedings of 8th International Euro-Par Conference (Euro-Par 2002). Lecture Notes in Computer Science, vol. 2400, pp. 691–700. Springer, Berlin (2002)
50.
go back to reference Takahashi, D., Kanada, Y.: High-performance radix-2, 3 and 5 parallel 1-D complex FFT algorithms for distributed-memory parallel computers. J. Supercomput. 15, 207–228 (2000)CrossRef Takahashi, D., Kanada, Y.: High-performance radix-2, 3 and 5 parallel 1-D complex FFT algorithms for distributed-memory parallel computers. J. Supercomput. 15, 207–228 (2000)CrossRef
51.
go back to reference Takahashi, D., Uno, A., Yokokawa, M.: An implementation of parallel 1-D FFT on the K computer. In: Proceedings of 2012 IEEE 14th International Conference on High Performance Computing and Communications (HPCC-2012), pp. 344–350 (2012) Takahashi, D., Uno, A., Yokokawa, M.: An implementation of parallel 1-D FFT on the K computer. In: Proceedings of 2012 IEEE 14th International Conference on High Performance Computing and Communications (HPCC-2012), pp. 344–350 (2012)
52.
go back to reference Tang, P.T.P., Park, J., Kim, D., Petrov, V.: A framework for low-communication 1-D FFT. In: Proceedings of International Conference on High Performance Computing, Networking, Storage and Analysis (SC’12) (2012) Tang, P.T.P., Park, J., Kim, D., Petrov, V.: A framework for low-communication 1-D FFT. In: Proceedings of International Conference on High Performance Computing, Networking, Storage and Analysis (SC’12) (2012)
53.
go back to reference Vadhiyar, S.S., Fagg, G.E., Dongarra, J.: Automatically tuned collective communications. In: Proceedings of 2000 ACM/IEEE Conference on Supercomputing (SC’00) (2000) Vadhiyar, S.S., Fagg, G.E., Dongarra, J.: Automatically tuned collective communications. In: Proceedings of 2000 ACM/IEEE Conference on Supercomputing (SC’00) (2000)
54.
go back to reference Wang, H., Potluri, S., Luo, M., Singh, A.K., Sur, S., Panda, D.K.: MVAPICH2-GPU: optimized GPU to GPU communication for InfiniBand clusters. Comput. Sci. Res. Dev. 26, 257–266 (2011)CrossRef Wang, H., Potluri, S., Luo, M., Singh, A.K., Sur, S., Panda, D.K.: MVAPICH2-GPU: optimized GPU to GPU communication for InfiniBand clusters. Comput. Sci. Res. Dev. 26, 257–266 (2011)CrossRef
Metadata
Title
Parallel FFT Algorithms for Distributed-Memory Parallel Computers
Author
Daisuke Takahashi
Copyright Year
2019
Publisher
Springer Singapore
DOI
https://doi.org/10.1007/978-981-13-9965-7_8

Premium Partner