ABSTRACT
A recent development in radio astronomy is to replace traditional dishes with many small antennas. The signals are combined to form one large, virtual telescope. The enormous data streams are cross-correlated to filter out noise. This is especially challenging, since the computational demands grow quadratically with the number of data streams. Moreover, the correlator is not only computationally intensive, but also very I/O intensive. The LOFAR telescope, for instance, will produce over 100 terabytes per day. The future SKA telescope will require on the order of exaflops of compute and petabits/s of I/O. A recent trend is to correlate in software instead of dedicated hardware. This is done to increase flexibility and to reduce development effort. Examples include e-VLBI and LOFAR.
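The quadratic growth mentioned above can be illustrated with a minimal sketch. It assumes the standard formulation in which every pair of streams is multiplied (one sample by the complex conjugate of the other) and accumulated over time; the function names and the toy input are hypothetical and are not taken from the paper.

```python
from itertools import combinations_with_replacement


def baseline_count(n_streams, include_autocorrelations=True):
    """Number of stream pairs the correlator has to process."""
    if include_autocorrelations:
        return n_streams * (n_streams + 1) // 2
    return n_streams * (n_streams - 1) // 2


def correlate(samples):
    """Toy correlator (illustrative only, not the production kernel).

    samples[s][t] is the complex sample of stream s at time t.
    Returns {(i, j): sum over t of samples[i][t] * conj(samples[j][t])}
    for every pair with i <= j.
    """
    visibilities = {}
    for i, j in combinations_with_replacement(range(len(samples)), 2):
        visibilities[(i, j)] = sum(
            a * b.conjugate() for a, b in zip(samples[i], samples[j])
        )
    return visibilities


if __name__ == "__main__":
    # Doubling the number of streams roughly quadruples the pair count.
    for n in (4, 8, 16):
        print(n, "streams ->", baseline_count(n), "pairs")
```

Each input sample is reused across many pairs, which is why the balance between arithmetic throughput and data movement matters for this kernel.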
In this paper, we evaluate the correlator algorithm on multi-core CPUs and many-core architectures, such as NVIDIA and ATI GPUs, and the Cell/B.E. The correlator is a streaming, real-time application, and is much more I/O intensive than applications that are typically implemented on many-core hardware today. We compare with the LOFAR production correlator on an IBM Blue Gene/P supercomputer. We investigate performance, power efficiency, and programmability. We identify several important architectural problems which cause architectures to perform suboptimally. Our findings are applicable to data-intensive applications in general.
The results show that the processing power and memory bandwidth of current GPUs are highly imbalanced for correlation purposes. While the production correlator on the Blue Gene/P achieves a superb 96% of the theoretical peak performance, the figure is only 14% on ATI GPUs and 26% on NVIDIA GPUs. The Cell/B.E. processor, in contrast, achieves an excellent 92%. We found that the Cell/B.E. is also the most energy-efficient solution: it runs the correlator 5-7 times more energy efficiently than the Blue Gene/P. The research presented is an important pathfinder for next-generation telescopes.
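One way to reason about an imbalance between processing power and memory bandwidth is a roofline-style bound, sketched below. This is an illustrative model and not necessarily the analysis used in the paper; all numbers in the example are hypothetical placeholders, not measurements from any of the architectures discussed.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Roofline-style bound: limited by compute or by memory traffic."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)


def fraction_of_peak(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Share of the theoretical peak that the bound allows."""
    return attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte) / peak_gflops


if __name__ == "__main__":
    # Hypothetical device: 1000 GFLOP/s peak, 100 GB/s memory bandwidth.
    # A kernel doing 2 flops per byte moved is bandwidth bound here.
    print(fraction_of_peak(1000.0, 100.0, 2.0))
    # Raising data reuse to 20 flops per byte moves it to the compute roof.
    print(fraction_of_peak(1000.0, 100.0, 20.0))
```

Under this model, a device with high peak throughput but comparatively low bandwidth reaches a small fraction of its peak unless the kernel reuses each loaded sample many times.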