Abstract
Petascale supercomputers will be available by 2008. The largest of these complex leadership-class machines will probably have nearly 250K CPUs. These massively parallel systems pose a number of challenging operating system issues. In this paper, we focus on the issues most important for the system that will first breach the petaflop barrier: synchronization and collective operations, parallel I/O, and fault tolerance.
Index Terms
- Operating system issues for petascale systems