
Operating system issues for petascale systems

Published: 01 April 2006

Abstract

Petascale supercomputers will be available by 2008. The largest of these complex leadership-class machines will probably have nearly 250K CPUs. These massively parallel systems pose a number of challenging operating system issues. In this paper, we focus on the issues most important for the system that will first breach the petaflop barrier: synchronization and collective operations, parallel I/O, and fault tolerance.

