2019 | OriginalPaper | Book chapter

The Implementation and Evaluation of High-Speed Link Monitoring Tool for Supercomputer

Authors: Jiaqing Xu, Jie He, Xiaotao Hu, Jijun Cao, Lei Zhang, Chongfeng Wang

Published in: Computer Engineering and Technology

Publisher: Springer Singapore

Abstract

As system scale and link speeds increase, link failure has become the most important type of interconnect fault in supercomputers, posing great challenges for the maintenance of high-performance interconnect networks. To let operation and maintenance personnel monitor the status and performance of all high-speed links of a supercomputer in real time, this paper designs a high-speed link monitoring tool based on the in-band network. The tool offers good scalability and robustness for real-time monitoring of high-speed link status and performance information. It has been used in practice in the operation and maintenance of domestic supercomputers to speed up locating and troubleshooting link failures, effectively reducing supercomputer downtime.

Metadata
Title
The Implementation and Evaluation of High-Speed Link Monitoring Tool for Supercomputer
Authors
Jiaqing Xu
Jie He
Xiaotao Hu
Jijun Cao
Lei Zhang
Chongfeng Wang
Copyright year
2019
Publisher
Springer Singapore
DOI
https://doi.org/10.1007/978-981-13-5919-4_16
