Published in: International Journal of Computer Assisted Radiology and Surgery 5/2021

24-03-2021 | Original Article

Cross-modal self-supervised representation learning for gesture and skill recognition in robotic surgery

Authors: Jie Ying Wu, Aniruddha Tamhane, Peter Kazanzides, Mathias Unberath


Abstract

Purpose

Multi- and cross-modal learning consolidates information from multiple data sources, which may offer a holistic representation of complex scenarios. Cross-modal learning is particularly interesting because synchronized data streams are immediately useful as self-supervisory signals. The prospect of self-supervised continual learning in surgical robotics is exciting, as it may enable lifelong learning that adapts to different surgeons and cases, ultimately leading to a more general machine understanding of surgical processes.

Methods

We present a learning paradigm using synchronous video and kinematics from robot-mediated surgery. Our approach relies on an encoder–decoder network that maps optical flow to the corresponding kinematics sequence. Clustering on the latent representations reveals meaningful groupings for surgeon gesture and skill level. We demonstrate the generalizability of the representations on the JIGSAWS dataset by classifying skill and gestures on tasks not used for training.
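
The cross-modal idea can be illustrated with a deliberately small sketch. It is not the authors' network: it assumes a purely linear encoder and decoder, per-sample feature vectors instead of sequences, and synthetic data in place of optical flow and robot kinematics. All names (`train`, `flows`, `kins`, `latents`) and all dimensions are hypothetical. The point is only that the kinematics serve as the training target, so no manual labels are needed, and that the bottleneck vector is what would later be clustered or classified.

```python
import random

def matvec(weights, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def random_matrix(rng, rows, cols, scale):
    """Matrix with entries drawn uniformly from [-scale, scale]."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def train(flows, kins, latent_dim=2, lr=0.01, epochs=200, seed=0):
    """Fit a linear encoder-decoder (flow -> latent -> kinematics) with SGD on MSE."""
    rng = random.Random(seed)
    d_in, d_out = len(flows[0]), len(kins[0])
    enc = random_matrix(rng, latent_dim, d_in, 0.5)   # flow features -> latent
    dec = random_matrix(rng, d_out, latent_dim, 0.5)  # latent -> kinematics
    losses = []
    for _ in range(epochs):
        total = 0.0
        for x, y in zip(flows, kins):
            z = matvec(enc, x)                         # latent representation
            err = [p - t for p, t in zip(matvec(dec, z), y)]
            total += sum(e * e for e in err) / d_out
            g_out = [2.0 * e / d_out for e in err]     # dLoss/dPrediction
            # Backpropagate to the latent before the decoder is updated.
            g_z = [sum(dec[i][j] * g_out[i] for i in range(d_out))
                   for j in range(latent_dim)]
            for i in range(d_out):
                for j in range(latent_dim):
                    dec[i][j] -= lr * g_out[i] * z[j]
            for j in range(latent_dim):
                for k in range(d_in):
                    enc[j][k] -= lr * g_z[j] * x[k]
        losses.append(total / len(flows))              # mean loss over the epoch
    return enc, dec, losses

# Synthetic stand-ins (assumption): 4-D "flow" features, 3-D "kinematics"
# generated through a hidden 2-D factor, so a 2-D bottleneck can fit them.
data_rng = random.Random(1)
true_enc = random_matrix(data_rng, 2, 4, 1.0)
true_dec = random_matrix(data_rng, 3, 2, 1.0)
flows = [[data_rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(64)]
kins = [matvec(true_dec, matvec(true_enc, x)) for x in flows]

enc, dec, losses = train(flows, kins)
latents = [matvec(enc, x) for x in flows]  # vectors one would cluster or classify
```

In this toy setting the supervision comes entirely from the second modality; a real pipeline would replace the linear maps with a sequence model and the synthetic vectors with flow fields and recorded kinematics.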

Results

For tasks seen in training, we report 59 to 70% accuracy in surgical gesture classification. On tasks beyond the training setup, we note 45 to 65% accuracy. Qualitatively, we find that unseen gestures form clusters in the latent space of novice actions, which may enable the automatic identification of novel interactions in a lifelong learning scenario.
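
To make the accuracy figures concrete, the sketch below shows one way classification accuracy on latent vectors can be computed. It assumes a nearest-centroid classifier as a simple stand-in; the abstract does not state which classifier produced the reported numbers, and the 2-D latent vectors and gesture labels here are invented for illustration.

```python
from collections import defaultdict

def fit_centroids(latents, labels):
    """Average the latent vectors of each label into one centroid per label."""
    sums, counts = {}, defaultdict(int)
    for z, lab in zip(latents, labels):
        prev = sums.get(lab, [0.0] * len(z))
        sums[lab] = [s + v for s, v in zip(prev, z)]
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in vec] for lab, vec in sums.items()}

def predict(centroids, z):
    """Return the label whose centroid is closest to z (squared Euclidean distance)."""
    def sq_dist(lab):
        return sum((a - b) ** 2 for a, b in zip(centroids[lab], z))
    return min(centroids, key=sq_dist)

def accuracy(centroids, latents, labels):
    """Fraction of latent vectors assigned to their true label."""
    hits = sum(predict(centroids, z) == lab for z, lab in zip(latents, labels))
    return hits / len(labels)

# Hypothetical latent vectors for two gesture labels (illustrative values only).
train_latents = [[0.0, 0.0], [0.0, 1.0], [4.0, 4.0], [5.0, 4.0]]
train_labels = ["G1", "G1", "G2", "G2"]
test_latents = [[0.2, 0.3], [4.0, 5.0], [2.0, 2.0], [1.0, 0.0]]
test_labels = ["G1", "G2", "G2", "G1"]

centroids = fit_centroids(train_latents, train_labels)
acc = accuracy(centroids, test_latents, test_labels)  # 3 of 4 correct -> 0.75
```

Evaluating on a task held out from representation learning would follow the same pattern, with the held-out task's latent vectors passed to `accuracy`.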

Conclusion

Predicting the synchronous kinematics sequence yields optical flow representations of surgical scenes that separate well even on new tasks the model has not seen before. While the representations are immediately useful for a variety of tasks, the self-supervised learning paradigm may also enable research in lifelong and user-specific learning.

Metadata
Title
Cross-modal self-supervised representation learning for gesture and skill recognition in robotic surgery
Authors
Jie Ying Wu
Aniruddha Tamhane
Peter Kazanzides
Mathias Unberath
Publication date
24-03-2021
Publisher
Springer International Publishing
Published in
International Journal of Computer Assisted Radiology and Surgery / Issue 5/2021
Print ISSN: 1861-6410
Electronic ISSN: 1861-6429
DOI
https://doi.org/10.1007/s11548-021-02343-y
