2020 | Book

Medical Image Computing and Computer Assisted Intervention – MICCAI 2020

23rd International Conference, Lima, Peru, October 4–8, 2020, Proceedings, Part III

Edited by: Prof. Anne L. Martel, Purang Abolmaesumi, Danail Stoyanov, Diana Mateus, Maria A. Zuluaga, S. Kevin Zhou, Daniel Racoceanu, Prof. Leo Joskowicz

Publisher: Springer International Publishing

Book series: Lecture Notes in Computer Science

About this book

The seven-volume set LNCS 12261, 12262, 12263, 12264, 12265, 12266, and 12267 constitutes the refereed proceedings of the 23rd International Conference on Medical Image Computing and Computer-Assisted Intervention, MICCAI 2020, held in Lima, Peru, in October 2020. The conference was held virtually due to the COVID-19 pandemic.

The 542 revised full papers presented were carefully reviewed and selected from 1809 submissions in a double-blind review process. The papers are organized in the following topical sections:

Part I: machine learning methodologies

Part II: image reconstruction; prediction and diagnosis; cross-domain methods and reconstruction; domain adaptation; machine learning applications; generative adversarial networks

Part III: CAI applications; image registration; instrumentation and surgical phase detection; navigation and visualization; ultrasound imaging; video image analysis

Part IV: segmentation; shape models and landmark detection

Part V: biological, optical, microscopic imaging; cell segmentation and stain normalization; histopathology image analysis; ophthalmology

Part VI: angiography and vessel analysis; breast imaging; colonoscopy; dermatology; fetal imaging; heart and lung imaging; musculoskeletal imaging

Part VII: brain development and atlases; DWI and tractography; functional brain networks; neuroimaging; positron emission tomography

Table of Contents

Frontmatter

CAI Applications

Frontmatter
Reconstructing Sinus Anatomy from Endoscopic Video – Towards a Radiation-Free Approach for Quantitative Longitudinal Assessment

Reconstructing accurate 3D surface models of sinus anatomy directly from an endoscopic video is a promising avenue for cross-sectional and longitudinal analysis to better understand the relationship between sinus anatomy and surgical outcomes. We present a patient-specific, learning-based method for 3D reconstruction of sinus surface anatomy directly and only from endoscopic videos. We demonstrate the effectiveness and accuracy of our method on in and ex vivo data where we compare to sparse reconstructions from Structure from Motion, dense reconstruction from COLMAP, and ground truth anatomy from CT. Our textured reconstructions are watertight and enable measurement of clinically relevant parameters in good agreement with CT. The source code is available at https://github.com/lppllppl920/DenseReconstruction-Pytorch .

Xingtong Liu, Maia Stiber, Jindan Huang, Masaru Ishii, Gregory D. Hager, Russell H. Taylor, Mathias Unberath
Inertial Measurements for Motion Compensation in Weight-Bearing Cone-Beam CT of the Knee

Involuntary motion during weight-bearing cone-beam computed tomography (CT) scans of the knee causes artifacts in the reconstructed volumes making them unusable for clinical diagnosis. Currently, image-based or marker-based methods are applied to correct for this motion, but often require long execution or preparation times. We propose to attach an inertial measurement unit (IMU) containing an accelerometer and a gyroscope to the leg of the subject in order to measure the motion during the scan and correct for it. To validate this approach, we present a simulation study using real motion measured with an optical 3D tracking system. With this motion, an XCAT numerical knee phantom is non-rigidly deformed during a simulated CT scan creating motion corrupted projections. A biomechanical model is animated with the same tracked motion in order to generate measurements of an IMU placed below the knee. In our proposed multi-stage algorithm, these signals are transformed to the global coordinate system of the CT scan and applied for motion compensation during reconstruction. Our proposed approach can effectively reduce motion artifacts in the reconstructed volumes. Compared to the motion corrupted case, the average structural similarity index and root mean squared error with respect to the no-motion case improved by 13–21% and 68–70%, respectively. These results are qualitatively and quantitatively on par with a state-of-the-art marker-based method we compared our approach to. The presented study shows the feasibility of this novel approach, and yields promising results towards a purely IMU-based motion compensation in C-arm CT.

Jennifer Maier, Marlies Nitschke, Jang-Hwan Choi, Garry Gold, Rebecca Fahrig, Bjoern M. Eskofier, Andreas Maier
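
The motion signals above are measured by an accelerometer and a gyroscope and must be expressed in the coordinate frame of the CT scan. As a minimal sketch of one building block of such a pipeline, the snippet below integrates gyroscope angular velocities into an orientation estimate; the sampling rate, variable names, and first-order integration scheme are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def integrate_gyroscope(angular_velocity, dt):
    """First-order integration of gyroscope samples (rad/s, shape (N, 3))
    into a sequence of orientations, starting from the identity."""
    orientation = Rotation.identity()
    orientations = [orientation]
    for omega in angular_velocity:
        # rotation over one sample interval, assuming omega is constant within it
        delta = Rotation.from_rotvec(omega * dt)
        orientation = orientation * delta
        orientations.append(orientation)
    return orientations

# toy example: constant rotation of 10 deg/s about the z-axis for 2 seconds at 100 Hz
dt = 0.01
omega = np.tile([0.0, 0.0, np.deg2rad(10.0)], (200, 1))
final = integrate_gyroscope(omega, dt)[-1]
print(final.as_euler("xyz", degrees=True))  # approximately [0, 0, 20]
```
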
Feasibility Check: Can Audio Be a Simple Alternative to Force-Based Feedback for Needle Guidance?

Accurate needle placement is highly relevant for puncture of anatomical structures. The clinician’s experience and medical imaging are essential to complete these procedures safely. However, imaging may come with inaccuracies due to image artifacts. Sensor-based solutions have been proposed for acquiring additional guidance information. These sensors typically need to be embedded in the instrument tip, leading to direct tissue contact, sterilization issues, and added device complexity, risk, and cost. Recently, an audio-based technique has been proposed for “listening” to needle tip-tissue interactions with an externally placed sensor. This technique has shown promising results for different applications, but the relation between the interaction event and the generated audio excitation is still not fully understood. This work aims to study this relationship, using a force sensor as a reference, by relating events and dynamical characteristics occurring in the audio signal with those occurring in the force signal. We want to show that dynamical information provided by a well-established sensor such as force can also be extracted from a low-cost and simple sensor such as audio. To this end, the Pearson coefficient was used for signal-to-signal correlation between extracted audio and force indicators, and an event-to-event correlation between audio and force was performed by computing features from the indicators. Results show high correlation between audio and force indicators, in the range of 0.53 to 0.72. These promising results demonstrate the usability of audio sensing for tissue-tool interaction and its potential to improve telemanipulated and robotic surgery in the future.

Alfredo Illanes, Axel Boese, Michael Friebe, Christian Hansen
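
The signal-to-signal correlation described above relies on the Pearson coefficient between audio-derived and force-derived indicators. A minimal sketch of that computation, using synthetic placeholder signals rather than real audio or force recordings, could look as follows.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

# placeholder indicator signals: a shared underlying event signal plus noise
events = rng.standard_normal(1000)
audio_indicator = events + 0.5 * rng.standard_normal(1000)
force_indicator = events + 0.5 * rng.standard_normal(1000)

r, p_value = pearsonr(audio_indicator, force_indicator)
print(f"Pearson r = {r:.2f} (p = {p_value:.3g})")
```
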
A Graph-Based Method for Optimal Active Electrode Selection in Cochlear Implants

The cochlear implant (CI) is a neural prosthetic that is the standard-of-care treatment for severe-to-profound hearing loss. CIs consist of an electrode array inserted into the cochlea that electrically stimulates auditory nerve fibers to induce the sensation of hearing. Competing stimuli occur when multiple electrodes stimulate the same neural pathways. This is known to negatively impact hearing outcomes. Previous research has shown that image-processing techniques can be used to analyze the CI position in CT scans to estimate the degree of competition between electrodes based on the CI user’s unique anatomy and electrode placement. The resulting data permits an algorithm or expert to select a subset of electrodes to keep active to alleviate competition. Expert selection of electrodes using this data has been shown in clinical studies to lead to significantly improved hearing outcomes for CI users. Currently, we aim to translate these techniques to a system designed for worldwide clinical use, which mandates that the selection of active electrodes be automated by robust algorithms. Previously proposed techniques produce optimal plans with only 48% success rate. In this work, we propose a new graph-based approach. We design a graph with nodes that represent electrodes and edge weights that encode competition between electrode pairs. We then find an optimal path through this graph to determine the active electrode set. Our method produces results judged by an expert to be optimal in over 95% of cases. This technique could facilitate widespread clinical translation of image-guided cochlear implant programming methods.

Erin Bratu, Robert Dwyer, Jack Noble
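
The abstract above frames active electrode selection as finding an optimal path through a graph whose nodes are electrodes and whose edge weights encode competition. The toy sketch below illustrates that idea with a shortest-path search in networkx; the random competition matrix, the local connectivity, and the path-based objective are illustrative assumptions, not the authors' exact formulation.

```python
import networkx as nx
import numpy as np

rng = np.random.default_rng(1)
n_electrodes = 12

# symmetric placeholder "competition" scores between electrode pairs
competition = rng.uniform(0.1, 1.0, size=(n_electrodes, n_electrodes))
competition = (competition + competition.T) / 2

G = nx.DiGraph()
# connect each electrode to the next few, weighting edges by competition;
# skipping an intermediate electrode corresponds to deactivating it
for i in range(n_electrodes):
    for j in range(i + 1, min(i + 4, n_electrodes)):
        G.add_edge(i, j, weight=competition[i, j])

# a minimum-weight path from the most apical to the most basal electrode
# yields an active subset with low accumulated competition
active = nx.shortest_path(G, source=0, target=n_electrodes - 1, weight="weight")
print("active electrodes:", active)
```
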
Improved Resection Margins in Surgical Oncology Using Intraoperative Mass Spectrometry

PURPOSE: Incomplete tumor resections lead to the presence of cancer cells on the resection margins, demanding subsequent revision surgery and resulting in poor outcomes for patients. Intraoperative evaluation of tissue pathology, including the surgical margins, can help decrease the burden of repeat surgeries on patients and healthcare systems. In this study, we propose adapting multiple instance learning (MIL) for prospective and intraoperative basal cell carcinoma (BCC) detection in surgical margins using mass spectrometry. METHODS: Resected specimens were collected, inspected by a pathologist, and burnt with the iKnife. Retrospective training data were collected with a standard cautery tip and included 63 BCC and 127 normal burns. Prospective data were collected for testing with both the standard and a fine-tip cautery, comprising 130 (66 BCC and 64 normal) and 99 (32 BCC and 67 normal) burns, respectively. An attention-based MIL model was adapted and applied to this dataset. RESULTS: Our models were able to predict BCC at surgical margins with an AUC as high as 91%. The models were robust to changes in cautery tip, although their performance decreased slightly. The models were also tested intraoperatively and achieved an accuracy of 94%. CONCLUSION: This is the first study to apply the concept of MIL for tissue characterization in perioperative and intraoperative REIMS data.

Amoon Jamzad, Alireza Sedghi, Alice M. L. Santilli, Natasja N. Y. Janssen, Martin Kaufmann, Kevin Y. M. Ren, Kaitlin Vanderbeck, Ami Wang, Doug McKay, John F. Rudan, Gabor Fichtinger, Parvin Mousavi
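
The attention-based MIL model mentioned above aggregates per-burn (instance) spectra into a bag-level prediction using learned attention weights. A minimal sketch of such attention pooling in the style of Ilse et al., with arbitrary embedding sizes and a toy bag, is shown below; it is not the authors' architecture.

```python
import torch
import torch.nn as nn

class AttentionMILPooling(nn.Module):
    """Aggregate a bag of instance embeddings (n_instances, dim) into one bag embedding."""
    def __init__(self, dim: int, attn_dim: int = 64):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(dim, attn_dim),
            nn.Tanh(),
            nn.Linear(attn_dim, 1),
        )

    def forward(self, instances: torch.Tensor) -> torch.Tensor:
        scores = self.attention(instances)        # (n_instances, 1)
        weights = torch.softmax(scores, dim=0)    # attention over instances
        return (weights * instances).sum(dim=0)   # (dim,)

# toy bag of 10 spectral embeddings of dimension 32, classified as BCC vs. normal
bag = torch.randn(10, 32)
pooling = AttentionMILPooling(dim=32)
classifier = nn.Linear(32, 2)
logits = classifier(pooling(bag))
print(logits.shape)  # torch.Size([2])
```
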
Self-Supervised Domain Adaptation for Patient-Specific, Real-Time Tissue Tracking

Estimating tissue motion is crucial to provide automatic motion stabilization and guidance during surgery. However, endoscopic images often lack distinctive features, and fine tissue deformation can only be captured with dense tracking methods like optical flow. To achieve high accuracy at high processing rates, we propose fine-tuning a fast optical flow model to an unlabeled patient-specific image domain. We adopt multiple strategies to achieve unsupervised fine-tuning. First, we utilize a teacher-student approach to transfer knowledge from a slow but accurate teacher model to a fast student model. Second, we develop self-supervised tasks where the model is encouraged to learn from different but related examples. Comparisons with out-of-the-box models show that our method achieves significantly better results. Our experiments uncover the effects of different task combinations. We demonstrate that unsupervised fine-tuning can improve the performance of CNN-based tissue tracking, opening up a promising future direction.

Sontje Ihler, Felix Kuhnke, Max-Heinrich Laves, Tobias Ortmaier
An Interactive Mixed Reality Platform for Bedside Surgical Procedures

In many bedside procedures, surgeons must rely on their spatiotemporal reasoning to estimate the position of an internal target by manually measuring external anatomical landmarks. One particular example that is performed frequently in neurosurgery is ventriculostomy, where the surgeon inserts a catheter into the patient’s skull to divert the cerebrospinal fluid and alleviate the intracranial pressure. However, about one-third of the insertions miss the target. We, therefore, assembled a team of engineers and neurosurgeons to develop an interactive surgical navigation system using mixed reality on a head-mounted display that overlays the target, identified in preoperative images, directly on the patient’s anatomy and provides visual guidance for the surgeon to insert the catheter on the correct path to the target. We conducted a user study to evaluate the improvement in the accuracy and precision of the insertions with mixed reality as well as the usability of our navigation system. The results indicate that using mixed reality improves the accuracy by over 35% and that the system ranks high based on the usability score.

Ehsan Azimi, Zhiyuan Niu, Maia Stiber, Nicholas Greene, Ruby Liu, Camilo Molina, Judy Huang, Chien-Ming Huang, Peter Kazanzides
Ear Cartilage Inference for Reconstructive Surgery with Convolutional Mesh Autoencoders

Many children born with ear microtia undergo reconstructive surgery for both aesthetic and functional purposes. This surgery is a delicate procedure that requires the surgeon to carve a “scaffold” for a new ear, typically from the patient’s own rib cartilage. This is an unnecessarily invasive procedure, and reconstruction relies on the skill of the surgeon to accurately construct a scaffold that best suits the patient based on limited data. Work in stem-cell technologies and bioprinting presents an opportunity to change this procedure by making it possible to “bioprint” a personalised cartilage scaffold in a lab. To do so, however, a 3D model of the desired cartilage shape is first required. In this paper, we optimise the standard convolutional mesh autoencoder framework such that, given only the soft tissue surface of an unaffected ear, it can accurately predict the shape of the underlying cartilage. To prevent predicted cartilage meshes from intersecting with, and protruding through, the soft tissue ear mesh, we develop a novel intersection-based loss function. These combined efforts present a means of designing personalised ear cartilage scaffolds for use in reconstructive ear surgery.

Eimear O’ Sullivan, Lara van de Lande, Antonia Osolos, Silvia Schievano, David J. Dunaway, Neil Bulstrode, Stefanos Zafeiriou
Robust Multi-modal 3D Patient Body Modeling

This paper considers the problem of 3D patient body modeling. Such a 3D model provides valuable information for improving patient care, streamlining clinical workflow, automated parameter optimization for medical devices etc. With the popularity of 3D optical sensors and the rise of deep learning, this problem has seen much recent development. However, existing art is mostly constrained by requiring specific types of sensors as well as limited data and labels, making them inflexible to be ubiquitously used across various clinical applications. To address these issues, we present a novel robust dynamic fusion technique that facilitates flexible multi-modal inference, resulting in accurate 3D body modeling even when the input sensor modality is only a subset of the training modalities. This leads to a more scalable and generic framework that does not require repeated application-specific data collection and model retraining, hence achieving an important flexibility towards developing cost-effective clinically-deployable machine learning models. We evaluate our method on several patient positioning datasets and demonstrate its efficacy compared to competing methods, even showing robustness in challenging patient-under-the-cover clinical scenarios.

Fan Yang, Ren Li, Georgios Georgakis, Srikrishna Karanam, Terrence Chen, Haibin Ling, Ziyan Wu
A New Electromagnetic-Video Endoscope Tracking Method via Anatomical Constraints and Historically Observed Differential Evolution

This paper develops a new hybrid electromagnetic-video endoscope 3-D tracking method that introduces anatomical structure constraints and historically observed differential evolution for surgical navigation. Current endoscope tracking approaches are still hampered by image artifacts, tissue deformation, and inaccurate sensor outputs during endoscopic navigation. To deal with these limitations, we spatially constrain inaccurate electromagnetic sensor measurements to the centerlines of anatomical tubular organs (e.g., the airway trees), which keeps the measurements physically inside the tubular organ and tackles the inaccuracy caused by respiratory motion and magnetic field distortion. We then propose historically observed differential evolution to precisely fuse the constrained sensor outputs and endoscopic video sequences. The new hybrid tracking framework was evaluated on clinical data, with the experimental results showing that our proposed method outperforms current hybrid approaches. In particular, the tracking error was significantly reduced from (5.9 mm, 9.9°) to (3.3 mm, 8.6°).

Xiongbiao Luo
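
Constraining electromagnetic sensor measurements to the centerlines of tubular organs, as described above, can in its simplest form be read as snapping each measurement to the nearest centerline point. The sketch below does exactly that with a k-d tree on a toy airway centerline; it is a deliberate simplification of the authors' spatial constraint.

```python
import numpy as np
from scipy.spatial import cKDTree

def snap_to_centerline(positions, centerline_points):
    """Project each 3D sensor measurement onto the nearest centerline point."""
    tree = cKDTree(centerline_points)
    _, indices = tree.query(positions)
    return centerline_points[indices]

# toy airway centerline along the z-axis and noisy sensor readings around it
centerline = np.stack([np.zeros(100), np.zeros(100), np.linspace(0, 99, 100)], axis=1)
measurements = centerline[::10] + np.random.default_rng(2).normal(0, 2.0, size=(10, 3))
print(snap_to_centerline(measurements, centerline))
```
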
Malocclusion Treatment Planning via PointNet Based Spatial Transformation Network

Orthodontic malocclusion treatment is a procedure to correct dental and facial morphology by moving teeth or adjusting underlying bones. It concentrates on two key aspects: the treatment planning for dentition alignment, and the plan implementation with the aid of external forces. Existing treatment planning requires significant time and effort from orthodontists and technicians. At present, no work successfully automates the process of tooth movement in orthodontics. In this study, we leverage state-of-the-art deep learning methods and propose an automated treatment planning process that takes advantage of the spatial interrelationship between different teeth. Our method exploits a 3-dimensional spatial transformation architecture for malocclusion treatment planning in four steps: (1) sub-sampling the dentition point cloud to obtain a critical point set; (2) extracting local features for each tooth and global features for the whole dentition; (3) obtaining transformation parameters conditioned on features refined from the combination of both the local and global features; and (4) transforming the initial dentition point cloud to the parameter-defined final state. Our approach achieves 84.5% cosine similarity accuracy (CSA) for the transformation matrix on the non-augmented dataset, and 95.3% maximum CSA on the augmented dataset. The outcome of our approach is shown to be effective in quantitative analysis and semantically reasonable in qualitative analysis.

Xiaoshuang Li, Lei Bi, Jinman Kim, Tingyao Li, Peng Li, Ye Tian, Bin Sheng, Dagan Feng
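
The cosine similarity accuracy (CSA) reported above compares predicted and target transformations. One plausible reading of that metric, the cosine similarity between flattened transformation matrices, is sketched below; the exact parameterisation the authors use is an assumption here.

```python
import numpy as np

def cosine_similarity_of_transforms(pred: np.ndarray, target: np.ndarray) -> float:
    """Cosine similarity between two flattened transformation matrices."""
    a, b = pred.ravel(), target.ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# toy example: a small rotation about z plus a translation vs. the identity transform
theta = np.deg2rad(5.0)
pred = np.array([
    [np.cos(theta), -np.sin(theta), 0.0, 0.5],
    [np.sin(theta),  np.cos(theta), 0.0, 0.0],
    [0.0,            0.0,           1.0, 0.0],
    [0.0,            0.0,           0.0, 1.0],
])
print(cosine_similarity_of_transforms(pred, np.eye(4)))
```
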
Simulation of Brain Resection for Cavity Segmentation Using Self-supervised and Semi-supervised Learning

Resective surgery may be curative for drug-resistant focal epilepsy, but only 40% to 70% of patients achieve seizure freedom after surgery. Retrospective quantitative analysis could elucidate patterns in resected structures and patient outcomes to improve resective surgery. However, the resection cavity must first be segmented on the postoperative MR image. Convolutional neural networks (CNNs) are the state-of-the-art image segmentation technique, but require large amounts of annotated data for training. Annotation of medical images is a time-consuming process requiring highly trained raters and often suffering from high inter-rater variability. Self-supervised learning can be used to generate training instances from unlabeled data. We developed an algorithm to simulate resections on preoperative MR images. We curated a new dataset, EPISURG, comprising 431 postoperative and 269 preoperative MR images from 431 patients who underwent resective surgery. In addition to EPISURG, we used three public datasets comprising 1813 preoperative MR images for training. We trained a 3D CNN on artificially resected images created on the fly during training, using images from 1) EPISURG, 2) the public datasets and 3) both. To evaluate the trained models, we calculated the Dice score (DSC) between model segmentations and 200 manual annotations performed by three human raters. The model trained on data with manual annotations obtained a median (interquartile range) DSC of 65.3 (30.6). The DSC of our best-performing model, trained with no manual annotations, is 81.7 (14.2). For comparison, inter-rater agreement between human annotators was 84.0 (9.9). We demonstrate a training method for CNNs using simulated resection cavities that can accurately segment real resection cavities, without manual annotations.

Fernando Pérez-García, Roman Rodionov, Ali Alim-Marvasti, Rachel Sparks, John S. Duncan, Sébastien Ourselin
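
Simulating resections on preoperative MR images, as above, produces image-label training pairs without manual annotation. The sketch below carves a simple spherical cavity out of a toy volume to illustrate the idea; the authors' simulation is considerably more realistic, so treat this as a minimal stand-in.

```python
import numpy as np

def simulate_spherical_resection(volume, center, radius, fill_value=0.0):
    """Return a copy of `volume` with a spherical cavity carved out, plus the cavity mask."""
    zz, yy, xx = np.ogrid[:volume.shape[0], :volume.shape[1], :volume.shape[2]]
    cavity = ((zz - center[0]) ** 2 + (yy - center[1]) ** 2 + (xx - center[2]) ** 2) <= radius ** 2
    resected = volume.copy()
    resected[cavity] = fill_value
    return resected, cavity

# toy preoperative "MR image" and a simulated cavity used as the segmentation target
preop = np.random.default_rng(3).random((64, 64, 64)).astype(np.float32)
image, label = simulate_spherical_resection(preop, center=(32, 20, 40), radius=10)
print(label.sum(), "voxels in the simulated cavity")
```
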
Local Contractive Registration for Quantification of Tissue Shrinkage in Assessment of Microwave Ablation

Microwave ablation is an effective minimally invasive surgery for the treatment of liver cancer. The safety margin assessment is implemented by mapping the coagulation in the postoperative image to the tumor in the preoperative image. However, an accurate assessment is a challenging task because of the tissue shrinkage caused by dehydration during microwave ablation. This paper proposes a fast automatic assessment method to compensate for the underestimation of the coagulation caused by tissue shrinkage and to precisely quantify the tumor coverage. The proposed method is implemented on GPU and includes two main steps: (1) a local contractive nonrigid registration for registering the liver parenchyma around the coagulation, and (2) a fast Fourier transform-based Helmholtz-Hodge decomposition for quantifying the location of the shrinkage center and the volume of the original coagulation. The method was quantitatively evaluated on 50 groups of synthetic datasets and 9 groups of clinical MR datasets. Compared with five state-of-the-art methods, the lowest distance to the true deformation field (1.56 ± 0.74 mm) and the highest precision of the safety margin (88.89%) are obtained. The mean computation time is 111 ± 13 s. Results show that the proposed method efficiently improves the accuracy of the safety margin assessment and is thus a promising assessment tool for microwave ablation.

Dingkun Liu, Tianyu Fu, Danni Ai, Jingfan Fan, Hong Song, Jian Yang
Reinforcement Learning of Musculoskeletal Control from Functional Simulations

To diagnose, plan, and treat musculoskeletal pathologies, understanding and reproducing muscle recruitment for complex movements is essential. With muscle activations for movements often being highly redundant, nonlinear, and time-dependent, machine learning can provide a solution for their modeling and control in anatomy-specific musculoskeletal simulations. Sophisticated biomechanical simulations often require specialized computational environments, being numerically complex and slow, which hinders their integration with typical deep learning frameworks. In this work, a deep reinforcement learning (DRL) based inverse dynamics controller is trained to control muscle activations of a biomechanical model of the human shoulder. In a generalizable end-to-end fashion, muscle activations are learned given current and desired position-velocity pairs. A customized reward function for trajectory control is introduced, enabling straightforward extension to additional muscles and higher degrees of freedom. Using the biomechanical model, multiple episodes are simulated on a cluster simultaneously using the evolving neural models of the DRL being trained. Results are presented for single-axis motion control of shoulder abduction for the task of following randomly generated angular trajectories.

Emanuel Joos, Fabien Péan, Orcun Goksel
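
The customized reward for trajectory control mentioned above is not spelled out in the abstract; one plausible form penalises the distance between the current and desired position-velocity pair, as sketched below with illustrative weights.

```python
import numpy as np

def trajectory_reward(position, velocity, target_position, target_velocity,
                      w_pos=1.0, w_vel=0.1):
    """Negative weighted error between current and desired position-velocity pairs."""
    pos_error = np.linalg.norm(np.asarray(position) - np.asarray(target_position))
    vel_error = np.linalg.norm(np.asarray(velocity) - np.asarray(target_velocity))
    return -(w_pos * pos_error + w_vel * vel_error)

# toy shoulder-abduction step: current angle 30 deg moving at 5 deg/s, target 45 deg at rest
print(trajectory_reward([30.0], [5.0], [45.0], [0.0]))
```
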

Image Registration

Frontmatter
MvMM-RegNet: A New Image Registration Framework Based on Multivariate Mixture Model and Neural Network Estimation

Current deep-learning-based registration algorithms often exploit intensity-based similarity measures as the loss function, where dense correspondence between a pair of moving and fixed images is optimized through backpropagation during training. However, intensity-based metrics can be misleading when the assumption of intensity class correspondence is violated, especially in cross-modality or contrast-enhanced images. Moreover, existing learning-based registration methods are predominantly applicable to pairwise registration and are rarely extended to groupwise registration or simultaneous registration with multiple images. In this paper, we propose a new image registration framework based on multivariate mixture model (MvMM) and neural network estimation. A generative model consolidating both appearance and anatomical information is established to derive a novel loss function capable of implementing groupwise registration. We highlight the versatility of the proposed framework for various applications on multimodal cardiac images, including single-atlas-based segmentation (SAS) via pairwise registration and multi-atlas segmentation (MAS) unified by groupwise registration. We evaluated performance on two publicly available datasets, i.e. MM-WHS-2017 and MS-CMRSeg-2019. The results show that the proposed framework achieved an average Dice score of 0.871 ± 0.025 for whole-heart segmentation on MR images and 0.783 ± 0.082 for myocardium segmentation on LGE MR images (Code is available from https://zmiclab.github.io/projects.html).

Xinzhe Luo, Xiahai Zhuang
Database Annotation with Few Examples: An Atlas-Based Framework Using Diffeomorphic Registration of 3D Trees

Automatic annotation of anatomical structures can help simplify workflow during interventions in numerous clinical applications but usually involves a large amount of annotated data. The complexity of the labeling task, together with the lack of representative data, slows down the development of robust solutions. In this paper, we propose a solution requiring very few annotated cases to label 3D pelvic arterial trees of patients with benign prostatic hyperplasia. We take advantage of Large Deformation Diffeomorphic Metric Mapping (LDDMM) to perform registration based on meaningful deformations from which we build an atlas. Branch pairing is then computed from the atlas to new cases using optimal transport to ensure one-to-one correspondence during the labeling process. To tackle topological variations in the tree, which usually degrade the performance of atlas-based techniques, we propose a simple bottom-up label assignment adapted to the pelvic anatomy. The proposed method achieves 97.6% labeling precision with only 5 cases for training, while in comparison learning-based methods only reach 82.2% on such small training sets.

Pierre-Louis Antonsanti, Thomas Benseghir, Vincent Jugnon, Joan Glaunès
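
Branch pairing with a one-to-one correspondence, as described above, can be illustrated with a linear assignment over a branch-to-branch cost matrix. The Hungarian-algorithm sketch below is a simplified stand-in for the optimal-transport pairing used by the authors, and the Euclidean feature cost is an assumption.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(4)

# placeholder branch descriptors (e.g. position/orientation features) for atlas and new case
atlas_branches = rng.random((8, 5))
case_branches = rng.random((8, 5))

# pairwise cost: Euclidean distance between branch descriptors
cost = np.linalg.norm(atlas_branches[:, None, :] - case_branches[None, :, :], axis=-1)

# one-to-one pairing that minimises total cost; atlas labels are then propagated along it
atlas_idx, case_idx = linear_sum_assignment(cost)
for a, c in zip(atlas_idx, case_idx):
    print(f"atlas branch {a} -> case branch {c}")
```
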
Pair-Wise and Group-Wise Deformation Consistency in Deep Registration Network

Ideally the deformation field from one image to another should be invertible and smooth to register images bidirectionally and preserve topology of anatomical structures. In traditional registration methods, differential geometry constraints could guarantee such topological consistency but are computationally intensive and time consuming. Recent studies showed that image registration using deep neural networks is as accurate as and also much faster than traditional methods. Current popular unsupervised learning-based algorithms aim to directly estimate spatial transformations by optimizing similarity between images under registration; however, the estimated deformation fields are often in one direction and do not possess inverse-consistency if swapping the order of two input images. Notice that the consistent registration can reduce systematic bias caused by the order of input images, increase robustness, and improve reliability of subsequent data analysis. Accordingly, in this paper, we propose a new training strategy by introducing both pair-wise and group-wise deformation consistency constraints. Specifically, losses enforcing both inverse-consistency for image pairs and cycle-consistency for image groups are proposed for model training, in addition to conventional image similarity and topology constraints. Experiments on 3D brain magnetic resonance (MR) images showed that such a learning algorithm yielded consistent deformations even after switching the order of input images or reordering images within groups. Furthermore, the registration results of longitudinal elderly MR images demonstrated smaller volumetric measurement variability in labeling regions of interest (ROIs).

Dongdong Gu, Xiaohuan Cao, Shanshan Ma, Lei Chen, Guocai Liu, Dinggang Shen, Zhong Xue
Semantic Hierarchy Guided Registration Networks for Intra-subject Pulmonary CT Image Alignment

CT scanning has been widely used for diagnosis, staging and follow-up studies of pulmonary nodules, where image registration plays an essential role in follow-up assessment of CT images. However, it is challenging to align subtle structures in the lung CTs often with large deformation. Unsupervised learning-based registration methods, optimized according to the image similarity metrics, become popular in recent years due to their efficiency and robustness. In this work, we consider segmented tissues, i.e., airways, lobules, and pulmonary vessel structures, in a hierarchical way and propose a multi-stage registration workflow to predict deformation fields. The proposed workflow consists of two registration networks. The first network is the label alignment network, used to align the given segmentations. The second network is the vessel alignment network, used to further predict deformation fields to register vessels in lungs. By combining these two networks, we can register lung CT images not only in the semantic level but also in the texture level. In experiments, we evaluated the proposed algorithm on lung CT images for clinical follow-ups. The results indicate that our method has better performance especially in aligning critical structures such as airways and vessel branches in the lung, compared to the existing methods.

Liyun Chen, Xiaohuan Cao, Lei Chen, Yaozong Gao, Dinggang Shen, Qian Wang, Zhong Xue
Highly Accurate and Memory Efficient Unsupervised Learning-Based Discrete CT Registration Using 2.5D Displacement Search

Learning-based registration, in particular unsupervised approaches that use a deep network to predict a displacement field that minimises a conventional similarity metric, has gained huge interest within the last two years. It has, however, not yet reached the high accuracy of specialised conventional algorithms for estimating large 3D deformations. Employing a dense set of discrete displacements (in a so-called correlation layer) has shown great success in learning 2D optical flow estimation, cf. FlowNet and PWC-Net, but comes with excessive memory requirements when extended to 3D medical registration. We propose a highly accurate unsupervised learning framework for 3D abdominal CT registration that uses a discrete displacement layer and a contrast-invariant metric (MIND descriptors) that is evaluated in a probabilistic fashion. We realise a substantial reduction in memory and computational demand by iteratively subdividing the 3D search space into orthogonal planes. In our experimental validation on inter-subject deformable 3D registration, we demonstrate substantial improvements in accuracy (at least ≈10 Dice percentage points) compared to widely used conventional methods (ANTs SyN, NiftyReg, IRTK) and state-of-the-art U-Net based learning methods (VoxelMorph). We reduce the search space 5-fold, achieve a two-fold run-time speed-up, and are on par in terms of accuracy with a fully 3D discrete network.

Mattias P. Heinrich, Lasse Hansen
Unsupervised Learning Model for Registration of Multi-phase Ultra-Widefield Fluorescein Angiography

Registration methods based on unsupervised deep learning have achieved good performances, but are often ineffective on the registration of inhomogeneous images containing large displacements. In this paper, we propose an unsupervised learning-based registration method that effectively aligns multi-phase Ultra-Widefield (UWF) fluorescein angiography (FA) retinal images acquired over the time after a contrast agent is applied to the eye. The proposed method consists of an encoder-decoder style network for predicting displacements and spatial transformers to create moved images using the predicted displacements. Unlike existing methods, we transform the moving image as well as its vesselness map through the spatial transformers, and then compute the loss by comparing them with the target image and the corresponding maps. To effectively predict large displacements, displacement maps are estimated at multiple levels of a decoder and the losses computed from the maps are used in optimization. For evaluation, experiments were performed on 64 pairs of early- and late-phase UWF retinal images. Experimental results show that the proposed method outperforms the existing methods.

Gyoeng Min Lee, Kwang Deok Seo, Hye Ju Song, Dong Geun Park, Ga Hyung Ryu, Min Sagong, Sang Hyun Park
Large Deformation Diffeomorphic Image Registration with Laplacian Pyramid Networks

Deep learning-based methods have recently demonstrated promising results in deformable image registration for a wide range of medical image analysis tasks. However, existing deep learning-based methods are usually limited to small deformation settings, and desirable properties of the transformation including bijective mapping and topology preservation are often being ignored by these approaches. In this paper, we propose a deep Laplacian Pyramid Image Registration Network, which can solve the image registration optimization problem in a coarse-to-fine fashion within the space of diffeomorphic maps. Extensive quantitative and qualitative evaluations on two MR brain scan datasets show that our method outperforms the existing methods by a significant margin while maintaining desirable diffeomorphic properties and promising registration speed.

Tony C. W. Mok, Albert C. S. Chung
Adversarial Uni- and Multi-modal Stream Networks for Multimodal Image Registration

Deformable image registration between Computed Tomography (CT) images and Magnetic Resonance (MR) imaging is essential for many image-guided therapies. In this paper, we propose a novel translation-based unsupervised deformable image registration method. Distinct from other translation-based methods that attempt to convert the multimodal problem (e.g., CT-to-MR) into a unimodal problem (e.g., MR-to-MR) via image-to-image translation, our method leverages the deformation fields estimated from both: (i) the translated MR image and (ii) the original CT image in a dual-stream fashion, and automatically learns how to fuse them to achieve better registration performance. The multimodal registration network can be effectively trained by computationally efficient similarity metrics without any ground-truth deformation. Our method has been evaluated on two clinical datasets and demonstrates promising results compared to state-of-the-art traditional and learning-based methods.

Zhe Xu, Jie Luo, Jiangpeng Yan, Ritvik Pulya, Xiu Li, William Wells III, Jayender Jagadeesan
Cross-Modality Multi-atlas Segmentation Using Deep Neural Networks

Both image registration and label fusion in the multi-atlas segmentation (MAS) rely on the intensity similarity between target and atlas images. However, such similarity can be problematic when target and atlas images are acquired using different imaging protocols. High-level structure information can provide reliable similarity measurement for cross-modality images when cooperating with deep neural networks (DNNs). This work presents a new MAS framework for cross-modality images, where both image registration and label fusion are achieved by DNNs. For image registration, we propose a consistent registration network, which can jointly estimate forward and backward dense displacement fields (DDFs). Additionally, an invertible constraint is employed in the network to reduce the correspondence ambiguity of the estimated DDFs. For label fusion, we adapt a few-shot learning network to measure the similarity of atlas and target patches. Moreover, the network can be seamlessly integrated into the patch-based label fusion. The proposed framework is evaluated on the MM-WHS dataset of MICCAI 2017. Results show that the framework is effective in both cross-modality registration and segmentation.

Wangbin Ding, Lei Li, Xiahai Zhuang, Liqin Huang
Longitudinal Image Registration with Temporal-Order and Subject-Specificity Discrimination

Morphological analysis of longitudinal MR images plays a key role in monitoring disease progression for prostate cancer patients who are placed under an active surveillance program. In this paper, we describe a learning-based image registration algorithm to quantify changes in regions of interest between a pair of images from the same patient, acquired at two different time points. Combining intensity-based similarity and gland segmentation as weak supervision, the population-data-trained registration networks significantly lowered the target registration errors (TREs) on holdout patient data, compared with those before registration and those from an iterative registration algorithm. Furthermore, this work provides a quantitative analysis of several longitudinal-data-sampling strategies and, in turn, we propose a novel regularisation method based on maximum mean discrepancy between differently-sampled training image pairs. Based on 216 3D MR images from 86 patients, we report a mean TRE of 5.6 mm and show statistically significant differences between the different training data sampling strategies.

Qianye Yang, Yunguan Fu, Francesco Giganti, Nooshin Ghavami, Qingchao Chen, J. Alison Noble, Tom Vercauteren, Dean Barratt, Yipeng Hu
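
The regularisation above is based on maximum mean discrepancy (MMD) between differently-sampled training image pairs. A standard biased estimate of squared MMD with a Gaussian kernel over two sets of feature vectors is sketched below; the bandwidth and toy features are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Pairwise Gaussian kernel matrix between rows of x and y."""
    d2 = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    """Biased estimate of the squared maximum mean discrepancy between samples x and y."""
    return (gaussian_kernel(x, x, sigma).mean()
            + gaussian_kernel(y, y, sigma).mean()
            - 2.0 * gaussian_kernel(x, y, sigma).mean())

rng = np.random.default_rng(5)
features_a = rng.normal(0.0, 1.0, size=(64, 16))   # e.g. features from one sampling strategy
features_b = rng.normal(0.5, 1.0, size=(64, 16))   # e.g. features from another strategy
print(mmd2(features_a, features_b))
```
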
Flexible Bayesian Modelling for Nonlinear Image Registration

We describe a diffeomorphic registration algorithm that allows groups of images to be accurately aligned to a common space, which we intend to incorporate into the SPM software. The idea is to perform inference in a probabilistic graphical model that accounts for variability in both shape and appearance. The resulting framework is general and entirely unsupervised. The model is evaluated on inter-subject registration of 3D human brain scans. Here, the main modeling assumption is that individual anatomies can be generated by deforming a latent ‘average’ brain. The method is agnostic to imaging modality and can be applied with no prior processing. We evaluate the algorithm using freely available, manually labelled datasets. In this validation, we achieve state-of-the-art results, within reasonable runtimes, compared with widely used, state-of-the-art inter-subject registration algorithms. On the unprocessed dataset, the increase in overlap score is over 17%. These results demonstrate the benefits of using informative computational anatomy frameworks for nonlinear registration.

Mikael Brudfors, Yaël Balbastre, Guillaume Flandin, Parashkev Nachev, John Ashburner
Are Registration Uncertainty and Error Monotonically Associated?

In image-guided neurosurgery, current commercial systems usually provide only rigid registration, partly because it is harder to predict, validate and understand non-rigid registration error. For instance, when surgeons see a discrepancy in aligned image features, they may not be able to distinguish between registration error and actual tissue deformation caused by tumor resection. In this case, the spatial distribution of registration error could help them make more informed decisions, e.g., ignoring the registration where the estimated error is high. However, error estimates are difficult to acquire. Probabilistic image registration (PIR) methods provide measures of registration uncertainty, which could be a surrogate for assessing the registration error. It is intuitive and believed by many clinicians that high uncertainty indicates a large error. However, the monotonic association between uncertainty and error has not been examined in image registration literature. In this pilot study, we attempt to address this fundamental problem by looking at one PIR method, the Gaussian process (GP) registration. We systematically investigate the relation between GP uncertainty and error based on clinical data and show empirically that there is a weak-to-moderate positive monotonic correlation between point-wise GP registration uncertainty and non-rigid registration error.

Jie Luo, Sarah Frisken, Duo Wang, Alexandra Golby, Masashi Sugiyama, William Wells III
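
Monotonic association, as examined above, is commonly quantified with Spearman's rank correlation between point-wise uncertainty and registration error. The sketch below shows that computation on synthetic stand-ins for the clinical GP registration outputs.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(6)

# synthetic stand-ins: registration error and an uncertainty only weakly related to it
error = rng.gamma(shape=2.0, scale=1.0, size=500)
uncertainty = 0.3 * error + rng.gamma(shape=2.0, scale=1.0, size=500)

rho, p_value = spearmanr(uncertainty, error)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3g})")
```
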
MR-to-US Registration Using Multiclass Segmentation of Hepatic Vasculature with a Reduced 3D U-Net

Accurate hepatic vessel segmentation and registration using ultrasound (US) can contribute to beneficial navigation during hepatic surgery. However, it is challenging due to noise and speckle in US imaging and liver deformation. Therefore, a workflow is developed using a reduced 3D U-Net for segmentation, followed by non-rigid coherent point drift (CPD) registration. By means of electromagnetically tracked US, 61 3D volumes were acquired during surgery. Dice scores of 0.77, 0.65 and 0.66 were achieved for segmentation of all vasculature, hepatic vein and portal vein respectively. This compares to inter-observer variabilities of 0.85, 0.88 and 0.74 respectively. Target registration error at a tumor lesion of interest was lower (7.1 mm) when registration is performed either on the hepatic or the portal vein, compared to using all vasculature (8.9 mm). Using clinical data, we developed a workflow consisting of multi-class segmentation combined with selective non-rigid registration that leads to sufficient accuracy for integration in computer assisted liver surgery.

Bart R. Thomson, Jasper N. Smit, Oleksandra V. Ivashchenko, Niels F. M. Kok, Koert F. D. Kuhlmann, Theo J. M. Ruers, Matteo Fusaglia
Detecting Pancreatic Ductal Adenocarcinoma in Multi-phase CT Scans via Alignment Ensemble

Pancreatic ductal adenocarcinoma (PDAC) is one of the most lethal cancers among the population. Screening for PDACs in dynamic contrast-enhanced CT is beneficial for early diagnosis. In this paper, we investigate the problem of automatically detecting PDACs in multi-phase (arterial and venous) CT scans. Multiple phases provide more information than a single phase, but they are unaligned and inhomogeneous in texture, making it difficult to combine cross-phase information seamlessly. We study multiple phase alignment strategies, i.e., early alignment (image registration), late alignment (high-level feature registration), and slow alignment (multi-level feature registration), and suggest an ensemble of all these alignments as a promising way to boost the performance of PDAC detection. We provide an extensive empirical evaluation on two PDAC datasets and show that the proposed alignment ensemble significantly outperforms previous state-of-the-art approaches, illustrating the strong potential for clinical use.

Yingda Xia, Qihang Yu, Wei Shen, Yuyin Zhou, Elliot K. Fishman, Alan L. Yuille
Biomechanics-Informed Neural Networks for Myocardial Motion Tracking in MRI

Image registration is an ill-posed inverse problem which often requires regularisation on the solution space. In contrast to most of the current approaches which impose explicit regularisation terms such as smoothness, in this paper we propose a novel method that can implicitly learn biomechanics-informed regularisation. Such an approach can incorporate application-specific prior knowledge into deep learning based registration. Particularly, the proposed biomechanics-informed regularisation leverages a variational autoencoder (VAE) to learn a manifold for biomechanically plausible deformations and to implicitly capture their underlying properties via reconstructing biomechanical simulations. The learnt VAE regulariser then can be coupled with any deep learning based registration network to regularise the solution space to be biomechanically plausible. The proposed method is validated in the context of myocardial motion tracking on 2D stacks of cardiac MRI data from two different datasets. The results show that it can achieve better performance against other competing methods in terms of motion tracking accuracy and has the ability to learn biomechanical properties such as incompressibility and strains. The method has also been shown to have better generalisability to unseen domains compared with commonly used L2 regularisation schemes.

Chen Qin, Shuo Wang, Chen Chen, Huaqi Qiu, Wenjia Bai, Daniel Rueckert
Fluid Registration Between Lung CT and Stationary Chest Tomosynthesis Images

Registration is widely used in image-guided therapy and image-guided surgery to estimate spatial correspondences between organs of interest between planning and treatment images. However, while high-quality computed tomography (CT) images are often available at planning time, limited angle acquisitions are frequently used during treatment because of radiation concerns or imaging time constraints. This requires algorithms to register CT images based on limited angle acquisitions. We, therefore, formulate a 3D/2D registration approach which infers a 3D deformation based on measured projections and digitally reconstructed radiographs of the CT. Most 3D/2D registration approaches use simple transformation models or require complex mathematical derivations to formulate the underlying optimization problem. Instead, our approach entirely relies on differentiable operations which can be combined with modern computational toolboxes supporting automatic differentiation. This then allows for rapid prototyping, integration with deep neural networks, and to support a variety of transformation models including fluid flow models. We demonstrate our approach for the registration between CT and stationary chest tomosynthesis (sDCT) images and show how it naturally leads to an iterative image reconstruction approach.

Lin Tian, Connor Puett, Peirong Liu, Zhengyang Shen, Stephen R. Aylward, Yueh Z. Lee, Marc Niethammer
Anatomical Data Augmentation via Fluid-Based Image Registration

We introduce a fluid-based image augmentation method for medical image analysis. In contrast to existing methods, our framework generates anatomically meaningful images via interpolation from the geodesic subspace underlying given samples. Our approach consists of three steps: 1) given a source image and a set of target images, we construct a geodesic subspace using the Large Deformation Diffeomorphic Metric Mapping (LDDMM) model; 2) we sample transformations from the resulting geodesic subspace; 3) we obtain deformed images and segmentations via interpolation. Experiments on brain (LPBA) and knee (OAI) data illustrate the performance of our approach on two tasks: 1) data augmentation during training and testing for image segmentation; 2) one-shot learning for single atlas image segmentation. We demonstrate that our approach generates anatomically meaningful data and improves performance on these tasks over competing approaches. Code is available at https://github.com/uncbiag/easyreg .

Zhengyang Shen, Zhenlin Xu, Sahin Olut, Marc Niethammer
Generalizing Spatial Transformers to Projective Geometry with Applications to 2D/3D Registration

Differentiable rendering is a technique to connect 3D scenes with corresponding 2D images. Since it is differentiable, processes during image formation can be learned. Previous approaches to differentiable rendering focus on mesh-based representations of 3D scenes, which is inappropriate for medical applications where volumetric, voxelized models are used to represent anatomy. We propose a novel Projective Spatial Transformer module that generalizes spatial transformers to projective geometry, thus enabling differentiable volume rendering. We demonstrate the usefulness of this architecture on the example of 2D/3D registration between radiographs and CT scans. Specifically, we show that our transformer enables end-to-end learning of an image processing and projection model that approximates an image similarity function that is convex with respect to the pose parameters, and can thus be optimized effectively using conventional gradient descent. To the best of our knowledge, we are the first to describe the spatial transformers in the context of projective transmission imaging, including rendering and pose estimation. We hope that our developments will benefit related 3D research applications. The source code is available at https://github.com/gaocong13/Projective-Spatial-Transformers .

Cong Gao, Xingtong Liu, Wenhao Gu, Benjamin Killeen, Mehran Armand, Russell Taylor, Mathias Unberath

Instrumentation and Surgical Phase Detection

Frontmatter
TeCNO: Surgical Phase Recognition with Multi-stage Temporal Convolutional Networks

Automatic surgical phase recognition is a challenging and crucial task with the potential to improve patient safety and become an integral part of intra-operative decision-support systems. In this paper, we propose, for the first time in workflow analysis, a Multi-Stage Temporal Convolutional Network (MS-TCN) that performs hierarchical prediction refinement for surgical phase recognition. Causal, dilated convolutions allow for a large receptive field and online inference with smooth predictions even during ambiguous transitions. Our method is thoroughly evaluated on two datasets of laparoscopic cholecystectomy videos with and without the use of additional surgical tool information. Outperforming various state-of-the-art LSTM approaches, we verify the suitability of the proposed causal MS-TCN for surgical phase recognition.

Tobias Czempiel, Magdalini Paschali, Matthias Keicher, Walter Simson, Hubertus Feussner, Seong Tae Kim, Nassir Navab
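
Causal, dilated temporal convolutions, as used above for online phase recognition, can be built by left-padding the feature sequence so that each output frame only depends on past frames. The minimal PyTorch layer below illustrates that mechanism with arbitrary channel sizes; it is not the authors' full multi-stage network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedConv1d(nn.Module):
    """1D convolution over time that only looks at current and past frames."""
    def __init__(self, in_channels, out_channels, kernel_size=3, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size, dilation=dilation)

    def forward(self, x):                      # x: (batch, channels, frames)
        x = F.pad(x, (self.left_pad, 0))       # pad only on the left (the past)
        return self.conv(x)

# toy clip of 2048-dim frame features over 100 frames, mapped to per-frame phase logits
features = torch.randn(1, 2048, 100)
layer = CausalDilatedConv1d(2048, 7, kernel_size=3, dilation=4)
print(layer(features).shape)  # torch.Size([1, 7, 100])
```
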
Surgical Video Motion Magnification with Suppression of Instrument Artefacts

Video motion magnification can make blood vessels in surgical video more apparent by exaggerating their pulsatile motion and could prevent inadvertent damage and bleeding due to their increased prominence. It could also indicate the success of restricting blood supply to an organ when using a vessel clamp. However, direct application to surgical video can produce aberration artefacts, caused by the method's sensitivity to residual motion from the surgical instruments, which would impede its practical usage in the operating theatre. By storing the previously obtained jerk filter response of each spatial component of each image frame - both prior to surgical instrument introduction and adhering to a Eulerian frame of reference - it is possible to prevent such aberrations from occurring. The comparison of the current readings to the prior readings of a single cardiac cycle, at the corresponding cycle point, is used to determine whether motion magnification should be active for each spatial component of the surgical video at that given point in time. In this paper, we demonstrate this technique and incorporate a scaling variable to loosen the effect, which accounts for variabilities and misalignments in the temporal domain. We present promising results on endoscopic transnasal transsphenoidal pituitary surgery with a quantitative comparison to recent methods using Structural Similarity (SSIM), as well as qualitative analysis by comparing spatio-temporal cross sections of the videos and individual frames.

Mirek Janatka, Hani J. Marcus, Neil L. Dorward, Danail Stoyanov
Recognition of Instrument-Tissue Interactions in Endoscopic Videos via Action Triplets

Recognition of surgical activity is an essential component to develop context-aware decision support for the operating room. In this work, we tackle the recognition of fine-grained activities, modeled as action triplets ⟨instrument, verb, target⟩ representing the tool activity. To this end, we introduce a new laparoscopic dataset, CholecT40, consisting of 40 videos from the public dataset Cholec80 in which all frames have been annotated using 128 triplet classes. Furthermore, we present an approach to recognize these triplets directly from the video data. It relies on a module called class activation guide, which uses the instrument activation maps to guide the verb and target recognition. To model the recognition of multiple triplets in the same frame, we also propose a trainable 3D interaction space, which captures the associations between the triplet components. Finally, we demonstrate the significance of these contributions via several ablation studies and comparisons to baselines on CholecT40.

Chinedu Innocent Nwoye, Cristians Gonzalez, Tong Yu, Pietro Mascagni, Didier Mutter, Jacques Marescaux, Nicolas Padoy
AutoSNAP: Automatically Learning Neural Architectures for Instrument Pose Estimation

Despite recent successes, the advances in Deep Learning have not yet been fully translated to Computer Assisted Intervention (CAI) problems such as pose estimation of surgical instruments. Currently, neural architectures for classification and segmentation tasks are adopted ignoring significant discrepancies between CAI and these tasks. We propose an automatic framework (AutoSNAP) for instrument pose estimation problems, which discovers and learns architectures for neural networks. We introduce 1) an efficient testing environment for pose estimation, 2) a powerful architecture representation based on novel Symbolic Neural Architecture Patterns (SNAPs), and 3) an optimization of the architecture using an efficient search scheme. Using AutoSNAP, we discover an improved architecture (SNAPNet) which outperforms both the hand-engineered i3PosNet and the state-of-the-art architecture search method DARTS.

David Kügler, Marc Uecker, Arjan Kuijper, Anirban Mukhopadhyay
Automatic Operating Room Surgical Activity Recognition for Robot-Assisted Surgery

Automatic recognition of surgical activities in the operating room (OR) is a key technology for creating next generation intelligent surgical devices and workflow monitoring/support systems. Such systems can potentially enhance efficiency in the OR, resulting in lower costs and improved care delivery to the patients. In this paper, we investigate automatic surgical activity recognition in robot-assisted operations. We collect the first large-scale dataset including 400 full-length multi-perspective videos from a variety of robotic surgery cases captured using Time-of-Flight cameras. We densely annotate the videos with 10 most recognized and clinically relevant classes of activities. Furthermore, we investigate state-of-the-art computer vision action recognition techniques and adapt them for the OR environment and the dataset. First, we fine-tune the Inflated 3D ConvNet (I3D) for clip-level activity recognition on our dataset and use it to extract features from the videos. These features are then fed to a stack of 3 Temporal Gaussian Mixture layers which extracts context from neighboring clips, and eventually go through a Long Short Term Memory network to learn the order of activities in full-length videos. We extensively assess the model and reach a peak performance of ~88% mean Average Precision.

Aidean Sharghi, Helene Haugerud, Daniel Oh, Omid Mohareri

Navigation and Visualization

Frontmatter
Can a Hand-Held Navigation Device Reduce Cognitive Load? A User-Centered Approach Evaluated by 18 Surgeons

During spinal fusion surgery, the orientation of the pedicle screw at the correct angle plays a crucial role in the outcome of the operation. Local separation of navigation information and the surgical situs, in combination with intricate visualizations, can limit the benefits of surgical navigation systems. The present study addresses these problems by proposing a hand-held navigation device (HND) for pedicle screw placement. The in-situ visualization of graphically reduced interfaces, and the simple integration of the device into the surgical workflow, allow the surgeon to position the tool while keeping sight of the anatomical target. 18 surgeons participated in a study comparing the HND to the state-of-the-art visualization on an external screen. Our approach revealed significant improvements in mental demand and overall cognitive load, measured using NASA-TLX (p < 0.05). Moreover, surgical time (one-way ANOVA, p < 0.001) and system usability (Kruskal-Wallis test, p < 0.05) were significantly improved.

Caroline Brendle, Laura Schütz, Javier Esteban, Sandro M. Krieg, Ulrich Eck, Nassir Navab
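
The statistical tests reported above (one-way ANOVA for surgical time, Kruskal-Wallis for usability) correspond to standard SciPy routines. The sketch below runs both on synthetic per-condition samples purely to show the mechanics; the numbers are made up.

```python
import numpy as np
from scipy.stats import f_oneway, kruskal

rng = np.random.default_rng(7)

# synthetic surgical times (seconds) and usability scores for two conditions
time_hnd = rng.normal(95, 10, size=18)       # hand-held navigation device
time_screen = rng.normal(110, 10, size=18)   # external screen
sus_hnd = rng.normal(82, 8, size=18)
sus_screen = rng.normal(70, 8, size=18)

f_stat, p_time = f_oneway(time_hnd, time_screen)
h_stat, p_sus = kruskal(sus_hnd, sus_screen)
print(f"one-way ANOVA on surgical time: p = {p_time:.4f}")
print(f"Kruskal-Wallis on usability: p = {p_sus:.4f}")
```
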
Symmetric Dilated Convolution for Surgical Gesture Recognition

Automatic surgical gesture recognition is a prerequisite of intra-operative computer assistance and objective surgical skill assessment. Prior works either require additional sensors to collect kinematics data or have limitations in capturing temporal information from long and untrimmed surgical videos. To tackle these challenges, we propose a novel temporal convolutional architecture to automatically detect and segment surgical gestures with corresponding boundaries using only RGB videos. We devise our method with a symmetric dilation structure bridged by a self-attention module to encode and decode the long-term temporal patterns and establish the frame-to-frame relationship accordingly. We validate the effectiveness of our approach on a fundamental robotic suturing task from the JIGSAWS dataset. The experimental results demonstrate the ability of our method to capture long-term frame dependencies, largely outperforming state-of-the-art methods by up to ~6 points in frame-wise accuracy and ~6 points in F1@50 score.

Jinglu Zhang, Yinyu Nie, Yao Lyu, Hailin Li, Jian Chang, Xiaosong Yang, Jian Jun Zhang
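
The symmetric dilation structure with a self-attention bridge could be sketched roughly as follows; channel sizes, dilation rates and the residual wiring are guesses for illustration only.

```python
# Hedged sketch of a symmetric temporal dilation encoder/decoder bridged by
# self-attention, operating on per-frame RGB features.
import torch
import torch.nn as nn

class SymmetricDilatedTCN(nn.Module):
    def __init__(self, feat_dim=2048, hidden=128, num_gestures=10):
        super().__init__()
        dils = [1, 2, 4, 8]
        self.inp = nn.Conv1d(feat_dim, hidden, 1)
        self.encoder = nn.ModuleList(
            [nn.Conv1d(hidden, hidden, 3, padding=d, dilation=d) for d in dils])
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.decoder = nn.ModuleList(
            [nn.Conv1d(hidden, hidden, 3, padding=d, dilation=d) for d in reversed(dils)])
        self.head = nn.Conv1d(hidden, num_gestures, 1)

    def forward(self, x):                     # (batch, time, feat_dim)
        h = self.inp(x.transpose(1, 2))
        for conv in self.encoder:
            h = torch.relu(conv(h)) + h       # growing dilations: long-term context
        a, _ = self.attn(h.transpose(1, 2), h.transpose(1, 2), h.transpose(1, 2))
        h = a.transpose(1, 2)                 # bridge: frame-to-frame relations
        for conv in self.decoder:
            h = torch.relu(conv(h)) + h       # mirrored dilations for decoding
        return self.head(h).transpose(1, 2)   # per-frame gesture logits

out = SymmetricDilatedTCN()(torch.randn(1, 300, 2048))   # (1, 300, 10)
```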
Deep Selection: A Fully Supervised Camera Selection Network for Surgery Recordings

Recording surgery in operating rooms is an essential task for education and evaluation of medical treatment. However, recording the desired targets, such as the surgery field, surgical tools, or doctor's hands, is difficult because the targets are heavily occluded during surgery. We use a recording system in which multiple cameras are embedded in the surgical lamp, and we assume that at least one camera is recording the target without occlusion at any given time. As the embedded cameras obtain multiple video sequences, we address the task of selecting the camera with the best view of the surgery. Unlike the conventional method, which selects the camera based on the area size of the surgery field, we propose a deep neural network that predicts the camera selection probability from multiple video sequences by learning from expert annotations. We created a dataset in which six different types of plastic surgery are recorded, and we provided annotations of camera switching. Our experiments show that our approach successfully switched between cameras and outperformed three baseline methods.

Ryo Hachiuma, Tomohiro Shimizu, Hideo Saito, Hiroki Kajita, Yoshifumi Takatsume
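
One plausible way to realise a camera-selection network of the kind described above; the tiny backbone and camera count are placeholders, not the authors' design.

```python
# Sketch only: score each camera's current frame and pick the best view by a
# softmax over cameras, supervised by expert switching annotations.
import torch
import torch.nn as nn

class CameraSelector(nn.Module):
    def __init__(self, num_cameras=5):
        super().__init__()
        self.backbone = nn.Sequential(             # placeholder image encoder
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.score = nn.Linear(32, 1)

    def forward(self, frames):                     # (batch, cams, 3, H, W)
        b, c = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1))
        scores = self.score(feats).view(b, c)
        return scores.softmax(dim=1)               # selection probability per camera

probs = CameraSelector()(torch.randn(2, 5, 3, 128, 128))   # (2, 5)
```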
Interacting with Medical Volume Data in Projective Augmented Reality

Medical volume data is usually explored on monoscopic monitors. Displaying this data in three-dimensional space facilitates the development of mental maps and the identification of anatomical structures and their spatial relations. Using augmented reality (AR) may further enhance these effects by spatially aligning the volume data with the patient. However, conventional interaction methods, e.g. mouse and keyboard, may not be applicable in this environment. Appropriate interaction techniques are needed to naturally and intuitively manipulate the image data. To this end, a user study comparing four gestural interaction techniques with respect to both clipping and windowing tasks was conducted. Image data was directly displayed on a phantom using stereoscopic projective AR and direct volume visualization. Participants were able to complete both tasks with all interaction techniques with respectively similar clipping accuracy and windowing efficiency. However, results suggest advantages of gestures based on motion-sensitive devices in terms of reduced task completion time and less subjective workload. This work presents an important first step towards a surgical AR visualization system enabling intuitive exploration of volume data. Yet, more research is required to assess the interaction techniques’ applicability for intraoperative use.

Florian Heinrich, Kai Bornemann, Kai Lawonn, Christian Hansen
VR Simulation of Novel Hands-Free Interaction Concepts for Surgical Robotic Visualization Systems

In microsurgery, visualization systems such as the traditional surgical microscope are essential, as surgeons rely on the highly magnified stereoscopic view for performing their operative tasks. For well-aligned visual perspectives onto the operating field during surgery, precise adjustments of the positioning of the system are frequently required. This, however, implies that the surgeon has to reach for the device and each time remove their hand(s) from the operating field, a disruptive event to the operative task at hand. To address this, we propose two novel hands-free interaction concepts based on head- and gaze-tracking that should allow surgeons to efficiently control the 6D positioning of a robotic visualization system with few interruptions to the main operative task. The new concepts were simulated for a robotic visualization system purely in a virtual reality (VR) environment using an HTC Vive. The two interaction concepts were evaluated within the virtual reality simulation in a quantitative user study with 11 neurosurgeons at the Charité hospital and compared to conventional interaction with a surgical microscope. After a brief introduction to the interaction concepts in the virtual reality simulation, neurosurgeons were 29% faster at reaching a set of virtual targets (position and orientation) in simulation compared to reaching equivalent physical targets on a 3D-printed reference object.

Fang You, Rutvik Khakhar, Thomas Picht, David Dobbelstein
Spatially-Aware Displays for Computer Assisted Interventions

We present a novel display and visual interaction paradigm which aims at reducing the complexity of understanding the spatial transformations between the surgeon's viewpoint, the patient, the pre- or intra-operative 2D and 3D data, and surgical tools during computer assisted interventions. To the best of our knowledge, this is the first work in which the traditional interventional display, for example in surgical navigation systems, is registered both to the patient and to the surgeon's view. The closest concept is that of traditional Augmented Reality windows, in which a semitransparent or video see-through display is positioned between the surgeon and the patient; in such cases, the system provides an AR view into the patient. In the new concept introduced here, the surgeon keeps his or her own direct view of the patient without any additional display or direct view augmentation, but the monitor used in the operating room is now registered to the patient and to the surgeon's viewpoint. The display can act as a fixed viewing frustum or as a mirror frustum relative to the surgeon's view. This allows physicians to effortlessly relate their view of tools and patient to the virtual representation of the patient data. In this paper, the first realization and implementation of such a concept is presented; three clinical partners have tested the system, and their first feedback is discussed in detail. They unanimously believe that this concept opens the path for facilitating interactive exploration of data and more intuitive navigation guidance in computer assisted interventions.

Alexander Winkler, Ulrich Eck, Nassir Navab

Ultrasound Imaging

Frontmatter
Sensorless Freehand 3D Ultrasound Reconstruction via Deep Contextual Learning

Transrectal ultrasound (US) is the most commonly used imaging modality to guide prostate biopsy, and its 3D volume provides even richer context information. Current methods for 3D volume reconstruction from freehand US scans require external tracking devices to provide spatial position for every frame. In this paper, we propose a deep contextual learning network (DCL-Net), which can efficiently exploit the image feature relationship between US frames and reconstruct 3D US volumes without any tracking device. The proposed DCL-Net utilizes 3D convolutions over a US video segment for feature extraction. An embedded self-attention module makes the network focus on the speckle-rich areas for better spatial movement prediction. We also propose a novel case-wise correlation loss to stabilize the training process for improved accuracy. Experiments with ablation studies demonstrate that the proposed method outperforms other state-of-the-art methods. Source code of this work is publicly available at https://github.com/DIAL-RPI/FreehandUSRecon .

Hengtao Guo, Sheng Xu, Bradford Wood, Pingkun Yan
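
A possible reading of the case-wise correlation loss, sketched under the assumption that it penalises decorrelation between predicted and ground-truth inter-frame motion parameters across one scan; the released code may differ.

```python
# Sketch: encourage predicted motion parameters to co-vary with ground truth
# across all frames of a single case (one term per motion parameter).
import torch

def case_correlation_loss(pred, gt):
    """pred, gt: (frames, 6) rigid motion parameters for one case."""
    pred_c = pred - pred.mean(dim=0, keepdim=True)
    gt_c = gt - gt.mean(dim=0, keepdim=True)
    corr = (pred_c * gt_c).sum(dim=0) / (
        pred_c.norm(dim=0) * gt_c.norm(dim=0) + 1e-8)
    return (1.0 - corr).mean()                # 0 when perfectly correlated

loss = case_correlation_loss(torch.randn(50, 6), torch.randn(50, 6))
```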
Ultra2Speech - A Deep Learning Framework for Formant Frequency Estimation and Tracking from Ultrasound Tongue Images

Thousands of individuals need surgical removal of their larynx due to critical diseases every year and therefore require an alternative form of communication to articulate speech sounds after the loss of their voice box. This work addresses the articulatory-to-acoustic mapping problem based on ultrasound (US) tongue images for the development of a silent-speech interface (SSI) that can provide them with assistance in their daily interactions. Our approach targets automatically extracting tongue movement information by selecting an optimal feature set from US images and mapping these features to the acoustic space. We use a novel deep learning architecture, which we call the Ultrasound2Formant (U2F) Net, to map US tongue images from a probe placed beneath the subject's chin to formants. It uses hybrid spatio-temporal 3D convolutions followed by feature shuffling for the estimation and tracking of vowel formants from US images. The formant values are then utilized to synthesize continuous time-varying vowel trajectories via a Klatt synthesizer. Our best model achieves an R-squared (R²) measure of 99.96% for the regression task. Our network lays the foundation for an SSI as it successfully tracks the tongue contour automatically as an internal representation without any explicit annotation.

Pramit Saha, Yadong Liu, Bryan Gick, Sidney Fels
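
The reported R-squared measure for formant regression follows the standard definition; the formant values below are made up for illustration.

```python
# Standard R-squared for a regression task (here, predicted vowel formants).
import numpy as np

def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

f1_true = np.array([700.0, 710.0, 650.0, 500.0])   # F1 formants in Hz (illustrative)
f1_pred = np.array([698.0, 712.0, 648.0, 503.0])
print(r_squared(f1_true, f1_pred))                  # close to 1 for a good fit
```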
Ultrasound Video Summarization Using Deep Reinforcement Learning

Video is an essential imaging modality for diagnostics, e.g. in ultrasound imaging, for endoscopy, or movement assessment. However, video has received relatively little attention in the medical image analysis community. In clinical practice, it is challenging to utilise raw diagnostic video data efficiently, as video takes a long time to process, annotate or audit. In this paper we introduce a novel, fully automatic video summarization method that is tailored to the needs of medical video data. Our approach is framed as a reinforcement learning problem and produces agents focusing on the preservation of important diagnostic information. We evaluate our method on videos from fetal ultrasound screening, where commonly only a small amount of the recorded data is used diagnostically. We show that our method is superior to alternative video summarization methods and that it preserves essential information required by clinical diagnostic standards.

Tianrui Liu, Qingjie Meng, Athanasios Vlontzos, Jeremy Tan, Daniel Rueckert, Bernhard Kainz
Predicting Obstructive Hydronephrosis Based on Ultrasound Alone

Prenatal hydronephrosis (HN) makes up nearly 30% of pediatric Urology Department visits, yet remains challenging to prognosticate without repeated ultrasounds and invasive clinical tests. We build a deep learning model, which uses still images from kidney ultrasound as input and predicts whether HN is due to an obstruction that will receive surgical intervention. We compare our custom convolutional neural network performance against other existing state-of-the-art models. Our best model predicts obstruction with an AUC of 0.93 and an AUPRC of 0.75 in a prospective test set of 89 patients (286 repeated kidney ultrasounds). We show that while maintaining a 5% false negative rate, our classifier identifies 58% of those who will have surgery due to obstruction yet received a functional renogram, indicating that this model could feasibly reduce the amount of testing done in more than half of non-surgical cases. This work demonstrates the ability of deep learning to predict obstructive HN with clinically relevant accuracy based on kidney ultrasound alone, without requiring other clinical variables as input. This algorithm has the potential to change clinical practice by stratifying HN patient risk, reducing repeated follow ups and invasive testing for less severe cases, and bringing more consistency to clinical management.

Lauren Erdman, Marta Skreta, Mandy Rickard, Carson McLean, Aziz Mezlini, Daniel T. Keefe, Anne-Sophie Blais, Michael Brudno, Armando Lorenzo, Anna Goldenberg
Semi-supervised Training of Optical Flow Convolutional Neural Networks in Ultrasound Elastography

Convolutional Neural Networks (CNNs) have been found to have great potential in optical flow problems thanks to an abundance of data available for training a deep network. The displacement estimation step in UltraSound Elastography (USE) can be viewed as an optical flow problem. Despite the high performance of CNNs in optical flow, they have rarely been used for USE due to unique challenges that both the input and output of USE networks impose. Ultrasound data has much more high-frequency content than natural images. The outputs are also drastically different: displacement values in USE are often smooth, without sharp motions or discontinuities. The general trend is currently to use pre-trained networks and fine-tune them on a small simulated ultrasound database. However, realistic ultrasound simulation is computationally expensive. Also, the simulation techniques do not model complex motions, nonlinear and frequency-dependent acoustics, and many sources of artifact in ultrasound imaging. Herein, we propose an unsupervised fine-tuning technique which enables us to employ a large unlabeled dataset for fine-tuning of a CNN optical flow network. We show that the proposed unsupervised fine-tuning method substantially improves the performance of the network and reduces the artifacts generated by networks trained on computer vision databases.

Ali K. Z. Tehrani, Morteza Mirzaei, Hassan Rivaz
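
A hedged sketch of an unsupervised fine-tuning objective of the kind described: warp the second RF frame with the predicted displacement and penalise the intensity difference plus a smoothness term. The warping details and weights are assumptions, not the paper's exact loss.

```python
# Sketch of an unsupervised photometric + smoothness objective for fine-tuning
# an optical flow CNN on unlabeled ultrasound frame pairs.
import torch
import torch.nn.functional as F

def warp(img, flow):
    """img: (B,1,H,W); flow: (B,2,H,W) displacements in pixels (x, y)."""
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(img.device)   # (2, H, W)
    coords = grid.unsqueeze(0) + flow
    coords_x = 2 * coords[:, 0] / (w - 1) - 1                    # normalize to [-1, 1]
    coords_y = 2 * coords[:, 1] / (h - 1) - 1
    return F.grid_sample(img, torch.stack((coords_x, coords_y), dim=-1),
                         align_corners=True)

def unsupervised_loss(pre, post, flow, smooth_weight=0.1):
    photometric = (warp(post, flow) - pre).abs().mean()          # data term
    dx = (flow[:, :, :, 1:] - flow[:, :, :, :-1]).abs().mean()   # smoothness
    dy = (flow[:, :, 1:, :] - flow[:, :, :-1, :]).abs().mean()
    return photometric + smooth_weight * (dx + dy)
```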
Three-Dimensional Thyroid Assessment from Untracked 2D Ultrasound Clips

The diagnostic quantification of the thyroid gland, mostly based on its volume, is commonly done by ultrasound. Typically, three orthogonal length measurements on 2D images are used to estimate the thyroid volume from an ellipsoid approximation, which may vary substantially from its true shape. In this work, we propose a more accurate direct volume determination using 3D reconstructions from two freehand clips in transverse and sagittal directions. A deep learning based trajectory estimation on individual clips is followed by an image-based 3D model optimization of the overlapping transverse and sagittal image data. The image data and automatic thyroid segmentation are then reconstructed and compared in 3D space. The algorithm is tested on 200 pairs of sweeps and shows that it can provide fully automated, more accurate and more consistent volume estimations than the standard ellipsoid method, with a median volume error of 11%.

Wolfgang Wein, Mattia Lupetti, Oliver Zettinig, Simon Jagoda, Mehrdad Salehi, Viktoria Markova, Dornoosh Zonoobi, Raphael Prevost
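
For context, the conventional ellipsoid estimate and a direct segmentation-based volume can be compared as below; the voxel size and measurements are illustrative only.

```python
# The standard clinical estimate approximates each lobe as an ellipsoid,
# V = pi/6 * L * W * D, whereas a 3D reconstruction allows measuring the
# volume directly from the segmentation.
import numpy as np

def ellipsoid_volume_ml(length_cm, width_cm, depth_cm):
    # classic approximation, roughly 0.524 * L * W * D
    return np.pi / 6.0 * length_cm * width_cm * depth_cm

def segmentation_volume_ml(mask, voxel_size_mm=(0.5, 0.5, 0.5)):
    # direct volume from a 3D binary segmentation of the reconstructed sweep
    voxel_ml = np.prod(voxel_size_mm) / 1000.0   # mm^3 -> ml
    return mask.sum() * voxel_ml

print(ellipsoid_volume_ml(5.0, 2.0, 2.0))        # ~10.5 ml from three lengths
```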
Complex Cancer Detector: Complex Neural Networks on Non-stationary Time Series for Guiding Systematic Prostate Biopsy

Ultrasound is a common imaging modality used for targeting suspected cancerous tissue in prostate biopsy. Since ultrasound images have very low specificity and sensitivity for visualizing cancer foci, a significant body of literature has aimed to develop ultrasound tissue characterization solutions to alleviate this issue. Major challenges are the substantial heterogeneity in the data and the noisy, limited number of labels available from pathology of biopsy samples. A recently proposed tissue characterization method uses spectral analysis of time series of ultrasound data taken during the biopsy procedure combined with deep networks. However, the real-valued transformations in these networks neglect the phase information of the signal. In this paper, we study the importance of phase information and compare different ways of extracting reliable features, including complex-valued neural networks. These networks can help with analyzing the phase information to use the full capacity of the data. Our results show that the phase content can stabilize training, especially with non-stationary time series. The proposed approach is generic and can be applied to several other scenarios where the phase information is important and noisy labels are present.

Golara Javadi, Minh Nguyen Nhat To, Samareh Samadi, Sharareh Bayat, Samira Sojoudi, Antonio Hurtado, Silvia Chang, Peter Black, Parvin Mousavi, Purang Abolmaesumi
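
An illustrative complex-valued 1D convolution, implemented with two real convolutions so that the phase of the RF time series is carried explicitly; this is a generic building block, not the authors' exact layer.

```python
# Complex convolution via two real convolutions:
# (a + ib) * (W_r + iW_i) = (a W_r - b W_i) + i(a W_i + b W_r)
import torch
import torch.nn as nn

class ComplexConv1d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size, **kw):
        super().__init__()
        self.conv_r = nn.Conv1d(in_ch, out_ch, kernel_size, **kw)
        self.conv_i = nn.Conv1d(in_ch, out_ch, kernel_size, **kw)

    def forward(self, x_r, x_i):
        out_r = self.conv_r(x_r) - self.conv_i(x_i)
        out_i = self.conv_i(x_r) + self.conv_r(x_i)
        return out_r, out_i

x_r, x_i = torch.randn(4, 1, 100), torch.randn(4, 1, 100)   # I/Q time series
y_r, y_i = ComplexConv1d(1, 8, kernel_size=5, padding=2)(x_r, x_i)
```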
Self-Supervised Contrastive Video-Speech Representation Learning for Ultrasound

In medical imaging, manual annotations can be expensive to acquire and sometimes infeasible to access, making conventional deep learning-based models difficult to scale. As a result, it would be beneficial if useful representations could be derived from raw data without the need for manual annotations. In this paper, we propose to address the problem of self-supervised representation learning with multi-modal ultrasound video-speech raw data. For this case, we assume that there is a high correlation between the ultrasound video and the corresponding narrative speech audio of the sonographer. In order to learn meaningful representations, the model needs to identify such correlation and at the same time understand the underlying anatomical features. We designed a framework to model the correspondence between video and audio without any kind of human annotations. Within this framework, we introduce cross-modal contrastive learning and an affinity-aware self-paced learning scheme to enhance correlation modelling. Experimental evaluations on multi-modal fetal ultrasound video and audio show that the proposed approach is able to learn strong representations and transfers well to downstream tasks of standard plane detection and eye-gaze prediction.

Jianbo Jiao, Yifan Cai, Mohammad Alsharid, Lior Drukker, Aris T. Papageorghiou, J. Alison Noble
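
A sketch of a cross-modal contrastive (InfoNCE-style) objective in the spirit of the paper: matched video/speech clips are pulled together and mismatched pairs in the batch are pushed apart. The temperature and embedding sizes are assumptions, and the affinity-aware self-paced scheme is omitted.

```python
# Symmetric cross-modal contrastive loss over a batch of paired embeddings.
import torch
import torch.nn.functional as F

def cross_modal_nce(video_emb, audio_emb, temperature=0.07):
    v = F.normalize(video_emb, dim=1)           # (batch, dim)
    a = F.normalize(audio_emb, dim=1)
    logits = v @ a.t() / temperature            # similarity of every pair
    targets = torch.arange(v.size(0))           # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = cross_modal_nce(torch.randn(16, 128), torch.randn(16, 128))
```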
Assisted Probe Positioning for Ultrasound Guided Radiotherapy Using Image Sequence Classification

Effective transperineal ultrasound image guidance in prostate external beam radiotherapy requires consistent alignment between probe and prostate at each session during patient set-up. Probe placement and ultrasound image interpretation are manual tasks contingent upon operator skill, leading to interoperator uncertainties that degrade radiotherapy precision. We demonstrate a method for ensuring accurate probe placement through joint classification of images and probe position data. Using a multi-input multi-task algorithm, spatial coordinate data from an optically tracked ultrasound probe is combined with an image classifier using a recurrent neural network to generate two sets of predictions in real-time. The first set identifies relevant prostate anatomy visible in the field of view using the classes: outside prostate, prostate periphery, prostate centre. The second set recommends a probe angular adjustment to achieve alignment between the probe and prostate centre with the classes: move left, move right, stop. The algorithm was trained and tested on 9,743 clinical images from 61 treatment sessions across 32 patients. We evaluated classification accuracy against class labels derived from three experienced observers at 2/3 and 3/3 agreement thresholds. For images with unanimous consensus between observers, anatomical classification accuracy was 97.2% and probe adjustment accuracy was 94.9%. The algorithm identified optimal probe alignment within a mean (standard deviation) range of 3.7° (1.2°) from angle labels with full observer consensus, comparable to the 2.8° (2.6°) mean interobserver range. We propose such an algorithm could assist radiotherapy practitioners with limited experience of ultrasound image interpretation by providing effective real-time feedback during patient set-up.

Alex Grimwood, Helen McNair, Yipeng Hu, Ester Bonmati, Dean Barratt, Emma J. Harris
Searching Collaborative Agents for Multi-plane Localization in 3D Ultrasound

3D ultrasound (US) is widely used due to its rich diagnostic information, portability and low cost. Automated standard plane (SP) localization in US volumes not only improves efficiency and reduces user-dependence, but also boosts 3D US interpretation. In this study, we propose a novel Multi-Agent Reinforcement Learning (MARL) framework to localize multiple uterine SPs in 3D US simultaneously. Our contribution is two-fold. First, we equip the MARL with a one-shot neural architecture search (NAS) module to obtain the optimal agent for each plane. Specifically, Gradient-based search using Differentiable Architecture Sampler (GDAS) is employed to accelerate and stabilize the training process. Second, we propose a novel collaborative strategy to strengthen agents' communication. Our strategy uses a recurrent neural network (RNN) to learn the spatial relationship among SPs effectively. Extensively validated on a large dataset, our approach achieves accuracies of 7.05°/2.21 mm, 8.62°/2.36 mm and 5.93°/0.89 mm for the mid-sagittal, transverse and coronal plane localization, respectively. The proposed MARL framework can significantly increase plane localization accuracy and reduce computational cost and model size.

Yuhao Huang, Xin Yang, Rui Li, Jikuan Qian, Xiaoqiong Huang, Wenlong Shi, Haoran Dou, Chaoyu Chen, Yuanji Zhang, Huanjia Luo, Alejandro Frangi, Yi Xiong, Dong Ni
Contrastive Rendering for Ultrasound Image Segmentation

Ultrasound (US) image segmentation has improved significantly in the deep learning era. However, the lack of sharp boundaries in US images remains an inherent challenge for segmentation. Previous methods often resort to global context, multi-scale cues or auxiliary guidance to estimate the boundaries, but they struggle to achieve pixel-level learning for fine-grained boundary generation. In this paper, we propose a novel and effective framework to improve boundary estimation in US images. Our work has three highlights. First, we propose to formulate boundary estimation as a rendering task, which can recognize ambiguous points (pixels/voxels) and calibrate the boundary prediction via enriched feature representation learning. Second, we introduce point-wise contrastive learning to enhance the similarity of points from the same class and contrastively decrease the similarity of points from different classes; boundary ambiguities are therefore further addressed. Third, both the rendering and contrastive learning tasks contribute to consistent improvement while reducing network parameters. As a proof of concept, we performed validation experiments on a challenging dataset of 86 ovarian US volumes. Results show that our proposed method outperforms state-of-the-art methods and has the potential to be used in clinical practice.

Haoming Li, Xin Yang, Jiamin Liang, Wenlong Shi, Chaoyu Chen, Haoran Dou, Rui Li, Rui Gao, Guangquan Zhou, Jinghui Fang, Xiaowen Liang, Ruobing Huang, Alejandro Frangi, Zhiyi Chen, Dong Ni
An Unsupervised Approach to Ultrasound Elastography with End-to-end Strain Regularisation

Quasi-static ultrasound elastography (USE) is an imaging modality that consists of determining a measure of deformation (i.e. strain) of soft tissue in response to an applied mechanical force. The strain is generally determined by estimating the displacement between successive ultrasound frames acquired before and after applying manual compression. The computational efficiency and accuracy of the displacement prediction, also known as time-delay estimation, are key challenges for real-time USE applications. In this paper, we present a novel deep-learning method for efficient time-delay estimation between ultrasound radio-frequency (RF) data. The proposed method consists of a convolutional neural network (CNN) that predicts a displacement field between a pair of pre- and post-compression ultrasound RF frames. The network is trained in an unsupervised way by optimizing a similarity metric between the reference and compressed image. We also introduce a new regularization term that preserves displacement continuity by directly optimizing the strain smoothness. We validated the performance of our method using both ultrasound simulation and in vivo data from healthy volunteers, and compared it with a state-of-the-art method called OVERWIND [17]. The average contrast-to-noise ratio (CNR) and signal-to-noise ratio (SNR) of our method over 30 simulation and 3 in vivo image pairs are 7.70 and 6.95, and 7 and 0.31, respectively. Our results suggest that our approach can effectively predict accurate strain images. The unsupervised aspect of our approach represents great potential for applying deep learning to the analysis of clinical ultrasound data.

Rémi Delaunay, Yipeng Hu, Tom Vercauteren
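
One way the strain-smoothness regulariser could look, assuming the strain is taken as the axial gradient of the predicted displacement and its own gradient is penalised; this is a sketch, not the paper's exact formulation.

```python
# Unsupervised objective: image similarity after warping plus a penalty on the
# gradient of the strain (the axial derivative of the displacement field).
import torch

def strain_smoothness(axial_disp):
    """axial_disp: (B, 1, H, W) axial displacement field."""
    strain = axial_disp[:, :, 1:, :] - axial_disp[:, :, :-1, :]   # d(u)/dz
    strain_grad = strain[:, :, 1:, :] - strain[:, :, :-1, :]      # d(strain)/dz
    return strain_grad.abs().mean()

def use_loss(reference, warped_compressed, axial_disp, reg_weight=0.05):
    similarity = (reference - warped_compressed).pow(2).mean()    # data term
    return similarity + reg_weight * strain_smoothness(axial_disp)
```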
Automatic Probe Movement Guidance for Freehand Obstetric Ultrasound

We present the first system that provides real-time probe movement guidance for acquiring standard planes in routine freehand obstetric ultrasound scanning. Such a system can contribute to the worldwide deployment of obstetric ultrasound scanning by lowering the required level of operator expertise. The system employs an artificial neural network that receives the ultrasound video signal and the motion signal of an inertial measurement unit (IMU) that is attached to the probe, and predicts a guidance signal. The network termed US-GuideNet predicts either the movement towards the standard plane position (goal prediction), or the next movement that an expert sonographer would perform (action prediction). While existing models for other ultrasound applications are trained with simulations or phantoms, we train our model with real-world ultrasound video and probe motion data from 464 routine clinical scans by 17 accredited sonographers. Evaluations for 3 standard plane types show that the model provides a useful guidance signal with an accuracy of 88.8% for goal prediction and 90.9% for action prediction.

Richard Droste, Lior Drukker, Aris T. Papageorghiou, J. Alison Noble

Video Image Analysis

Frontmatter
ISINet: An Instance-Based Approach for Surgical Instrument Segmentation

We study the task of semantic segmentation of surgical instruments in robotic-assisted surgery scenes. We propose the Instance-based Surgical Instrument Segmentation Network (ISINet), a method that addresses this task from an instance-based segmentation perspective. Our method includes a temporal consistency module that takes into account the previously overlooked and inherent temporal information of the problem. We validate our approach on the existing benchmark for the task, the Endoscopic Vision 2017 Robotic Instrument Segmentation Dataset [2], and on the 2018 version of the dataset [1], whose annotations we extended for the fine-grained version of instrument segmentation. Our results show that ISINet significantly outperforms state-of-the-art methods, with our baseline version doubling the Intersection over Union (IoU) of previous methods and our complete model tripling it.

Cristina González, Laura Bravo-Sánchez, Pablo Arbelaez
Reliable Liver Fibrosis Assessment from Ultrasound Using Global Hetero-Image Fusion and View-Specific Parameterization

Ultrasound (US) is a critical modality for diagnosing liver fibrosis. Unfortunately, assessment is very subjective, motivating automated approaches. We introduce a principled deep convolutional neural network (CNN) workflow that incorporates several innovations. First, to avoid overfitting on non-relevant image features, we force the network to focus on a clinical region of interest (ROI), encompassing the liver parenchyma and upper border. Second, we introduce global hetero-image fusion (GHIF), which allows the CNN to fuse features from any arbitrary number of images in a study, increasing its versatility and flexibility. Finally, we use “style”-based view-specific parameterization (VSP) to tailor the CNN processing for different viewpoints of the liver, while keeping the majority of parameters the same across views. Experiments on a dataset of 610 patient studies (6979 images) demonstrate that our pipeline can contribute roughly 7% and 22% improvements in partial area under the curve and recall at 90% precision, respectively, over conventional classifiers, validating our approach to this crucial problem.

Bowen Li, Ke Yan, Dar-In Tai, Yuankai Huo, Le Lu, Jing Xiao, Adam P. Harrison
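
A rough sketch of study-level fusion over an arbitrary number of images in the spirit of GHIF: per-image features are pooled across the image axis before classification. The tiny backbone and max-pooling choice are assumptions, not the paper's architecture.

```python
# Fuse features from any number of images in one study into a single
# order- and count-invariant representation, then classify the study.
import torch
import torch.nn as nn

class StudyLevelClassifier(nn.Module):
    def __init__(self, feat_dim=64, num_classes=2):
        super().__init__()
        self.backbone = nn.Sequential(             # placeholder per-image encoder
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, feat_dim))
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, study_images):                # (num_images, 1, H, W), any count
        feats = self.backbone(study_images)         # (num_images, feat_dim)
        fused = feats.max(dim=0).values             # pool across the image axis
        return self.head(fused)

logits = StudyLevelClassifier()(torch.randn(7, 1, 224, 224))   # a 7-image study
```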
Toward Rapid Stroke Diagnosis with Multimodal Deep Learning

Stroke is a challenging disease to diagnose in an emergency room (ER) setting. While an MRI scan is very useful in detecting ischemic stroke, it is usually not available due to space constraint and high cost in the ER. Clinical tests like the Cincinnati Pre-hospital Stroke Scale (CPSS) and the Face Arm Speech Test (FAST) are helpful tools used by neurologists, but there may not be neurologists immediately available to conduct the tests. We emulate CPSS and FAST and propose a novel multimodal deep learning framework to achieve computer-aided stroke presence assessment over facial motion weaknesses and speech inability for patients with suspicion of stroke showing facial paralysis and speech disorders in an acute setting. Experiments on our video dataset collected on actual ER patients performing specific speech tests show that the proposed approach achieves diagnostic performance comparable to that of ER doctors, attaining a 93.12% sensitivity rate while maintaining 79.27% accuracy. Meanwhile, each assessment can be completed in less than four minutes. This demonstrates the high clinical value of the framework. In addition, the work, when deployed on a smartphone, will enable self-assessment by at-risk patients at the time when stroke-like symptoms emerge.

Mingli Yu, Tongan Cai, Xiaolei Huang, Kelvin Wong, John Volpi, James Z. Wang, Stephen T. C. Wong
Learning and Reasoning with the Graph Structure Representation in Robotic Surgery

Learning to infer graph representations and performing spatial reasoning in a complex surgical environment can play a vital role in surgical scene understanding in robotic surgery. For this purpose, we develop an approach to generate the scene graph and predict surgical interactions between instruments and the surgical region of interest (ROI) during robot-assisted surgery. We design an attention link function and integrate it with a graph parsing network to recognize the surgical interactions. To embed each node with corresponding neighbouring node features, we further incorporate SageConv into the network. Scene graph generation and active edge classification depend largely on the embedding or feature extraction of node and edge features from complex image representations. Here, we empirically study these feature extraction methods and employ a label-smoothed weighted loss. Smoothing the hard labels avoids over-confident predictions and enhances the feature representation learned by the penultimate layer. To obtain scene graph labels, we annotate bounding boxes and instrument-ROI interactions on the 2018 robotic scene segmentation challenge dataset with an experienced clinical expert in robotic surgery, and employ these annotations to evaluate our propositions.

Mobarakol Islam, Lalithkumar Seenivasan, Lim Chwee Ming, Hongliang Ren
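
Label smoothing as referred to above follows the standard formulation: the hard one-hot target is mixed with a uniform distribution so the model is discouraged from over-confident predictions. A minimal example:

```python
# Standard label-smoothed cross entropy (one common variant).
import torch
import torch.nn.functional as F

def label_smoothing_loss(logits, target, eps=0.1):
    num_classes = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    smooth = torch.full_like(log_probs, eps / (num_classes - 1))
    smooth.scatter_(-1, target.unsqueeze(-1), 1.0 - eps)   # soft targets
    return -(smooth * log_probs).sum(dim=-1).mean()

loss = label_smoothing_loss(torch.randn(8, 12), torch.randint(0, 12, (8,)))
```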
Vision-Based Estimation of MDS-UPDRS Gait Scores for Assessing Parkinson’s Disease Motor Severity

Parkinson's disease (PD) is a progressive neurological disorder primarily affecting motor function resulting in tremor at rest, rigidity, bradykinesia, and postural instability. The physical severity of PD impairments can be quantified through the Movement Disorder Society Unified Parkinson's Disease Rating Scale (MDS-UPDRS), a widely used clinical rating scale. Accurate and quantitative assessment of disease progression is critical to developing a treatment that slows or stops further advancement of the disease. Prior work has mainly focused on dopamine transport neuroimaging for diagnosis or costly and intrusive wearables evaluating motor impairments. For the first time, we propose a computer vision-based model that observes non-intrusive video recordings of individuals, extracts their 3D body skeletons, tracks them through time, and classifies the movements according to the MDS-UPDRS gait scores. Experimental results show that our proposed method performs significantly better than chance and competing methods with an F1-score of 0.83 and a balanced accuracy of 81%. This is the first benchmark for classifying PD patients based on MDS-UPDRS gait severity and could be an objective biomarker for disease severity. Our work demonstrates how computer-assisted technologies can be used to non-intrusively monitor patients and their motor impairments. The code is available at https://github.com/mlu355/PD-Motor-Severity-Estimation .

Mandy Lu, Kathleen Poston, Adolf Pfefferbaum, Edith V. Sullivan, Li Fei-Fei, Kilian M. Pohl, Juan Carlos Niebles, Ehsan Adeli
Searching for Efficient Architecture for Instrument Segmentation in Robotic Surgery

Segmentation of surgical instruments is an important problem in robot-assisted surgery: it is a crucial step towards full instrument pose estimation and is directly used for masking of augmented reality overlays during surgical procedures. Most applications rely on accurate real-time segmentation of high-resolution surgical images. While previous research has focused primarily on methods that deliver highly accurate segmentation masks, the majority of these cannot be used for real-time applications due to their computational cost. In this work, we design a light-weight and highly efficient deep residual architecture which is tuned to perform real-time inference on high-resolution images. To account for the reduced accuracy of the discovered light-weight deep residual network without adding any additional computational burden, we perform a differentiable search over dilation rates for the residual units of our network. We test our discovered architecture on the EndoVis 2017 Robotic Instruments dataset and verify that our model is the state of the art in terms of the speed-accuracy tradeoff, with a speed of up to 125 FPS on high-resolution images.

Daniil Pakhomov, Nassir Navab
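
A differentiable search over dilation rates can be sketched as a softmax-weighted mixture of dilated convolutions inside a residual unit; the candidate rates and discretisation rule below are assumptions for illustration.

```python
# Each residual unit mixes candidate dilations with learnable architecture
# parameters; after search, only the strongest branch would be kept.
import torch
import torch.nn as nn

class SearchableDilatedUnit(nn.Module):
    def __init__(self, channels, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
             for d in dilations])
        self.alpha = nn.Parameter(torch.zeros(len(dilations)))   # arch params

    def forward(self, x):
        w = self.alpha.softmax(dim=0)
        out = sum(wi * branch(x) for wi, branch in zip(w, self.branches))
        return torch.relu(out) + x            # residual unit

y = SearchableDilatedUnit(32)(torch.randn(1, 32, 64, 64))
# After training, keep only the branch with the largest alpha for fast inference.
```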
Unsupervised Surgical Instrument Segmentation via Anchor Generation and Semantic Diffusion

Surgical instrument segmentation is a key component in developing context-aware operating rooms. Existing works on this task heavily rely on the supervision of a large amount of labeled data, which involve laborious and expensive human efforts. In contrast, a more affordable unsupervised approach is developed in this paper. To train our model, we first generate anchors as pseudo labels for instruments and background tissues respectively by fusing coarse handcrafted cues. Then a semantic diffusion loss is proposed to resolve the ambiguity in the generated anchors via the feature correlation between adjacent video frames. In the experiments on the binary instrument segmentation task of the 2017 MICCAI EndoVis Robotic Instrument Segmentation Challenge dataset, the proposed method achieves 0.71 IoU and 0.81 Dice score without using a single manual annotation, which is promising to show the potential of unsupervised learning for surgical tool segmentation.

Daochang Liu, Yuhui Wei, Tingting Jiang, Yizhou Wang, Rulin Miao, Fei Shan, Ziyu Li
Towards Accurate and Interpretable Surgical Skill Assessment: A Video-Based Method Incorporating Recognized Surgical Gestures and Skill Levels

Surgical skill assessment is becoming increasingly important for surgical training, given the explosive growth of automation technologies. Existing work on skill score prediction is limited and leaves room for improvement. The challenges lie in complicated surgical tasks and new subjects as trial performers. Moreover, previous work mostly provides local feedback involving each individual video frame or clip, which does not manifest human-interpretable semantics itself. To overcome these issues and facilitate more accurate and interpretable skill score prediction, we propose a novel video-based method incorporating recognized surgical gestures (segments) and skill levels (for both performers and gestures). Our method consists of two correlated multi-task learning frameworks. The main task of the first framework is to predict the final skill scores of surgical trials, and the auxiliary tasks are to recognize surgical gestures and to classify performers' skills into self-proclaimed skill levels. The second framework, which is based on gesture-level features accumulated up to the end of each previously identified gesture, incrementally generates running intermediate skill scores for feedback decoding. Experiments on the JIGSAWS dataset show that our first framework on C3D features pushes state-of-the-art prediction performance further to 0.83, 0.86 and 0.69 in Spearman's correlation for the three surgical tasks under the LOUO validation scheme; it even achieves 0.68 when generalizing across these tasks. For the second framework, additional gesture-level skill levels and captions are annotated by experts. The trend of predicted intermediate skill scores indicating problematic gestures is demonstrated as interpretable feedback, and this trend resembles the human scoring process.

Tianyu Wang, Yijie Wang, Mian Li
Learning Motion Flows for Semi-supervised Instrument Segmentation from Robotic Surgical Video

Labeling surgical videos only at sparse intervals can greatly relieve the annotation burden on surgeons. In this paper, we study semi-supervised instrument segmentation from robotic surgical videos with sparse annotations. Unlike most previous methods that use unlabeled frames individually, we propose a dual motion based method to learn motion flows for segmentation enhancement by leveraging temporal dynamics. We first design a flow predictor to derive the motion for jointly propagating frame-label pairs given the current labeled frame. Considering the fast instrument motion, we further introduce a flow compensator to estimate intermediate motion within consecutive frames, with a novel cycle learning strategy. By exploiting the generated data pairs, our framework can recover and even enhance the temporal consistency of training sequences to benefit segmentation. We validate our framework on the binary, part, and type tasks of the 2017 MICCAI EndoVis Robotic Instrument Segmentation Challenge dataset. Results show that our method outperforms state-of-the-art semi-supervised methods by a large margin, and even exceeds fully supervised training on two tasks (our code is available at https://github.com/zxzhaoeric/Semi-InstruSeg/ ).

Zixu Zhao, Yueming Jin, Xiaojie Gao, Qi Dou, Pheng-Ann Heng
Spectral-spatial Recurrent-Convolutional Networks for In-Vivo Hyperspectral Tumor Type Classification

Early detection of cancerous tissue is crucial for long-term patient survival. In the head and neck region, a typical diagnostic procedure is an endoscopic intervention where a medical expert manually assesses tissue using RGB camera images. While healthy and tumor regions are generally easier to distinguish, differentiating benign and malignant tumors is very challenging. This requires an invasive biopsy, followed by histological evaluation for diagnosis. Also, during tumor resection, tumor margins need to be verified by histological analysis. To avoid unnecessary tissue resection, a non-invasive, image-based diagnostic tool would be very valuable. Recently, hyperspectral imaging paired with deep learning has been proposed for this task, demonstrating promising results on ex-vivo specimens. In this work, we demonstrate the feasibility of in-vivo tumor type classification using hyperspectral imaging and deep learning. We analyze the value of using multiple hyperspectral bands compared to conventional RGB images and we study several machine learning models' ability to make use of the additional spectral information. Based on our insights, we address spectral and spatial processing using recurrent-convolutional models for effective spectral aggregation and spatial feature learning. Our best model achieves an AUC of 76.3%, significantly outperforming previous conventional and deep learning methods.

Marcel Bengs, Nils Gessert, Wiebke Laffers, Dennis Eggert, Stephan Westermann, Nina A. Mueller, Andreas O. H. Gerstner, Christian Betz, Alexander Schlaefer
Synthetic and Real Inputs for Tool Segmentation in Robotic Surgery

Semantic tool segmentation in surgical videos is important for surgical scene understanding and computer-assisted interventions as well as for the development of robotic automation. The problem is challenging because different illumination conditions, bleeding, smoke and occlusions can reduce algorithm robustness. Labelled data for training deep learning models is still lacking for semantic surgical instrument segmentation, and in this paper we show that it may be possible to use robot kinematic data coupled with laparoscopic images to alleviate the labelling problem. We propose a new deep learning based model for parallel processing of both laparoscopic and simulation images for robust segmentation of surgical tools. Due to the lack of laparoscopic frames annotated with both segmentation ground truth and kinematic information, a new custom dataset was generated using the da Vinci Research Kit (dVRK) and is made available.

Emanuele Colleoni, Philip Edwards, Danail Stoyanov
Perfusion Quantification from Endoscopic Videos: Learning to Read Tumor Signatures

Intra-operative identification of malignant versus benign or healthy tissue is a major challenge in fluorescence guided cancer surgery. We propose a perfusion quantification method for computer-aided interpretation of subtle differences in dynamic perfusion patterns which can be used to distinguish between normal tissue and benign or malignant tumors intra-operatively by using multispectral endoscopic videos. The method exploits the fact that vasculature arising from cancer angiogenesis gives tumors differing perfusion patterns from the surrounding normal tissues. Experimental evaluation of our method on a cohort of colorectal cancer surgery endoscopic videos suggests that it discriminates between healthy, cancerous and benign tissues with 95% accuracy. (This work was partially supported by the Disruptive Technologies Innovation Fund, Ireland, project code DTIF2018 240 CA.)

Sergiy Zhuk, Jonathan P. Epperlein, Rahul Nair, Seshu Tirupathi, Pól Mac Aonghusa, Donal F. O’Shea, Ronan Cahill
Asynchronous in Parallel Detection and Tracking (AIPDT): Real-Time Robust Polyp Detection

Automatic polyp detection during colonoscopy screening is desired to reduce the polyp miss rate and thus lower patients' risk of developing colorectal cancer. Previous works mainly focus on detection accuracy; however, real-time performance and robustness are equally important for adoption in the clinical workflow. To maintain accuracy, speed and robustness for polyp detection at the same time, we propose a framework featuring two novel concepts: (1) decompose the task into detection and tracking steps to take advantage of both high-resolution static images for accurate detection and the temporal information between frames for fast tracking and robustness; (2) run the detector and tracker in two parallel threads asynchronously so that a heavy but accurate detector and a light tracker can efficiently work together. We also propose a robustness metric to evaluate performance in a realistic clinical setting. Experiments demonstrate that our method outperforms state-of-the-art results in terms of accuracy, robustness and speed.

Zijian Zhang, Hong Shang, Han Zheng, Xiaoning Wang, Jiajun Wang, Zhongqian Sun, Junzhou Huang, Jianhua Yao
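
The asynchronous detector/tracker pattern can be illustrated with two threads sharing the latest frame and detection; the loop timings and placeholder strings below stand in for the actual models.

```python
# Sketch of the asynchronous detect/track pattern: a slow, accurate detector
# and a fast tracker run in separate threads and exchange results via shared
# state protected by a lock.
import threading
import time

latest = {"frame": None, "detection": None}
lock = threading.Lock()

def detector_loop(stop):
    while not stop.is_set():
        with lock:
            frame = latest["frame"]
        if frame is not None:
            det = f"boxes for {frame}"          # heavy detector would run here
            with lock:
                latest["detection"] = det
        time.sleep(0.2)                          # detector is slow (~5 Hz)

def tracker_loop(stop, frames):
    for frame in frames:
        with lock:
            latest["frame"] = frame
            det = latest["detection"]
        _ = (frame, det)                         # light tracker refines last detection
        time.sleep(0.03)                         # tracker keeps real-time rate (~30 Hz)
    stop.set()

stop = threading.Event()
threading.Thread(target=detector_loop, args=(stop,), daemon=True).start()
tracker_loop(stop, [f"frame_{i}" for i in range(30)])
```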
OfGAN: Realistic Rendition of Synthetic Colonoscopy Videos

Data-driven methods usually require a large amount of labelled data for training and generalization, especially in medical imaging. Targeting the colonoscopy field, we develop the Optical Flow Generative Adversarial Network (OfGAN) to transform simulated colonoscopy videos into realistic ones while preserving annotation. The advantages of our method are three-fold: the transformed videos are visually much more realistic; the annotation, such as the optical flow of the source video, is preserved in the transformed video; and the method is robust to noise. The model uses a cycle-consistent structure and optical flow for both spatial and temporal consistency via adversarial training. We demonstrate that the performance of our OfGAN substantially exceeds that of the baseline method on the relevant tasks through both qualitative and quantitative evaluation.

Jiabo Xu, Saeed Anwar, Nick Barnes, Florian Grimpen, Olivier Salvado, Stuart Anderson, Mohammad Ali Armin
Two-Stream Deep Feature Modelling for Automated Video Endoscopy Data Analysis

Automating the analysis of imagery of the Gastrointestinal (GI) tract captured during endoscopy procedures has substantial potential benefits for patients, as it can provide diagnostic support to medical practitioners and reduce mistakes caused by human error. To further the development of such methods, we propose a two-stream model for endoscopic image analysis. Our model fuses two streams of deep feature inputs by mapping their inherent relations through a novel relational network model, to better model symptoms and classify the image. In contrast to handcrafted feature-based models, our proposed network is able to learn features automatically and outperforms existing state-of-the-art methods on two public datasets: KVASIR and Nerthus. Our extensive evaluations illustrate the importance of having two streams of inputs instead of a single stream and also demonstrate the merits of the proposed relational network architecture for combining those streams.

Harshala Gammulle, Simon Denman, Sridha Sridharan, Clinton Fookes
Rethinking Anticipation Tasks: Uncertainty-Aware Anticipation of Sparse Surgical Instrument Usage for Context-Aware Assistance

Intra-operative anticipation of instrument usage is a necessary component for context-aware assistance in surgery, e.g. for instrument preparation or semi-automation of robotic tasks. However, the sparsity of instrument occurrences in long videos poses a challenge. Current approaches are limited as they assume knowledge on the timing of future actions or require dense temporal segmentations during training and inference. We propose a novel learning task for anticipation of instrument usage in laparoscopic videos that overcomes these limitations. During training, only sparse instrument annotations are required and inference is done solely on image data. We train a probabilistic model to address the uncertainty associated with future events. Our approach outperforms several baselines and is competitive to a variant using richer annotations. We demonstrate the model’s ability to quantify task-relevant uncertainties. To the best of our knowledge, we are the first to propose a method for anticipating instruments in surgery.

Dominik Rivoir, Sebastian Bodenstedt, Isabel Funke, Felix von Bechtolsheim, Marius Distler, Jürgen Weitz, Stefanie Speidel
Deep Placental Vessel Segmentation for Fetoscopic Mosaicking

During fetoscopic laser photocoagulation, a treatment for twin-to-twin transfusion syndrome (TTTS), the clinician first identifies abnormal placental vascular connections and laser ablates them to regulate blood flow in both fetuses. The procedure is challenging due to the mobility of the environment, poor visibility in amniotic fluid, occasional bleeding, and limitations in the fetoscopic field-of-view and image quality. Ideally, anastomotic placental vessels would be automatically identified, segmented and registered to create expanded vessel maps to guide laser ablation; however, such methods have yet to be clinically adopted. We propose a solution utilising the U-Net architecture for performing placental vessel segmentation in fetoscopic videos. The obtained vessel probability maps provide sufficient cues for mosaicking alignment by registering consecutive vessel maps using a direct intensity-based technique. Experiments on 6 different in vivo fetoscopic videos demonstrate that vessel intensity-based registration outperformed image intensity-based registration approaches, showing better robustness in both qualitative and quantitative comparisons. We additionally reduce drift accumulation to a negligible level even for sequences with up to 400 frames, and we incorporate a scheme for quantifying drift error in the absence of ground truth. Our paper provides a benchmark for fetoscopic placental vessel segmentation and registration by contributing the first in vivo vessel segmentation and fetoscopic videos dataset.

Sophia Bano, Francisco Vasconcelos, Luke M. Shepherd, Emmanuel Vander Poorten, Tom Vercauteren, Sebastien Ourselin, Anna L. David, Jan Deprest, Danail Stoyanov
Deep Multi-view Stereo for Dense 3D Reconstruction from Monocular Endoscopic Video

3D reconstruction from monocular endoscopic images is a challenging task. State-of-the-art multi-view stereo (MVS) algorithms based on image patch similarity often fail to obtain a dense reconstruction from weakly-textured endoscopic images. In this paper, we present a novel deep-learning-based MVS algorithm that can produce a dense and accurate 3D reconstruction from a monocular endoscopic image sequence. Our method consists of three key steps. Firstly, a number of depth candidates are sampled around the depth prediction made by a pre-trained CNN. Secondly, each candidate is projected to the other images in the sequence, and the matching score is measured using a patch embedding network that maps each image patch into a compact embedding. Finally, the candidate with the highest score is selected for each pixel. Experiments on colonoscopy videos demonstrate that our patch embedding network outperforms zero-normalized cross-correlation and a state-of-the-art stereo matching network in terms of matching accuracy, and that our MVS algorithm produces reconstructions that are several orders of magnitude denser than those of the competing methods when the same accuracy filtering is applied.

Gwangbin Bae, Ignas Budvytis, Chung-Kwong Yeung, Roberto Cipolla
Endo-Sim2Real: Consistency Learning-Based Domain Adaptation for Instrument Segmentation

Surgical tool segmentation in endoscopic videos is an important component of computer-assisted intervention systems. The recent success of image-based solutions using fully-supervised deep learning approaches can be attributed to the collection of large labeled datasets. However, the annotation of a large dataset of real videos can be prohibitively expensive and time-consuming. Computer simulations could alleviate the manual labeling problem; however, models trained on simulated data do not generalize to real data. This work proposes a consistency-based framework for joint learning from simulated and real (unlabeled) endoscopic data to bridge this generalization gap. Empirical results on two datasets (15 videos of the Cholec80 and EndoVis'15 dataset) highlight the effectiveness of the proposed Endo-Sim2Real method for instrument segmentation. We compare the segmentation of the proposed approach with state-of-the-art solutions and show that our method improves segmentation in terms of both quality and quantity.

Manish Sahu, Ronja Strömsdörfer, Anirban Mukhopadhyay, Stefan Zachow
Backmatter
Metadata
Title
Medical Image Computing and Computer Assisted Intervention – MICCAI 2020
Edited by
Prof. Anne L. Martel
Purang Abolmaesumi
Danail Stoyanov
Diana Mateus
Maria A. Zuluaga
S. Kevin Zhou
Daniel Racoceanu
Prof. Leo Joskowicz
Copyright year
2020
Electronic ISBN
978-3-030-59716-0
Print ISBN
978-3-030-59715-3
DOI
https://doi.org/10.1007/978-3-030-59716-0
