Topical Review

In silico simulation: a key enabling technology for next-generation intelligent surgical systems


Published 15 May 2023. © 2023 The Author(s). Published by IOP Publishing Ltd.

Advances in in silico trials of medical products: evidence, methods and tools

Citation: Benjamin D Killeen et al 2023 Prog. Biomed. Eng. 5 032001. DOI: 10.1088/2516-1091/acd28b


Abstract

To mitigate the challenges of operating through narrow incisions under image guidance, there is a desire to develop intelligent systems that assist decision making and spatial reasoning in minimally invasive surgery (MIS). In this context, machine learning-based systems for interventional image analysis are receiving considerable attention because of their flexibility and the opportunity to provide immediate, informative feedback to clinicians. It is further believed that learning-based image analysis may eventually form the foundation for semi- or fully automated delivery of surgical treatments. A significant bottleneck in developing such systems is the availability of annotated images with sufficient variability to train generalizable models, particularly the most recently favored deep convolutional neural networks or transformer architectures. A popular alternative to acquiring and manually annotating data from clinical practice is the simulation of these data from human-based models. Simulation has many advantages, including the avoidance of ethical issues, precisely controlled environments, and the scalability of data collection. Here, we survey recent work that relies on in silico training of learning-based MIS systems, in which data are generated via computational simulation. For each imaging modality, we review available simulation tools in terms of compute requirements, image quality, and usability, as well as their applications for training intelligent systems. We further discuss open challenges for simulation-based development of MIS systems, such as the need for integrated imaging and physical modeling for non-optical modalities, as well as generative patient models not dependent on underlying computed tomography, MRI, or other patient data. In conclusion, as the capabilities of in silico training mature, with respect to sim-to-real transfer, computational efficiency, and degree of control, they are contributing toward the next generation of intelligent surgical systems.


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

Minimally invasive surgery (MIS) represents a notable shift in surgical methodology, from open procedures to optimized techniques with minimal incisions that reduce secondary effects [1]. Benefits for patients include reduced postoperative pain [2–4]; shorter hospital stays and correspondingly lower costs [4–6]; faster recovery times [4, 7]; fewer major complications [4, 8]; reduced scarring from the smaller incision [4, 9]; and reduced stress on the immune system [4]. At the same time, MIS presents nontrivial procedural challenges stemming from operating through such narrow incisions, such as safely navigating subcutaneous pathways [10–13], identifying percutaneous approaches [13–15], obtaining clinically relevant views [16–19], minimizing exposure in radiation imaging [20–22], and performing complex motions under unique spatial constraints [12, 23]. Although many of these challenges are common in other surgical domains, they can be especially prominent in MIS [24]. Moving forward, intelligent surgical systems promise to preserve the distinct advantages of MIS while offering to alleviate remaining challenges in its delivery, for example by improving intra-operative image quality [25], providing real-time image analysis [26], actuating robotic imaging devices [18, 19, 27], and performing skilled motor actions in the surgical setting [28, 29].

Learning-based intelligent systems have gained increasing prominence in recent years as computational resources and the maturity of artificial intelligence (AI) algorithms enable deployment in multiple areas of medicine [30, 31]. AI already enjoys wide application in radiology, where training benefits from the fact that data needed for diagnostic reasoning—radiographs as input and clinical annotations as output—coincide with information already recorded in standard practice [32, 33]. Unfortunately, this is not the case in MIS, since the purpose of intra-operative imaging is to inform both spatial reasoning and clinical decision making [34, 35]. Although specific spatial annotations, such as camera pose, tool-to-tissue relationships, and anatomical shape, may be obtained from real images through post hoc analysis [36], by incorporating robot kinematic models [37, 38], or by leveraging the shape of known components [39, 40], the assumptions made by these approaches do not extend beyond their specific use-case. The fact that intra-operative images are neither routinely recorded nor stored further complicates the curation of the large data sets needed for training machine learning algorithms, including the deep convolutional neural networks and transformer architectures that form the backbone of modern intelligent systems.

In silico training circumvents these obstacles through the use of computer simulation as an ethical, controlled, and scalable source of data for intelligent systems in MIS. This approach is already prominent in related problems of general robotic manipulation [41, 42] and self-driving [43], each of which contends with challenges specific to the given task, environment, and data modality. In the context of MIS, numerous simulation frameworks have been developed by researchers either with machine learning specifically in mind [28, 44, 45] or as training tools for clinicians and subsequently adapted for machine learning applications [29, 37]. One of the most mature examples is the SimNow software suite, developed for the da Vinci Surgical Robot, which includes training routines for 33 surgical skills [46], as well as complete procedures (Intuitive Surgical Operations, Inc.; Sunnyvale, California). The increasing maturity of this and other simulation frameworks for MIS has heightened interest in in silico training for intelligent systems as well as clinicians.

In this article, we review the progress in and challenges facing in silico training of intelligent MIS systems. Given the differences in application, reviews of simulation training for general robotics, such as [47], focus on the use of RGB(-D) imaging, which generally does not apply in the context of MIS. Further, although previous reviews include recent advances in MIS [48–59], robotic-assisted MIS [55, 60–63], machine learning in surgical interventions [34, 35, 64–72], or surgical simulation for human training purposes [73–76], in silico training specifically for intelligent MIS systems remains an emerging area deserving of an introduction. We focus this review on frameworks and successful applications in three imaging modalities which have received the bulk of researchers' attention, namely endoscopy, ultrasound (US), and x-ray. For each simulation framework, we review the image quality in terms of sim-to-real transfer, simulation dynamism, computational resources, and time cost.

The outline of the review is as follows. In section 2, we introduce overarching concepts relevant to simulation. Diving into recent progress, sections 3–6 explore the various simulation frameworks and their applications for training intelligent systems in MIS, covering endoscopic and stereo microscopic imaging, US, x-ray, and additional modalities such as intra-operative computed tomography (CT), respectively. Table 1 summarizes the available simulation frameworks for each modality. Within each modality, we first introduce the frameworks that have received the most attention from the community for in silico training and subsequently highlight recent alternatives that may be of interest. In section 7, we call attention to the capabilities of systems developed for different modalities and speculate on future directions in this exciting area.

Table 1. Simulation frameworks.

Modality | Framework | Platforms | Open | Description
Visible light | AMBF [45] | C++, Python, ROS | Yes | Physics-based real-time dynamic multi-body simulator
Visible light | AMBF+ [98] | Python, ROS | Yes | Physics-based endoscopic simulator
Visible light | AMBF Drilling Simulator [28] | C++, Python, ROS | Yes | Physics-based microscopic simulator
Visible light | AMBF-RL [99] | Python, ROS | Yes | Physics-based endoscopic simulator
Visible light | VisionBlender [97] | Blender [80] plug-in | Yes | Physics-based monoscopic/stereo simulator
Visible light | SurRoL [29] | Python | Yes | Physics-based endoscopic simulator
Ultrasound | Field II [100–102] | MATLAB, Python, Octave, C | No | Physics-based US imaging simulator
Ultrasound | RLUS [103] | Python | Yes | RL environments using Field II
Ultrasound | k-Wave [104] | MATLAB | Yes | Physics-based US imaging simulator
Ultrasound | FOCUS [105] | MATLAB | Yes | Physics-based US imaging simulator
Ultrasound | PLUS [106] | Slicer [107] plugin | Yes | Translational suite with simple US simulation
Ultrasound | Verasonics | Application | No | Commercial platform for US research
Ultrasound | SIMUS/MUST [108, 109] | MATLAB | Yes | Physics-based US imaging simulator
X-ray | CONRAD [84] | Java + OpenCL | Yes | Physics-based x-ray + CT simulation
X-ray | MC-GPU [110] | CUDA | Yes | Volumetric Monte Carlo simulation
X-ray | DeepDRR [44, 111] | Python + CUDA | Yes | Physics-based x-ray simulation
X-ray | DRRGenerator [112] | Slicer extension | Yes | Conventional DRR visualization plugin
X-ray | Insight Toolkit (ITK) | C++, Python | Yes | Scientific image processing, segmentation, and registration

2. Simulation methods for MIS

The required capabilities of a simulation framework for in silico training depend on the task at hand. For surgical guidance, the simulator must incorporate dynamic spatial relationships of the tools, anatomy, and imaging device, while performing surgical tasks requires tool-to-tissue interactions with the target anatomy to be represented in the simulation. For example, an in silico training framework for vertebroplasty may target the needle insertion task, which is primarily a navigation problem, or the cement injection process, in which interventionalists must decide whether to continue injecting bone cement or stop [77, 78]. The former involves simulating radiographs with surgical tools present at various stages of insertion and with varied poses relative to the anatomy, as in [79], while the latter—in addition to simulating images and tools—also requires the simulation of the fluid dynamics and imaging characteristics of bone cement in osteoporotic bone. In other words, the first challenge can be likened to a computer graphics problem, ensuring simulated images match their real-world counterparts as much as possible, while the second challenge further requires integration of a physics engine, estimating how physical processes unfold over time. We distinguish between these two facets of simulation, the visual and the physical, since many simulation frameworks at this moment focus on one or the other.

Within both physical and visual simulators, there are two major approaches for computation, based on the representation of the underlying data. Volumetric methods rely on dense 3D arrays for data representation, such as those provided by CT and MRI scans, and are typically used to simulate image formation for x-ray and US images. Sparse methods rely on point-cloud or surface data to represent objects, allowing for memory-efficient representations of spatially diverse scenes with moving objects. Rendering of surface meshes falls under this category, as do most rigid body physics engines, which represent objects as surface meshes. Figure 1 illustrates the difference between these two approaches for image rendering in the optical domain. In figure 1(a), the computation of a given pixel value involves 'marching' along the corresponding ray and processing the contributions at each point in the volume [82]. When considering interaction effects, like scattering, this technique can result in photo-realistic rendering of complicated scenes, but until recently it was computationally prohibitive for generating large datasets. The use of increasingly powerful graphical processing unit (GPU) devices has made volumetric methods more feasible for a variety of applications. Figure 1(b), by contrast, considers only the points where a ray intersects a surface mesh, a more efficient computation in terms of time and memory [83].
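To make the volumetric approach concrete, the following is a minimal sketch (in Python/NumPy, with illustrative function and parameter names not drawn from any particular framework) of accumulating samples along a single ray through a dense volume, approximating a line integral by nearest-neighbor lookups; a production simulator would add interpolation, energy-dependent physics, and GPU acceleration.

```python
import numpy as np

def march_ray(volume, origin, direction, step=0.5, n_steps=512):
    """Approximate the line integral of a dense 3D volume along one ray.

    volume    : 3D array of per-voxel properties (e.g. attenuation).
    origin    : ray start point in voxel coordinates, shape (3,).
    direction : unit direction vector in voxel coordinates, shape (3,).
    """
    origin = np.asarray(origin, dtype=float)
    direction = np.asarray(direction, dtype=float)

    # Sample positions along the ray: origin + t * direction.
    t = np.arange(n_steps) * step
    points = origin[None, :] + t[:, None] * direction[None, :]

    # Keep only samples that fall inside the volume (nearest-neighbor lookup).
    idx = np.round(points).astype(int)
    inside = np.all((idx >= 0) & (idx < np.array(volume.shape)), axis=1)
    samples = volume[tuple(idx[inside].T)]

    # Approximate the integral as a Riemann sum over the sampled values.
    return samples.sum() * step
```

Calling such a routine once per detector pixel, with rays defined by the imaging geometry, yields one simulated projection image.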


Figure 1. Simulation methods for 2D image formation generally adhere to two major approaches based on the data structure: (a) volumetric simulation and (b) sparse simulation, which may use point clouds, surface meshes (as shown here), spline-based object models, or other spatial data structures. In (a), dense grid data makes up a volume, which individual rays 'march' through to model energy-matter interactions. In (b), a sparse triangle mesh models the surface of an object, which one or more rays intersect at a finite set of points. The latter approach is commonly used in computer graphics, with open source tools such as Blender [80] available for creating 3D scenes with complicated meshes, which can also be sampled from statistical shape models [81].


The benefits of sparse simulation methods primarily stem from their increased computational efficiency, which can enable simulations with more capable dynamic interactions, while volumetric methods contain more detail for rendering realistic images. A sparse patient model, for example, consists of enclosed meshes segmenting each organ, as in [81]. This enables fast simulation of physical interactions, such as collision detection and realistic tissue deformation, and, for visible light imaging modalities in which the reflective properties of surfaces are responsible for a large portion of image formation, it can provide the basis for photorealistic image rendering [80]. Simulating transmissive imaging modalities, however, such as x-ray and US, requires modeling the energy-matter interactions throughout a medium, including the interior of surface meshes. For sparse models, this can be accomplished by modeling the distribution of attenuation properties within each region, e.g. by assuming uniform density and material composition or using splines [84], but volumetric patient models are inherently better suited for this task because they contain dense material property data. Thus, for transmissive modalities, volumetric models (with sufficient spatial resolution) enable more realistic simulation of image formation, while sparse methods are sufficient for realistic rendering of visible light images and support more powerful simulation dynamics.

The quality of imaging simulation directly affects the sim-to-real performance of the system, that is, its ability to train in silico but perform on the corresponding domain of real images. In machine learning, this problem is referred to as domain adaptation. One domain adaptation approach, which we find to be commonly used in the papers described here, is the use of generative adversarial networks (GANs) to translate images from the simulation domain to the real domain, using an unsupervised cycle-consistent loss [85]. Separately, it should be noted that the sim-to-real capability of a model differs from its generalizability. The former is a matter of image appearance and tool-to-tissue interaction (if considered), while the latter describes an intelligent system's performance on samples and situations not seen during training, in either the simulation or real domain. As an example, a bone segmentation algorithm may achieve high sim-to-real performance but still fail when confronted with anomalous anatomy or fractures, regardless of the domain. The differences in imaging parameters and techniques between institutions also introduce the need for generalizability [86]. In general, it is understood that training a generalizable model requires sufficient variation in the simulation parameters, including image formation characteristics as well as patient demographics, pathologies, and anatomical features.
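For reference, the cycle-consistency loss used in such unpaired image translation penalizes round-trip translation errors between a simulated domain $X$ and a real domain $Y$. With generators $G: X \rightarrow Y$ and $F: Y \rightarrow X$, it is commonly written as $\mathcal{L}_\textrm{cyc}(G, F) = \mathbb{E}_{x \sim X}[\|F(G(x)) - x\|_1] + \mathbb{E}_{y \sim Y}[\|G(F(y)) - y\|_1]$, and it is added to the adversarial losses of the two generators.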

In order to facilitate in silico training, it is necessary to obtain ground truth annotations of the image. For 2D imaging modalities, this often involves propagating information from the 3D simulation, where positions and orientations for objects of interest are known, to the 2D image being simulated. For example, in endoscopic, stereoscopic, and x-ray imaging, the intrinsic parameters of the camera geometry—its focal length and pixel spacing—can be modeled in the intrinsic matrix $\mathbf{K} \in \mathbb{R}^{3 \times 3}$, while the position and orientation of the camera are encoded in the extrinsic matrix $[\mathbf{R} | \mathbf{t}]$. Thus for a homogeneous 3D point $\tilde{\mathbf{x}} = [x, y, z, 1]$ in the simulated world, such as an anatomical landmark, the corresponding homogeneous point in the image can be determined as

$\tilde{\mathbf{u}} = \mathbf{K}\,[\mathbf{R} | \mathbf{t}]\,\tilde{\mathbf{x}}, \qquad (1)$

as in figure 2. Similar operations can determine image-space projections for slice-to-volume transformations, as they arise in US, and extend to different annotations such as lines, volumes, and 3D segmentations, enabling richly annotated, large-scale datasets to be generated from relatively few annotations of the underlying 3D data.
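As a minimal illustration of equation (1) and figure 2, the following Python/NumPy sketch (with illustrative intrinsic and extrinsic values, not taken from any particular framework) propagates a 3D landmark annotation into pixel coordinates.

```python
import numpy as np

def project_landmark(K, R, t, x_world):
    """Project a 3D landmark to pixel coordinates using P = K [R | t]."""
    P = K @ np.hstack([R, t.reshape(3, 1)])            # 3x4 projection matrix
    x_h = np.append(np.asarray(x_world, float), 1.0)   # homogeneous 3D point
    u_h = P @ x_h                                       # homogeneous image point
    return u_h[:2] / u_h[2]                             # dehomogenize to (u, v)

# Example: identity extrinsics and a simple intrinsic matrix.
K = np.array([[1000.0, 0.0, 320.0],
              [0.0, 1000.0, 240.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
print(project_landmark(K, R, t, [0.01, -0.02, 0.5]))    # -> [340. 200.]
```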


Figure 2. Pinhole camera geometry, as often used for modeling endoscopic, stereoscopic, and x-ray imaging. The camera projection matrix $\mathbf{P} = \mathbf{K} [\mathbf{R} | \mathbf{t}]$ can be used to propagate 3D annotations, like the landmark $\tilde{\mathbf{x}}$, to the image plane.


2.1. Instrument and patient models

Given a simulator's data representation, the question arises of how to obtain compatible instrument and patient models for MIS. For instruments, this is relatively straightforward since computer-aided design of individual components utilizes sparse representations, which can be converted to volumetric representations by voxelization. Since instruments generally consist of homogeneous components with a small number of materials, sparse representations are often capable of simulating them realistically for either visible light or transmissive imaging modalities. Patient data, on the other hand, can be derived from 3D medical imaging or digital phantoms. CT and MRI imaging are commonly used to create volumetric patient models, and they can be segmented to create sparse representations as well. They are limited, however, in terms of the structures they can model, which must be visible in the original image, and the generality of anatomical features, which pertain to specific individuals.

Digital phantoms, on the other hand, are computer-generated models that can provide geometric and biophysical properties of the anatomy of interest [81, 87]. They are also referred to as virtual phantoms [81, 87, 88], digital reference objects [89], or computational anthropomorphic phantoms [90]. A key advantage of digital phantoms is their capability to represent complex shapes and material properties for populations rather than specific individuals [9193]. Moreover, in contrast with ex vivo medical imaging, digital phantoms are not subject to tissue shrinkage, deformation, or any physical changes caused by invasive or post-mortem procedures [94]. They are analogous to instrument models generated through computer-aided design in that they can be converted directly into sparse patient models or voxelized into volumetric models and in that their suitability for a given simulation depends on the features included and the level of detail.

Several digital phantoms have been developed to facilitate in silico medical imaging and support related research. The Virtual Imaging Clinical Trial Regulatory Evaluation phantom, for example, replicates the tissue and large vessels of the breast for recapitulation of clinical features and is used in the regulatory evaluation of imaging products [87]. The 4D extended cardiac-torso (XCAT) phantom models the torso anatomy with cardiac and respiratory motions, using nonuniform rational B-splines to define organ shapes [81]. These models can be manipulated to represent a range of patient characteristics, such as height, weight, and anatomical anomalies, for both male and female anatomies. Future research efforts in digital phantoms aim to improve their accuracy and resolution to support more advanced applications, potentially by incorporating recent advances in machine learning [90].

2.2. Physics engine

A 'physics engine' is a type of simulation that determines how virtual objects will behave over time, according to realistic modeling of collisions, friction, fluids, deformable solids, and other physical phenomena [47]. This is different from the visualization of these objects, which is referred to as rendering or image formation, and occurs for a single timestep. The time and computational complexity of a physics engine depends on the physical processes at play and simulation quality. For example, fluid dynamics and deformable tissue modeling are more computationally intensive than rigid body simulation, and recent work has focused on accelerating these capabilities to enable real-time intra-operative modeling for in silico surgical trials [95, 96]. For in silico training, physics engines have been developed prominently for applications in endoscopy and stereo microscopy [28, 45, 97], where visible sensors more closely resemble RGB(-D) cameras commonly deployed in general robotics.
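As a minimal sketch of how a rigid body physics engine is stepped independently of image formation, the following uses PyBullet [128] (the engine underlying SurRoL, discussed in section 3.3) in headless mode; the loaded URDF assets are generic placeholders from the pybullet_data package rather than surgical models.

```python
import pybullet as p
import pybullet_data

# Headless physics simulation: stepping dynamics is separate from rendering.
client = p.connect(p.DIRECT)                      # no GUI; suitable for data generation
p.setAdditionalSearchPath(pybullet_data.getDataPath())
p.setGravity(0, 0, -9.81)
p.setTimeStep(1.0 / 240.0)                        # physics timestep, independent of image rate

plane = p.loadURDF("plane.urdf")                  # static ground plane
body = p.loadURDF("r2d2.urdf", basePosition=[0, 0, 0.5])  # generic placeholder body

for _ in range(240):                              # advance the world one simulated second
    p.stepSimulation()

position, orientation = p.getBasePositionAndOrientation(body)
p.disconnect(client)
```

The pose returned after stepping could then be passed to a separate rendering step, reflecting the division of labor between physics engines and image formation described above.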

3. Visible light imaging

Endoscopic or laparoscopic imaging is carried out via an endoscope, a flexible instrument with a light source, which can be inserted through natural openings or narrow incisions [113]. Early implementations of this technology made use of fiber optic bundles that transmitted visible light to a focusing lens at the top of the bundle, but frequent ruptures in the bundle obscured visual clarity. In the 1990s, the advent of videoendoscopy replaced fiber optic devices with charge-coupled devices [113], improving image quality and reliability while also enabling digital processing of endoscopic images. From its advent, endoscopic imaging has been strongly associated with MIS. Laparoscopic procedures, in which the endoscope and surgical instruments are inserted through narrow incisions in the midsection, include partial gastrectomy [114], cholecystectomy [115], splenectomy [116], pancreatic resection [117], colorectal resections, and polypectomies [118], among others [119]. Endoscopic imaging is also commonly used for surgical interventions in the ear [120], mouth [121], and sinus cavity [122]. The advantages of this modality are high resolution and radiation-free imaging of soft tissue, although it requires an anatomical pathway for the endoscope, often involving an incision.

In the context of in silico training, endoscopic imaging is appealing because its reliance on visible light cameras allows biomedical engineers to take advantage of related research in general computer vision and computer graphics. In particular, computer simulation frameworks for endoscopic surgery may be built on top of more general software [97], taking advantage of photorealistic ray-tracing [80] or commercial physics engines [123]. Prevalent challenges in this setting have to do with simulation of deformable surfaces [124], with much work focused on developing intelligent control policies for robotic surgery. Surgeons and intelligent systems must also contend with smoke from energized devices [25, 38] and blood. Finally, in order to facilitate machine learning, it is desirable for simulation frameworks to provide various ground truth data by default, including segmentation maps, depth maps, object poses, camera pose, and more [98].

3.1. Asynchronous multi-body framework (AMBF)

In 2019, Munawar et al [45] introduce the AMBF for simulating complex robots with realistic environmental dynamics, supporting closed-link kinematic chains and redundant mechanisms, which are common in surgical robotics but not supported by popular robotics frameworks. In addition, the simulator supports non-rigid bodies, such as soft tissue, and provides built-in models of real-world surgical robots. These properties have supported immersive simulation of lateral skull base surgery for simultaneous training and data generation, suitable for training machine learning models [28].

Building upon the AMBF, Munawar et al [98] propose a unified simulation platform for robotic surgery with the da Vinci robot to facilitate the prototyping of various scenes, including an example suturing task. Their contributions enable real-time control and collision detection within the simulation, as well as generation of ground truth data for machine learning, enabling human-driven simulation to provide kinematic as well as image data over extended sequences.

To support simultaneous human training and in silico collection of data for machine learning applications, Munawar et al [28] further extend the aforementioned functionality to include haptic feedback for lateral skull base surgery as well as a simulated stereo microscope to enable depth perception in the operating field. They demonstrate the utility of this approach for the purposes of training intelligent systems by training a stereo depth network (STTR [125]) on the simulation images.

Varier et al [99] propose AMBF-reinforcement learning (RL) to support real-time RL for surgical robotic tasks. They validate their approach using in silico training of an RL agent for debris removal with the da Vinci Research Kit (dVRK) patient side manipulator and successfully transfer the optimal policy to a real robot.

3.2. Blender

Cartucho et al [97] introduce VisionBlender, a software plugin that allows users to create highly realistic computer vision datasets with segmentation maps, depth maps, and other ground truth, using the 3D modeling software Blender [80] as a backbone. VisionBlender is designed with robotic surgical applications specifically in mind, supporting playback through the Robot Operating System (ROS) to simulate real-time data collection from a da Vinci robot via the dVRK. Cartucho et al [126] utilize VisionBlender [97] to generate a simulated laparoscopic dataset for a performance evaluation of their improved marker for surgical tool tracking.

Chen et al [25] use Blender [80] to simulate smoke in endoscopic video, such as is commonly caused by energized surgical instruments. In this case, in silico training enables them to generate scalar ground truth masks for the amount of smoke, decomposing the learning problem into a two-step process and enabling smoke removal via generative cooperative networks. Zhou et al [127] develop an open-source simulation tool based on Blender [80] for generating synthetic endoscopic videos and relevant ground truths, allowing rigorous evaluation of 3D reconstruction methods.

3.3. Alternative frameworks

Emulating recent successes in general robotic manipulation using deep RL, Xu et al [29] introduce an open-source, RL-centered platform for training intelligent systems on skills using the da Vinci Surgical Robot. Their 'SurRoL' platform relies on the PyBullet [128] physics engine for real-time interaction and can be deployed on real da Vinci robots using the dVRK [129]. Ten surgery-related tasks are included as RL environments, compatible with OpenAI Gym [130], including fundamental tasks in laparoscopy like peg transfer [131], needle transfer, and camera manipulation to obtain desired views.

Garcia-Peraza-Herrera et al [132] take a semi-in silico approach, simulating images of robotic laparoscopic surgery by overlaying sample surgical instruments in the foreground (captured using a green screen) of instrument-free background images. Although both background and foreground are captured rather than simulated, the separation enables straightforward simulation of the instruments in multiple arbitrary poses, with known segmentation maps. This approach resembles computational simulation of the entire process, although it lacks rich annotations such as depth maps and surface normals, which are readily available from fully in silico approaches.
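A minimal sketch of this compositing idea, assuming chroma-keyed foreground and instrument-free background images of equal size (NumPy arrays; the key color and tolerance are illustrative rather than taken from [132]):

```python
import numpy as np

def composite(foreground, background, key=(0, 255, 0), tol=60):
    """Overlay a green-screen instrument image onto a surgical background.

    foreground, background : HxWx3 uint8 images of the same size.
    Returns the composited image and the binary instrument mask (the ground truth).
    """
    fg = foreground.astype(np.int16)
    # Pixels far from the chroma-key colour are treated as instrument.
    mask = np.linalg.norm(fg - np.array(key, dtype=np.int16), axis=-1) > tol
    out = np.where(mask[..., None], foreground, background)
    return out, mask
```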

Colleoni et al [133] propose a deep neural net for processing of laparoscopic and simulation images for robust segmentation of surgical tools. They collect real image and kinematic data using the dVRK and utilize the kinematic data to produce image data of virtual tools using a dVRK simulator [134]. Similarly, Ding et al [38] introduce a vision- and kinematics-based approach to robot tool segmentation based on a complementary causal model that preserves accuracy under domain shift to unseen domains. They collect a counterfactual dataset, where individual instances differ only by the presence or absence of a specific source of corruption, using a technique similar to that in [133], further demonstrating the utility of this type of data collection while reporting similar challenges, namely the lack of tool-to-tissue interaction in the replay paradigm.

Wu et al [135] introduce a soft tissue simulator that is unified with the dVRK framework and uses the SOFA framework [136] as the physics simulator. With robotic vision and kinematic data, they train a network to learn correction factors for finite element method simulations based on the discrepancy between simulations and real observations. In their follow-on work [137], the authors present a faster approach, in which they implement a step-wise framework in the network for interactive soft-tissue simulation and real-time observations.

Several datasets based on in silico simulation are available for visible light imaging modalities, although the simulation framework itself may not be available. The Surgical Visual Domain Challenge had participants transfer skills from simulated da Vinci surgery (based on the SimNow simulator) to real endoscopic video [138]; its data are available as the 'SurgVisDom' dataset. Madapana et al [139] investigate sim-to-real skill transfer in the context of dexterous surgical skills, simulating a dataset for the aforementioned peg transfer task and collecting corresponding instances on real robots, namely the Taurus II and YuMi. Their DESK dataset is publicly available. Rahman et al [140] use a simulated OpenAI Gym environment and the real-world DESK dataset to evaluate robotic activity classification methods.

4. US imaging

In US imaging, a transducer transmits compressions and rarefactions through tissue and reconstructs an image based on the reflections of these waves [54]. The heterogeneous elastic properties of internal tissues determine characteristic reflectance patterns, enabling soft and hard tissue structures to be resolved. A typical US probe operates in the MHz range [141] and can be small, low-cost, and non-invasive [142]. US guidance is standard practice for minimally invasive biopsies for breast [143] and prostate cancer [144], and it has been used in tandem with robotic laparoscopy [145–147]. Other US guided procedures include minimally invasive prostatectomy [148], nephrectomy, adrenalectomy [149], and cardiac surgery [150]. Opportunities for intelligent systems in this area include image interpretation and autonomous probe movement, both of which have been explored using in silico training.

US simulation methods range from simple image manipulations to physics-based models of wave propagation. They can rely on sparse representations of data, with finite point sources at known locations, or volumetric patient data based on CT or MRI. In the latter case, it is necessary to process the original image, which measures either x-ray attenuation in the case of CT or atomic spin in MRI, to infer tissue elasticity and reflectance properties relevant to US. If an organ segmentation is available, the US absorption and attenuation can be estimated separately for each tissue type based on values in the literature [151]. A distinctive trait of US images is speckle, which arises from micro-inhomogeneities in tissue [152]. Although such structure is smaller than the spatial resolution of CT or MRI data, it can likewise be modeled based on work done for tissue characterization purposes [153, 154].
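As a sketch of the segmentation-based approach, the following maps a labeled organ volume to acoustic property volumes; the numeric values here are illustrative placeholders, whereas a real simulation would draw speed of sound and attenuation from the literature [151].

```python
import numpy as np

# Illustrative per-tissue acoustic properties (speed of sound in m/s,
# attenuation in dB/cm/MHz); placeholder values, not literature references.
TISSUE_PROPERTIES = {
    0: (1540.0, 0.5),   # generic soft tissue
    1: (1450.0, 0.6),   # fat
    2: (1580.0, 1.0),   # muscle
    3: (3200.0, 10.0),  # bone
}

def labels_to_acoustic_maps(label_volume):
    """Convert an organ label volume into speed-of-sound and attenuation volumes."""
    speed = np.zeros(label_volume.shape, dtype=np.float32)
    atten = np.zeros(label_volume.shape, dtype=np.float32)
    for label, (c, a) in TISSUE_PROPERTIES.items():
        mask = label_volume == label
        speed[mask] = c
        atten[mask] = a
    return speed, atten
```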

In general, the US community prefers MATLAB toolkits, although Python wrappers exist for several frameworks (Python being preferred by the machine learning community owing to open source tools such as PyTorch [155] and TensorFlow [156]). Below, we review these tools and their applications for image-guided MIS.

4.1. Field II

Originally developed starting in 1991, Field [100–102] represents one of the early complete solutions for computational simulation of US image formation. Based on the Tupholme–Stepanishen method [157–159], Field can simulate any kind of transducer geometry and excitation, with fast computation made possible by a far-field approximation. In 1997, Field II [160] parallelized execution to reduce simulation time significantly, eventually enabling simulation of 128 images in 42 s with full precision (0.33 s per image) in 2014 [161]. As of December 2022, Field II continues to enjoy a wide user base and regular software updates, compatible with modern processors and major operating systems (Windows, Mac OS, and Linux). Parallel execution and a native Python implementation are available for commercial use as part of Field II Pro.

The fast inference time of deep neural networks (DNNs) means that real-time applications can improve image quality in the operating room. Hyun et al [162] use Field II to simulate a training set for the purpose of speckle reduction in B-mode images, introducing log-domain loss functions tailored for US. They showed that speckle reduction using their DNN trained in silico outperformed the traditional delay-and-sum and nonlocal means methods on real images, in terms of preservation of resolution and detail.

Automatic segmentation of US images is of high interest for MIS applications, enabling reconstruction of subcutaneous structures for surgical navigation. Nair et al [163] leverage the power of the Field II simulator, which can provide the raw transducer signal as well as the final reconstruction, to perform simultaneous image reconstruction and segmentation in separate channels. They demonstrate that segmentation of anechoic targets, based on point source simulation with Field II, enables reasonable sim-to-real performance on in vivo test data, achieving a DICE score of $0.77 \pm 0.07$. Amiri et al [164] explore the transfer learning problem in the context of US images, questioning the popular approach to fine tuning final layers of a U-Net [165]. Their findings indicate that fine-tuning of shallow DNN layers, which are responsible for recognizing low-level image features, is a more effective strategy in the US domain than the common practice of fine-tuning deep layers. Finally, [166] propose a DNN-based method for biopsy needle segmentation in US-guided MIS, using the Field II simulator to create a training set of 809 images. Despite this small training size, the trained U-Net with attention blocks was able to localize needles with 96% precision and angular error of 0.40 in vivo.

4.1.1. RLUS

Obtaining optimal positioning and imaging settings can be a challenging task for US guidance in MIS [167]. To facilitate the development of autonomous US imaging systems, Jarosik and Lewandowski [103] propose a set of RL environments for positioning an US probe and finding an appropriate scanning plane, termed RLUS. Using Field II [100], they simulate a linear-array transducer acquiring information from a simulated phantom, with cysts forming the objects of interest. The RL agent was capable of manipulating the probe while observing the resulting B-mode frames, with positive feedback based on centering the object in the image. Their results indicate that RL agents are able to learn effective policies in simulation for US imaging, although more experiments are needed to determine in vivo performance.
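A skeleton of such a probe-positioning task, expressed against the OpenAI Gym interface [130], is sketched below in Python; the class, reward, and rendering placeholder are illustrative and do not reproduce the RLUS environments, whose observations would come from a physics-based simulator such as Field II.

```python
import numpy as np
import gym
from gym import spaces

class ProbePositioningEnv(gym.Env):
    """Toy probe-positioning environment: move a probe laterally until the
    target (e.g. a cyst) is centered in the simulated B-mode frame."""

    def __init__(self, image_shape=(128, 128)):
        self.image_shape = image_shape
        self.action_space = spaces.Discrete(3)            # move left, stay, move right
        self.observation_space = spaces.Box(0.0, 1.0, image_shape, dtype=np.float32)

    def reset(self):
        self.probe_x = np.random.uniform(-20.0, 20.0)     # lateral offset in mm
        return self._render_bmode()

    def step(self, action):
        self.probe_x += (action - 1) * 1.0                 # 1 mm per step
        obs = self._render_bmode()
        reward = -abs(self.probe_x)                        # encourage centering the target
        done = abs(self.probe_x) < 0.5
        return obs, reward, done, {}

    def _render_bmode(self):
        # Placeholder for a physics-based simulator (e.g. Field II) that would
        # return a B-mode frame for the current probe pose.
        return np.zeros(self.image_shape, dtype=np.float32)
```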

One of the drawbacks of the Field II simulator is the computation time required for realistic image simulation. Peng et al [168] approximate Field II images by training a GAN [85] to translate MRI images into the US domain. They achieve a real-time frame rate of 15 frames s−1 on a GPU laptop, with images both quantitatively and visually comparable to those from the Field II simulator. Faster image simulation is highly relevant to training intelligent systems, especially RL agents, which typically train simultaneously with environment simulation.

4.2. k-Wave

One way to reduce the computation necessary for US imaging is the k-space pseudospectral method, which discretizes the equations modeling nonlinear wave propagation [169]. The open-source tool k-Wave [104] is a MATLAB toolbox capable of simulating physically realistic images quickly, using a planar detector geometry based on the fast Fourier transform. Unlike other tools, k-Wave improves computational speed through parallel execution on GPUs, simulating 1000 time-steps for a grid size of $128^3$ in 7 min (approximately 0.4 s per timestep), more than a 2.5-fold speedup compared to multi-threaded CPU execution.

US image guidance often requires visualization of point-like targets, such as needle cross sections or catheters, which can be confused with point-like artifacts commonly present in the surgical setting. Allman et al [170] show the advantage of DNNs in this area, which are able to distinguish true point sources with 96.67% accuracy in phantom, after training in silico with k-Wave [104]. They achieved sub-millimeter point source localization error ($0.38 \pm 0.25$ mm), enabling visualization of a novel artifact-free image in the context of MIS. A similar approach uses k-Wave simulation to precisely localize point target vessel structures for guidance in MIS [171].

The goal of distinguishing point sources with high accuracy may benefit from photoacoustic imaging, where the absorption of optical or radio-frequency electromagnetic excitations causes tissue to generate acoustic waves [172]. Combining in silico and in vivo data for training, [173, 174] explore the use of light-emitting diodes as excitation sources for photoacoustic visualization of clinical needles in MIS, developing a DNN-based system to enhance the visualization and achieving a 4.3-times higher signal-to-noise ratio compared to conventional reconstructions. Their semi-synthetic approach allows for complete knowledge of the desired ground truth while reducing the sim-to-real gap.

4.3. FOCUS

The Fast Object-Oriented C++ US Simulator (FOCUS) is a fast US simulator available for MATLAB [105]. It resolves large errors in US simulation in the nearfield and at the transducer face using the fast nearfield method [175] and time-space decomposition [176]. Comparing the simulations of FOCUS and Field II reveals this difference for a sampling frequency as low as 25 MHz, where the impulse response calculation in Field II introduces aliasing artifacts [177].

Because US images typically contain only 2D data from a linear array, resolving 3D pose based on the work in the previous section can be nontrivial. Arjas et al [178] propose a solution based on in silico training combined with a Kalman filter to improve localization over continuous US acquisition, as is common in the surgical setting. They train a DNN based on the FOCUS simulator [105], although applications to inhomogeneous tissue would require simulation with the k-Wave framework. Nevertheless, their approach shows promise for the critical task of reconstructing 3D tool poses from 2D US images, achieving 0.3 mm maximum error when transferring to real US images of needles submerged in water.

4.4. PLUS toolkit

Translating in silico and even in vivo performance to real-world systems can prove a significant implementation challenge. In 2014, Lasso et al [106] introduce the Public software Library for Ultrasound (PLUS) toolkit, which is aimed at translating image analysis methods for US-guided interventions into clinical practice. The open source platform includes modules for tool tracking, US image acquisition, spatial and temporal calibration, volume reconstruction, data streaming, and US image simulation based on surface meshes, using methods proposed in [179], making it applicable for translational research in other imaging modalities as well as US. US simulation with PLUS achieves impressive performance of 50 frames per second, at a resolution of $564 \times 597$ and 256 scan-lines per frame on a 3.4 GHz CPU, although it lacks the artifacts and speckle present in real US images.

Despite lacking these distinctive characteristics, the PLUS framework can be effectively used for in silico training. Patel and Hacihaliloglu [180] show that a network trained with PLUS-generated images was able to transfer bone segmentation skills to real US, with applications in US-guided percutaneous orthopedic surgery. They compare the performance of a DNN trained with small-scale real image training to large-scale training only possible via simulation, demonstrating the superiority of a transfer learning network that leverages the latter.

4.5. Alternative frameworks

Although the above frameworks constitute the bulk of tools used for in silico US training, alternative methods have been proposed. In 2015, [181] introduce an efficient US simulation method utilizing volumetric data (in this case an MRI scan), based on convolutional ray-tracing. Depending on the number of scan-lines, reflection depth, and axial resolution, their per-image simulation time lies between 0.1 and 1 s on contemporary hardware (NVIDIA GTX 850M). Image quality is improved using non-linear optimization of the simulation parameters with respect to real US images, although applications for in silico training have yet to be explored.

As mentioned in the previous section, simulating the underlying physical process of US may not be necessary for in silico training, if the sim-to-real gap can be overcome through augmentation or transfer learning. In the same vein, Sharifzadeh et al [182] introduce an ultra-fast dataset generation method specifically for training DNNs, although it would not be capable of US simulation in a fully controlled sense. In their approach, a real US image $I_\textrm{real}$ and an arbitrary segmentation mask $I_\textrm{mask}$ are transformed using the Fourier transform $\mathcal{F}$, alpha-blended in the frequency domain, and converted back to the image domain with the inverse Fourier transform:

$I_\textrm{sim} = \mathcal{F}^{-1}\left[(1 - \alpha)\,\mathcal{F}(I_\textrm{real}) + \alpha\,\mathcal{F}(I_\textrm{mask})\right], \qquad (2)$

where $\alpha \in [0, 1]$ controls the blend.

Although an open source framework for this approach is not available, their method is straightforward enough to be re-implemented with standard tools. Moreover, they demonstrate that this approach, combined with image augmentation, achieved a 15.6% higher DICE score on real US images than networks trained with Field II, while simulation of these images was 36 000 times faster, leveraging the fast Fourier transform [183]. However, like Garcia-Peraza-Herrera et al [132], this method only simulates the added objects; complete ground truth for the underlying real images is not available.
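For instance, a re-implementation sketch along the lines of equation (2), using only NumPy, is given below; the blending weight and normalization details are illustrative and may differ from [182].

```python
import numpy as np

def blend_in_frequency_domain(i_real, i_mask, alpha=0.5):
    """Alpha-blend a real US image with a synthetic mask in the Fourier domain."""
    f_real = np.fft.fft2(i_real.astype(np.float32))
    f_mask = np.fft.fft2(i_mask.astype(np.float32))
    f_blend = (1.0 - alpha) * f_real + alpha * f_mask   # blend the spectra
    return np.real(np.fft.ifft2(f_blend))               # back to the image domain
```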

Rather than compute 2D US images from digital phantoms, Li et al [27] propose to use volumetric US acquisitions to sample slices for in silico training of an autonomous system for achieving standard US views in spinal sonography. This approach is efficient and results in highly realistic images, which can be sampled dynamically from many viewpoints, although real patient scans are required. They demonstrate that this in silico environment is suitable for deep RL, achieving standard views of the spine to within reasonable margins (5.18 mm/5.25 in the intra-subject setting).
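Slice sampling of this kind can be sketched with standard tools; the following minimal example (hypothetical function and argument names, not from [27]) resamples an arbitrary plane from a 3D volume using SciPy.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def sample_slice(volume, origin, u_axis, v_axis, shape=(256, 256)):
    """Resample a 2D plane from a 3D volume.

    origin         : 3D point (voxel coordinates) at the slice corner.
    u_axis, v_axis : 3D vectors spanning the slice plane, in voxels per pixel.
    """
    u, v = np.meshgrid(np.arange(shape[0]), np.arange(shape[1]), indexing="ij")
    coords = (np.asarray(origin, float)[:, None, None]
              + np.asarray(u_axis, float)[:, None, None] * u
              + np.asarray(v_axis, float)[:, None, None] * v)   # shape (3, H, W)
    # Trilinear interpolation of the volume at the plane coordinates.
    return map_coordinates(volume, coords, order=1, mode="nearest")
```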

The Verasonics Vantage US scanners are a US hardware family well-suited to US research, due to the easy access to raw US data. Along with this hardware, Verasonics supplies an US simulator that is integrated with their imaging platform, making it an attractive research tool that could streamline translation onto real devices, similar to PLUS [106, 109]. Although no theoretical work exists describing their proprietary software, information regarding their approach is available on their website.

Recently, in 2022, [108, 109] propose SIMUS, an open-source US simulator for MATLAB, as a part of the MATLAB ultrasound toolbox (MUST) [184]. SIMUS simulates the acoustic pressure fields, performing comparably to Field II, k-Wave, FOCUS, and Verasonics in terms of image realism. The simplicity and online availability of this framework makes it useful for pedagogical purposes as well as research.

Finally, [185] demonstrate the sim-to-real capabilities of a robotic US imaging system, learning based on RGB images and sensor readings rather than the US images themselves. This approach reduces the simulation problem to one that is more aligned with general robotic manipulation, although it will be an essential component of more complete simulation of an intelligent MIS system. A full simulation of the operating room as a dynamic in silico environment, i.e. a digital twin, will require simulation of the given medical imaging modality as well as the non-medical data like RGB cameras and force sensors.

5. X-ray imaging

X-ray imaging leverages high-energy photons to penetrate tissue, measuring the attenuation of rays dependent on material composition and density. An x-ray tube or 'source' is responsible for photon generation primarily by means of Bremsstrahlung [186]. Modern x-ray imaging devices, including C-arms commonly used in intra-operative imaging, measure the attenuation of these rays with a flat-panel detector [187], creating a 2D projective image. Many C-arm devices also support intra-operative 3D imaging via cone-beam CT (CBCT), although these acquisitions require significant time and radiation compared to individual radiographs, so 'x-ray image guidance' here refers to 2D projective image guidance. Together, the flexibility, image quality, and resolution of x-ray image guidance have contributed to the development of minimally invasive alternatives to procedures in orthopaedics [20–22, 188–193], interventional radiology [21, 77, 78, 194–196], and angiology [196–199].

Simulation frameworks for in silico training primarily focus on simulating x-ray image formation, since this requires simulation of attenuation through the physical medium, rather than reflection off of surfaces, which dominates visible spectrum imaging. This has historically been computationally intensive, although the advent of high-performance, high-capacity GPU devices has enabled fast, realistic x-ray simulation frameworks to be used for training DNNs [44]. Tool-to-tissue interaction has so far been limited to visual interaction in the projection [111], rather than physical interactions. The target of learning in this context is often the 2D/3D registration of the projective image with a pre-operative CT [35, 193] or statistical atlas [200], enabling 3D information to be computed from 2D images. A consistent challenge, due to the ionizing radiation associated with x-ray imaging, is the reduction of the number of acquisitions, according to the as low as reasonably achievable principle [18, 19, 201].

5.1. CONRAD

The CONRAD framework is an early tool for simulating realistic radiographs, providing the first unified library for cone-beam imaging that incorporates nonlinear physical effects and projective geometry, written in Java and accelerated with OpenCL [84]. Primarily intended for applications in CT reconstruction, CONRAD relies on a spline-based object model to represent the attenuation properties of patients and tools in space. This sparse data representation results in images that are highly controllable but may lack the realism needed for sim-to-real transfer in the x-ray domain. Nevertheless, in 2018 CONRAD was adapted for in silico training of intelligent systems in cardiac statistical shape modeling [202], digital subtraction angiography (DSA) [203], and motion reduction in diagnostic knee imaging [204]. DSA refers to the acquisition of subsequent fluoroscopic images with and without contrast agent, the subtraction of which yields a background-free image focusing on the blood vessel [205]. Virtual, single-frame DSA removes the need for a non-contrast acquisition by segmenting the foreground vessel from the background, which Unberath et al [203] accomplish automatically with a U-Net trained on CONRAD. Finally, although Bier et al [204] consider diagnostic applications, specifically compensating for patient motion during load-bearing acquisition of CBCT, their proposal to automatically detect anatomical landmarks from projective images, based on in silico training, has since gained ground in MIS applications.

5.2. MC-GPU

Monte Carlo simulation enables realistic, physics-based simulation of radiation transport, but requires significant compute time. Developed with the aim of accelerating simulation of x-ray and CT images, MC-GPU features massively parallel Monte Carlo simulation of photon interactions with volumetric patient models, leveraging advancements in GPUs [110, 206]. Compared to single-core CPU execution, MC-GPU achieves up to a 27-fold speedup in simulation time, using hardware available in 2009. Despite these advantages, MC-GPU still requires enough compute time that it remains impractical for generating datasets at the scale needed for training deep learning algorithms.

5.3. DeepDRR

To catalyze in silico training in the x-ray domain, Unberath et al [44, 111] contribute a Python framework for fast, physics-based DRR synthesis with sufficient realism for sim-to-real transfer. While previous approaches that focused on image realism employed Monte Carlo simulation of photon absorption and scatter, DeepDRR approximates these effects in a physically realistic manner by projecting through segmented CT volumes and subsequently estimating photon scatter with a DNN, trained on Monte Carlo ground truth, which enables generation of tens of thousands of images in a matter of hours [44]. Separately, the popularity of Python in the machine learning community, due to open source tools like PyTorch [155] and TensorFlow [156], makes DeepDRR an appealing research tool with a ready community for quickly generating synthetic x-ray datasets. Like previous DRR methods, DeepDRR relies on patient CT to estimate the material density and attenuation properties, with segmentations of air, soft tissue, and bone achieved in a patch-wise manner with a V-Net [207]. This allows for attenuation modeling based on a realistic x-ray spectrum, compared to single material, mono-energetic DRRs. They demonstrate the effectiveness of this increased realism by training a DNN to detect anatomical landmarks of the pelvis in real images, which was not found to be possible when using conventional DRRs for in silico training, due to the sim-to-real gap.
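In simplified form, and leaving scatter and noise to be modeled separately, physics-based DRR synthesis of this kind amounts to a polyenergetic Beer–Lambert projection over the segmented materials, $I(u) \approx \int p_0(E)\, E \exp\left(-\sum_m (\mu/\rho)_m(E) \int_{\textrm{ray}(u)} \rho(\mathbf{x})\, \mathbb{1}[\mathbf{x} \in m]\, \mathrm{d}\ell\right) \mathrm{d}E$, where $p_0(E)$ is the source spectrum, $(\mu/\rho)_m(E)$ the energy-dependent mass attenuation coefficient of material $m$, and $\rho$ the density inferred from CT; the exact formulation in [44] may differ in detail.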

The original pelvis landmark annotation task in Unberath et al [44] continues to be a major application of this research, since landmarks can be used to initialize a variety of 2D/3D registration problems through feature-based methods, supporting percutaneous approaches [193]. Considering arbitrary viewpoints, Bier et al [208] find that in silico training with DeepDRR enabled successful 2D/3D registration on real radiographs [26, 208]. For the same task, [209] develop view-dependent weights for anatomical landmark detection in the pelvis, supporting registration with a success rate of 86.8% and error of $4.2 \pm 3.9$ mm.

Further work on 2D/3D registration of the pelvis proposes alternative initialization strategies to landmark detection. Gu et al [210] propose a learning-based approach to estimating the transformation between two projections, creating a training set of 21 555 DRRs from five high-resolution CTs from the NIH Cancer Imaging Archive [211]. Their approach extends the capture range of conventional 2D/3D registration, removing the need for careful initialization. Alternatively, [212] train a DNN to detect contours in pelvic radiographs, increasing the robustness of contour-based 2D/3D registration and achieving a best-case error of 1.02 mm.

In x-ray guided MIS, quantitatively assessing tool-to-tissue spatial relationships from radiographs can be difficult for featureless or flexible tools, such as K-wires or continuum robots [213]. To support this task, Unberath et al [111] extend DeepDRR to support modeling of surgical tools, specifically robotic end effectors, in DRRs, along with simultaneous keypoint localization and segmentation. Individual components of a flexible continuum manipulator are projected to provide segmentation ground truth [213]. Esfandiari et al [214] extend this paradigm by synthesizing DRRs from the same view with and without metal implants in the spine, enabling a DNN to inpaint radiographs and remove metal implants. This can improve 2D/3D registration of the spine, wherein the sharp contrast of metal implants tends to inhibit intensity-based registration methods.

Building on physics-based DRR synthesis, Toth et al [215] show that domain randomization (DR) can improve DRR-to-x-ray performance for cardiac 2D/3D registration. They train an RL agent to iteratively update a cardiac model to align with real radiographs, demonstrating higher stability when DR is applied during training. The purpose of DR, which applies unrealistic image transformations to the training set, is to introduce such large variation that the DNN avoids local, domain-specific minima in the loss function. Previously mentioned work has similarly utilized DR for fully automatic 2D/3D registration in the pelvis, based on anatomical landmark detection [209].

The advantages of DR are further demonstrated in Gao et al [216], which shows that physics-based x-ray synthesis using DeepDRR, combined with strong DR, is comparable to GAN-based domain adaptation and outperforms GAN-based domain adaptation with conventional DRRs, although this work is not yet peer-reviewed. This is advantageous because the image transformations involved in 'strong DR,' such as image inversion, blurring, warping, and coarse dropout, among others, are computationally inexpensive, whereas GANs require additional training with sufficient real images as an unlabeled reference. They demonstrate this approach, coined 'SyntheX,' on three representative tasks: pelvic landmark detection for 2D/3D registration, detection and segmentation of a continuum manipulator, and COVID-19 diagnosis from chest x-rays [216].
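A minimal sketch of a few such transformations (inversion, blurring, and coarse dropout; warping omitted for brevity), using NumPy and SciPy with illustrative parameters rather than those of [216]:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def strong_domain_randomization(image, rng=None):
    """Apply a random subset of inexpensive, deliberately unrealistic augmentations."""
    rng = np.random.default_rng() if rng is None else rng
    img = image.astype(np.float32).copy()
    if rng.random() < 0.5:                          # intensity inversion
        img = img.max() - img
    if rng.random() < 0.5:                          # Gaussian blurring
        img = gaussian_filter(img, sigma=rng.uniform(0.5, 3.0))
    if rng.random() < 0.5:                          # coarse dropout: zero out random patches
        h, w = img.shape
        for _ in range(rng.integers(1, 5)):
            y = rng.integers(0, max(h - 16, 1))
            x = rng.integers(0, max(w - 16, 1))
            img[y:y + 16, x:x + 16] = 0.0
    return img
```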

Inspired by [204], Kausch et al [18] propose an intelligent system for automatic C-arm repositioning, regressing a pose update based on intermediate landmark detection to obtain standard views of the pelvis. Without additional fine-tuning, their system obtained desired clinical views when evaluated on real x-rays, to within inter-rater variability [18]. They refine this work in the domain of spine surgery, introducing K-wire and pedicle screw implants as augmentations to the training set [217]. K-wire simulation was accomplished through post-processing of the x-ray image with quadratic Bézier curves, while pedicle screws were simulated in the DRR synthesis process similar to Unberath et al [111]. Related work in this area uses a DNN to estimate the C-arm pose in simulation [218].

Recent applications of in silico training with DeepDRR continue to focus on minimally invasive orthopedic procedures. In order to anticipate complications during percutaneous pelvic fixation, [219] train a DNN to assess the positioning of a K-wire with respect to safe corridors through bone. Although they do not demonstrate sim-to-real performance, their approach outlines a roadmap to detecting unfavorable K-wire trajectories through the superior pubic ramus and potentially providing CT-free guidance for pelvic fixation. Separately, Sukesh et al [220] detect bounding boxes of vertebral bodies in 2D x-rays, demonstrating the advantage of adding synthetic x-rays over purely real-image training.

5.4. Alternative frameworks

Despite their drawbacks with regard to realism, conventional DRRs have been used for in silico training, and additional frameworks rely on these tools due to their simplicity, especially in combination with other sim-to-real techniques. For example, Dhont et al [221] propose combining conventional DRRs with GANs [85] to synthesize photorealistic DRRs, demonstrating the performance of their RealDRR framework in terms of quantitative image similarity. Utilizing this approach for in silico training, Zhou et al [222] target automatic landmark detection on cranial fluoroscopic images, for the purpose of 2D/3D registration. They overcome the sim-to-real gap using a GAN with real radiographs as the target domain [85].

Sometimes, though, the sim-to-real gap is small enough that additional techniques are not necessary. This is perhaps the case when modeling metal tools, which are single-material, homogeneous objects and align well with the assumptions of conventional DRRs. Considering this, Kügler et al [223] demonstrate that conventional DRRs can facilitate automatic 2D/3D registration of surgical instruments, including screws, drills, and robot components, by recognizing pseudolandmarks similar to [208]. Their approach, coined i3PosNet, generalizes from training with DRRs to real x-ray images with no additional adaptation steps, achieving registration errors of less than 0.05 mm.

Recently, alternatives to volumetric and sparse data representations have been proposed, using multi-layer perceptrons to represent scenes for rendering purposes [224]. Huy and Quan [225] apply this methodology to x-ray simulation, proposing Neural Radiance Projection (NeRP) to produce 'variationally reconstructed radiographs'. In this approach, a differentiable renderer allows gradients to backpropagate through the projection step so that the patient volume is learned based on real x-rays. As in RealDRR and XPGAN [226], they use a GAN [85] to improve image realism as a final step. Although NeRP has so far only been used for in silico training of a diagnostic intelligent system, this simulation framework has potential applications in MIS, where novel view synthesis is a desirable capability. A framework implementing their approach is not yet available.
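
Because an implementation of NeRP is not available, the following is only a minimal sketch of the underlying idea in PyTorch: an MLP represents a 3D attenuation field, a line integral along each ray is computed by sampling that field, and the Beer-Lambert law converts the integral to an intensity, so gradients from an image-space loss flow back into the field. The network size, ray parameterization, and placeholder data are our own assumptions, and the GAN-based realism step is omitted.

```python
import torch
import torch.nn as nn

class AttenuationField(nn.Module):
    """Tiny MLP mapping a 3D point to a non-negative attenuation coefficient.
    Far smaller than networks used in practice; for illustration only."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Softplus(),
        )

    def forward(self, x):
        return self.net(x)

def project(field, origins, directions, n_samples=64, near=0.0, far=2.0):
    """Approximate the attenuation line integral along each ray and convert it
    to an intensity with the Beer-Lambert law; differentiable end to end."""
    t = torch.linspace(near, far, n_samples)                               # (S,)
    pts = origins[:, None, :] + t[None, :, None] * directions[:, None, :]  # (R, S, 3)
    mu = field(pts.reshape(-1, 3)).reshape(pts.shape[0], n_samples)        # (R, S)
    dt = (far - near) / n_samples
    return torch.exp(-mu.sum(dim=1) * dt)                                  # (R,)

# Hypothetical usage: fit the field so its projections match measured pixels.
field = AttenuationField()
opt = torch.optim.Adam(field.parameters(), lr=1e-3)
origins = torch.zeros(128, 3)                      # placeholder ray origins
directions = torch.randn(128, 3)
directions = directions / directions.norm(dim=1, keepdim=True)
measured = torch.rand(128)                         # placeholder pixel intensities
for _ in range(10):
    opt.zero_grad()
    loss = ((project(field, origins, directions) - measured) ** 2).mean()
    loss.backward()                                # gradients flow through projection
    opt.step()
```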

To minimize radiation dose during CT imaging, Abadi et al [227] propose a volumetric, Monte Carlo simulation framework with device- and institution-specific parameters for imaging, called DukeSim. Although intended for CT simulation, DukeSim's individual projections have been used in combination with voxelized statistical shape models [81] to reduce variability in x-ray imaging due to device- and institution-specific factors, which can affect the consistency and reliability of diagnostic imaging [228].

Finally, although it is not specifically developed with machine learning in mind, DRRGenerator [112] may yet be of interest to the community because of its intuitive user interface and integration with the open-source medical imaging software 3D Slicer [107]. Currently, the popular DeepDRR tool requires users to develop a sampling strategy of sufficiently varied views in order to guarantee view-invariant DNN performance, as in [208]. With additional capabilities focused on in silico training, DRRGenerator would be to x-ray image-guided interventions what VisionBlender and AMBF+ are to endoscopic and stereo microscopic image-guided procedures, respectively.
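
For readers implementing such a sampling strategy, a minimal sketch follows that draws virtual C-arm source positions uniformly from a spherical cap around a nominal anterior-posterior view; each sampled pose would then be passed, together with the patient CT, to a DRR renderer such as DeepDRR. The cap-based parameterization, function name, and parameter values are our own assumptions and are not part of any particular tool's API.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_carm_views(n, max_offset_deg=30.0, source_dist_mm=1000.0):
    """Sample n source positions on a spherical cap of half-angle max_offset_deg
    about the +z axis (a stand-in for the nominal AP direction). The distance
    from source to isocenter and the cap size are illustrative values."""
    max_rad = np.deg2rad(max_offset_deg)
    positions = []
    for _ in range(n):
        cos_theta = rng.uniform(np.cos(max_rad), 1.0)   # uniform over cap area
        theta = np.arccos(cos_theta)
        phi = rng.uniform(0.0, 2.0 * np.pi)
        direction = np.array([np.sin(theta) * np.cos(phi),
                              np.sin(theta) * np.sin(phi),
                              np.cos(theta)])
        positions.append(source_dist_mm * direction)    # relative to the isocenter
    return np.stack(positions)

# e.g. 500 varied views for one patient model, each rendered into a DRR
source_positions = sample_carm_views(500)
```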

6. Additional imaging modalities

Much of the interest in in silico training for MIS has focused on 2D imaging modalities: endoscopy, US, and x-ray. We speculate that this is because large datasets suitable for training DNNs are easier to obtain in the 2D domain, where a single annotation of a 3D image can be propagated to thousands of samples. Nevertheless, simulating 3D images and 3D physical interactions is of interest to develop intelligent systems focused on intra-operative CT, MRI, and 3D US.
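
As a simple example of this propagation, a landmark annotated once in a 3D volume can be projected into every simulated 2D view with a standard pinhole model, as sketched below; the matrix conventions (3x3 intrinsics K and 3x4 extrinsics [R | t]) are illustrative.

```python
import numpy as np

def project_landmark(landmark_mm, intrinsics, extrinsics):
    """Project one 3D landmark (annotated once, e.g. in a CT volume) into 2D
    detector coordinates for a single view using a pinhole camera model."""
    p = np.append(landmark_mm, 1.0)         # homogeneous 3D point
    uvw = intrinsics @ (extrinsics @ p)     # 3x3 K times (3x4 [R | t] times p)
    return uvw[:2] / uvw[2]                 # 2D pixel coordinates

# One 3D annotation yields a 2D label for every simulated view:
# labels_2d = [project_landmark(lm, K, E) for E in sampled_view_extrinsics]
```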

Toward this end, Lee et al [229] propose a simulated environment for training autonomous needle insertion robots using RL. They model the deformation of a beveled-tip needle in a dynamic environment based on stochastic processes, providing negative rewards for collisions with obstacles such as bone and positive rewards when the needle reaches the biopsy target. In the future, physics-based simulation of needle insertion may provide simulation for both CT-guided and, through a platform like DeepDRR [44], x-ray-guided needle insertion.
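
A minimal sketch of reward shaping in this spirit is given below; the penalty, reward, and tolerance values are purely illustrative and are not taken from [229].

```python
import numpy as np

def needle_reward(tip_pos, target_pos, obstacle_hit, reached_tol_mm=2.0):
    """Return (reward, episode_done): penalize collisions with obstacles such
    as bone, reward reaching the biopsy target, and provide a small dense
    penalty proportional to the remaining distance to encourage progress."""
    if obstacle_hit:
        return -10.0, True                   # collision: penalty, episode ends
    dist = float(np.linalg.norm(np.asarray(tip_pos) - np.asarray(target_pos)))
    if dist <= reached_tol_mm:
        return 10.0, True                    # target reached: reward, episode ends
    return -0.01 * dist, False               # dense shaping toward the target
```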

CT guidance must contend with severe image artifacts introduced by metal implants, such as pedicle screws. In order to minimize metal artifacts in intra-operative CBCT, Zaech et al [230] and Thies et al [231] train a DNN to adjust C-arm trajectories during image acquisition, using DeepDRR for in silico training. Their method iteratively adjusts the out-of-plane angle of a robotic C-arm to avoid 'poor images' characterized by beam hardening, photon starvation, and noise.
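
A highly simplified, hypothetical sketch of such trajectory adjustment is shown below, in which a learned scoring function (standing in for the trained DNN applied to the current view) rates candidate out-of-plane offsets at each in-plane step and the best one is kept. The greedy structure, step sizes, and scoring interface are our own assumptions, not the published method.

```python
import numpy as np

def plan_trajectory(score_fn, n_steps=180, candidate_offsets=(-10.0, 0.0, 10.0)):
    """Greedily adjust the out-of-plane angle over an acquisition.
    score_fn(in_plane_deg, out_of_plane_deg) -> float is a placeholder for a
    learned image-quality predictor; angles and step counts are illustrative."""
    out_of_plane = 0.0
    trajectory = []
    for step in range(n_steps):
        in_plane = step * (360.0 / n_steps)
        scores = [score_fn(in_plane, out_of_plane + d) for d in candidate_offsets]
        out_of_plane += candidate_offsets[int(np.argmax(scores))]
        trajectory.append((in_plane, out_of_plane))
    return trajectory
```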

Recently, image-free guidance for MIS has been explored. Årsvold et al [232] simulate electrical properties of target tissue types in order to train an intelligent system for minimally invasive lymphadenectomy, a surgical treatment for cancer. Their system determines whether a lymph node is present underneath an electrical impedance scanner, which can be deployed as part of a robot-assisted MIS system, achieving 93.49% accuracy in ex vivo tissue phantom experiments.

7. Discussion and conclusion

There is significant potential for in silico training to produce the next generation of MIS systems, deploying AI to assist providers in alleviating the inherent challenges of minimally invasive approaches, in particular by improving the acquisition and interpretation of intra-operative images. In silico simulation provides a training ground limited only by the constraints of generating realistic-looking images and the techniques available for doing so. In principle, as data representations and physics-based simulations continue to mature, in silico training can expand to include near-limitless experience for learning-based algorithms, from supervised learning to RL, by providing rich annotations from a perfectly controlled virtual environment.

One notable constraint on in silico simulation for MIS is the availability of patient data on which to base simulated images and interactions. Much of this review has focused on 2D imaging modalities for exactly this reason, since the generation of endoscopic, x-ray, or US images allows for hundreds or thousands of training samples to originate from a single patient model [44, 97, 161]. For example, DRRs vary widely in visual appearance based on the position and orientation of the virtual C-arm, and techniques such as DR increase the variance of the training data to further improve sim-to-real transfer [209, 215, 216]. However, existing techniques for generating these realistic-looking images rely on 3D patient models derived from patient data, including CT, MRI, or prior endoscopic reconstructions, for example. This introduces a potential bottleneck to in silico training, where the long-tailed distribution of real-world situations presents too much variation for large but still finite data to train models with sufficient generalizability. Moreover, simulation of 3D intra-operative images, such as CT, MRI, and 3D US, must rely on existing digital phantoms such as XCAT [81], which tend to produce images with lower variation and realism.

As previously discussed, generalizability differs from sim-to-real domain adaptation in that it is concerned not just with image appearance or tissue characteristics, for example, but with any number of variations that may arise in the course of surgery, where anomalous anatomy and complications produce images outside the training domain. Imaging techniques, surgical setups, and patient demographics vary significantly from one institution to another, so it is challenging to train AI models that are resilient to such variations [86, 233]. The reliance on finite patient data underlying current in silico simulation methods implies that not all patient variation can be represented, and unseen conditions, such as the presence of foreign surgical instruments from a manufacturer not considered during training, may result in deteriorated performance.

An integrated physics engine is a crucial way that in silico training platforms can increase the utility of each patient model, introducing variation based on tissue deformation and tool-to-tissue physical interaction. In the visible light domain, a great deal of attention has focused on simulation for robotic laparoscopy, including complex interactions like suturing [98] and camera manipulation [29]. These rely either on hand-built virtual environments or patient-based reconstructions, which enable more realistic image formation [97]. In order to realize the full potential of in silico training beyond the visible light domain, there is a need to develop simulation frameworks for image formation of CT, MRI, x-ray, and US based on sparse data representations that are conducive to physics engine modeling. Leveraging the existing physics and robotic modeling capabilities of simulators like AMBF [45] will require developing rendering frameworks like DeepDRR [44] (x-ray) and Field II [100] (US) that produce images capable of overcoming the sim-to-real gap for each modality, although these existing frameworks for non-optical simulation rely on volumetric methods to produce realistic images. Since converting between volumetric and sparse virtual object models is computationally prohibitive, advances in simulation techniques are required to generate realistic images from sparse models.

Aside from simulating physics-based tool-to-tissue interactions, the next generation of in silico simulation frameworks should overcome the dependence on finite CT or MRI scans by generating realistic patient models with specific pathologies, demographics, and characteristics, rather than selecting or adapting existing images. In this review, we have mostly discussed generative models as one way to overcome the sim-to-real gap, using GANs [85], but such models can also be used to generate realistic CT or MRI images [234]. Conditioning generative models based on desired patient properties would enable large-scale, fully virtual cohort sampling. If designed so as to be compatible with physics engines as discussed above, this level of sophistication would enable in silico training to acquire vastly more virtual experience that can be transferred to real images than would be possible based on finite real images or conventional simulation.

Overall, the opportunities for innovation in in silico training and its application to MIS constitute an exciting area of inquiry, especially as machine learning algorithms continue to mature at an impressive rate. Indeed, given the rapid pace of current progress, future reviews re-examining in silico training may be necessary before long in order to describe the next generation of intelligent surgical systems arising from this key enabling technology.

Data availability statement

No new data were created or analyzed in this study.
