Published in: Medical & Biological Engineering & Computing 10/2023

Open Access 17.05.2023 | Original Article

Super-resolution of 2D ultrasound images and videos

Authors: Simone Cammarasana, Paolo Nicolardi, Giuseppe Patanè

Abstract

This paper proposes a novel deep-learning framework for the super-resolution of ultrasound images and videos in terms of spatial resolution and line reconstruction. To this end, we up-sample the acquired low-resolution image through a vision-based interpolation method; then, we train a learning-based model to improve the quality of the up-sampling. We qualitatively and quantitatively test our model on images of different anatomical districts (e.g., cardiac, obstetric) and with different up-sampling factors (i.e., 2X, 4X). Our method improves the PSNR median value with respect to SOTA methods by \(1.7\%\) on obstetric 2X raw images, \(6.1\%\) on cardiac 2X raw images, and \(4.4\%\) on abdominal 4X raw images; it also increases the number of pixels with a low prediction error by \(9.0\%\) on obstetric 4X raw images, \(5.2\%\) on cardiac 4X raw images, and \(6.2\%\) on abdominal 4X raw images. The proposed method is then applied to the spatial super-resolution of 2D videos, by optimising the sampling of the lines acquired by the probe in terms of the acquisition frequency. Our method specialises the trained networks to predict the high-resolution target through the design of the network architecture and the loss function, taking into account the anatomical district and the up-sampling factor and exploiting a large ultrasound data set. The use of deep learning on large data sets overcomes the limitations of vision-based algorithms, which are general-purpose and do not encode the characteristics of the data. Furthermore, the data set can be enriched with images selected by medical experts to further specialise the individual networks. Through learning and high-performance computing, the proposed super-resolution is specialised to different anatomical districts by training multiple networks. Finally, the computational demand is shifted to centralised hardware resources, with a real-time execution of the network’s prediction on local devices.


1 Introduction

Ultrasound (US, for short) acquisition applies high-frequency sound waves to visualise soft tissues and internal organs, and supports medical diagnosis of musculoskeletal, cardiac, and obstetric diseases. US acquisition has many advantages with respect to magnetic resonance imaging and computed tomography, such as its portability, low cost, and non-invasiveness. Furthermore, its real-time acquisition provides instantaneous feedback to the physician, e.g., during regional anaesthesia. Through US videos, the physician analyses the temporal variation of an anatomical feature (e.g., the movement of a muscle, the volume of the ventricle), which can be generated either by the shift of the probe or by the movement of the anatomical part. 2D US videos are acquired through 2D probes, which capture sequences of images at a given frequency.
The resolution of each image is constrained by the required frequency of the video, since some anatomical districts (e.g., cardiac) require a high acquisition frequency to accurately capture the behaviour of anatomical features that change quickly over time. For example, US videos of the cardiac district require a high temporal frequency to acquire anatomical parts (e.g., the mitral valve) that move quickly; a higher temporal frequency allows the radiologist to better characterise the movement of the anatomical part.
Our goal is the design of a novel deep-learning framework for the super-resolution of 2D US images, which increases the image resolution and reconstructs non-acquired beamlines. We define the non-acquired beamlines as the lines intermediate to those acquired by the probe; these intermediate lines are not acquired, in order to increase the acquisition frequency, and are instead approximated by the super-resolution scheme. Applying our approach to US videos with a low spatial resolution and a high frequency (e.g., for the cardiac district), we can generate a high-frequency 2D US video with an increased spatial resolution of each frame, thus overcoming the main limit of current US probes, whose spatial resolution decreases as the acquisition frequency increases. Given a 2D video with a low spatial resolution of each frame, our method reconstructs the spatially high-resolution video in real-time.
First, we compare several state-of-the-art up-sampling algorithms (Section 2) and identify the best method in terms of quantitative metrics and visual evaluation. Then, we train a neural network to improve the results of the up-sampling to match the target image (i.e., the high-resolution image). Our network does not perform the interpolation of the missing lines, as this task is already performed by the up-sampling; instead, it learns how to transform the up-sampled lines into the target lines. To improve the quality of the up-sampling, we train multiple networks, each one specialised to the input anatomical district (e.g., cardiac, abdominal) and its low-resolution factor (e.g., 0.5X, 0.25X). The execution time of the super-resolution depends on the up-sampling and the network prediction; the prediction is achieved in real-time on standard medical hardware. The proposed framework (Fig. 1) generates the data sets within the pre-processing phase, the learning-based models within the training phase, and the real-time super-resolution prediction within the test phase.
As the main contribution (Section 3), we propose a novel learning-based architecture that combines convolutional layers and rectified linear unit (ReLU) activation functions (Fig. 2) and improves the Wide Activation for Efficient and Accurate Image Super-Resolution (WDSR) network [45]. The kernel size is selected according to the dimension of the low-resolution image, to guarantee that at least two original lines (i.e., two lines acquired by the probe) are always included in the convolution operation. Then, we modify the loss function to improve the visual accuracy of the prediction: our logarithmic loss includes only the up-sampled lines, excluding the lines acquired by the probe.
The proposed approach is general in terms of the building blocks of the framework; in fact, we can select different up-sampling algorithms, e.g., Single Image Super Resolution (SISR) [32] and the Enhanced Super Resolution Generative Adversarial Network (ESRGAN) [42], and different deep learning architectures, e.g., Pix2Pix [18] and VGG19 [37]. As experimental validation (Section 4), we perform a quantitative and qualitative evaluation of our framework on a large collection of US images acquired from different anatomical districts (e.g., musculoskeletal, obstetric, abdominal). Then, we apply our method to the spatial super-resolution of 2D US videos, and we evaluate the effects of denoising the raw images as pre-processing of our framework. Finally, we present a discussion of the main outcomes (Section 5), conclusions, and future work (Section 6). Trained models and training/test code are available at https://github.com/cammarasana123/US-SuperResolution.
2 Related work

Learning-based US super-resolution
The main novelties of the enhanced deep super-resolution network [24] are a simplification of conventional residual network architectures and a multi-scale super-resolution network that reduces the model size. Exploiting the sparsity of the signal in the Fourier domain, the interpolation of missing data [43] allows the reconstruction of the high-resolution ultrasound (HR US) image with a low computational cost. The super-resolution generative adversarial network (SRGAN) [23] applies a deep residual network with skip-connections and a perceptual loss between generated and target images. The reduction of the artefacts of the previous method is addressed by the Enhanced SRGAN [42], which improves the network architecture and the adversarial and perceptual losses, removes the batch normalisation layer, and applies residual scaling and smaller initialisation values.
The perceptual quality of ESRGAN is improved by the ESRGAN+ method [35] through a novel Residual-in-Residual Dense Residual block, which increases the network capacity without affecting its complexity. The application of the SRGAN to US images [9] preserves both the anatomical structures and the speckle noise pattern, thus improving the perceptual quality of the up-sampled images. Dilated convolution [27] extracts the internal recurrence information from the test image; this method up-samples low-resolution (LR) images when few LR-HR training pairs are available. A fully convolutional U-net [40] obtains high-resolution vascular images from high-density contrast-enhanced US signals. In [39], a deep learning method exploiting feature extraction blocks, repeating blocks, and up-sampling layers applies up-sampling factors in the range 2-8. A self-supervised CycleGAN [26] only requires the LR US image and generates perceptually consistent up-sampling results. Combining a CycleGAN, a two-stage GAN, and zero-shot super-resolution [13], it is possible to obtain super-resolution images with low blurring artefacts.
Vision-based US super-resolution
Learning-based methods suffer from artefacts and blurring when dealing with noisy signals; several vision-based methods have been proposed over the years. Interpolating up-sampling with cubic kernels [20] offers high accuracy with a low computational cost, through appropriate boundary conditions and constraints on the kernel functions. In [3], a novel deconvolution-based method applies maximum a posteriori estimation to the restoration of the tissue response and is validated with several tissue-mimicking phantoms with specific scatterer concentrations. The Alternating Direction Method of Multipliers [30] is applied to the super-resolution of US images, including deblurring and denoising [28], through a combination of \(\ell _{1}\) and \(\ell _{2}\) minimisation. In [44], a deconvolution method that models the envelope of the radio-frequency signal and the point spread function is robust to noise and does not require knowledge of the centre frequency of the acquired signal. Assuming a Gaussian distribution for both the unknown signal to be restored and the point spread function, in [46] the reconstructed image is built through a posterior model with hybrid Gibbs sampling [15]. The properties of the decimation matrix in the Fourier domain [47] are exploited to solve the super-resolution problem with an \(\ell _{p}\)-norm regulariser, \(p \in [1,2]\). The method in [21] exploits repetitive data in the non-local neighbourhood of samples of the envelope of the radio-frequency signal.
Device-based US super-resolution
The second harmonic image [38] contains less noise and blur than the first harmonic image. Furthermore, the lateral resolution is increased, as the harmonic pulse is auto-focused because the higher harmonics are generated in the centre of the beam. Then, image super-resolution is achieved by combining the first and second harmonic images. Both spatial and temporal deconvolution operations [25] are achieved by accounting for the transmit and receive processes, the electrical transducer characteristics, and the transmit focusing laws. Combining phase-contrast imaging, angular spectral decomposition, and a super-resolution reconstruction technique [10], it is possible to recover the location and dimensions of objects smaller than the imaging wavelength. The reconstruction through generalised Tikhonov regularisation [22] is evaluated as a function of the transmit-receive bandwidth and the focal number of the transducer, by comparing the results with traditional B-mode imaging. The Time-domain Optimised Near-field Estimator [41] assumes an observation model based on the superposition of spatial responses; then, a maximum a posteriori estimation finds the distribution and amplitude of hypothetical targets that match the observed data with minimal target energy. As a further improvement, the Diffuse Time-domain Optimised Near-field Estimator [14] represents each hypothetical target in the system model as a diffuse region of targets rather than a single discrete target, thus inducing a better signal approximation. The cellular microscopy technique of multi-focal imaging [12] is applied to localise the unique position of the scatterer of the signal: three foci receive multiple overlapping curves, and a maximum likelihood estimation allows the identification of the source of the scatterer.
Multi-frame US super-resolution
Bilinear Deformable Block Matching [4], a registration method that accounts for the complex and deformable motion of soft tissues, is applied to reconstruct the HR image by exploiting the shifting property of the Fourier transform and the aliasing relationship between the continuous Fourier transform of the HR image and the discrete Fourier transform of the LR images [29]. The use of deep learning for motion estimation among different frames [2] reduces the effect of noise and artefacts and reconstructs HR images from a sequence of LR images. The modelling of the spatial correlation of the speckle noise [11] is applied to standard reconstruction methods and validated on tissue-mimicking phantoms and co-registered multi-frame images.

3 Proposed super-resolution of US signals

Ultrasonic waves are emitted by the probe and travel in straight lines through the body structures along their path; when they pass through adjacent parts of the body with different acoustic impedance, a fraction of the ultrasound pulse returns as a reflected wave, generating an echo that returns to the probe, while the rest of the wave continues to penetrate along the beam to greater tissue depths. The amplitude of the reflected echo is proportional to the difference in acoustic impedance between the two adjacent media. For example, interfaces between soft tissue and dense organs (e.g., bones) generate very strong echoes due to a large acoustic impedance gradient. The acoustic impedance is a physical property of a medium, defined as the density of the medium times the velocity of wave propagation. Human body tissues have different acoustic impedances: for example, air-containing organs (such as the lung) have the lowest acoustic impedance, while dense organs have a higher acoustic impedance.
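As a quantitative illustration of the relation between impedance mismatch and echo strength, the following minimal sketch computes the acoustic impedance \(Z=\rho c\) and the intensity reflection coefficient at a planar interface; the tissue values are typical textbook figures, assumed here rather than taken from the paper.

```python
# A minimal sketch (typical textbook values, not from the paper): acoustic
# impedance Z = density * speed, and the fraction of incident intensity
# reflected at a planar interface between two media.
def acoustic_impedance(density_kg_m3, speed_m_s):
    """Z in rayl = kg / (m^2 s)."""
    return density_kg_m3 * speed_m_s

def reflection_coefficient(z1, z2):
    """Intensity reflection coefficient R = ((z2 - z1) / (z2 + z1))^2."""
    return ((z2 - z1) / (z2 + z1)) ** 2

z_soft = acoustic_impedance(1060.0, 1540.0)    # soft tissue: ~1.63 MRayl
z_bone = acoustic_impedance(1900.0, 4080.0)    # cortical bone: ~7.75 MRayl
print(reflection_coefficient(z_soft, z_bone))  # ~0.42: strong echo at a bone interface
```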
The echo signals are processed and combined to generate the underlying image, which has a resolution of \(l \times d\), where l is the number of beamlines (i.e., the lateral resolution) and d is the acquisition depth of each beamline (i.e., the axial resolution). Axial resolution refers to the ability to discern two separate objects that are longitudinally adjacent to each other in the ultrasound image; lateral resolution refers to the ability to discern two separate objects that are laterally adjacent to each other, and it is usually lower than the axial resolution in ultrasound. The resolution of the image in the lateral direction (i.e., the direction perpendicular to the US propagation along the beamline) is primarily determined by the width of the ultrasound beam and by the number of elements (i.e., the piezoelectric crystals) that are activated to generate the US waves. Current probes vary the number of acquired beamlines by activating/deactivating the piezoelectric crystals, thus reducing the lateral resolution and the image acquisition time. The axial resolution can be varied by changing the length and frequency of the pulses, which affect the penetration of the ultrasound wave.
In this context, we focus on laterally low-resolution image acquisition, which reduces the acquisition time, and we subsequently reconstruct the high-resolution image without losing information in terms of data depth. This aspect is relevant to improve the quality of the US image, its visual interpretation by the physician, and post-processing steps, e.g., classification [1], segmentation [5], and morphological analysis [36].
Proposed super-resolution of US images
Our framework is composed of two steps: first, we up-sample the low-resolution image through an interpolating method; after a comparison of state-of-the-art methods (Section 4.2), we select Cubic Convolution as the up-sampling algorithm. Then, we apply a learning-based network to improve the visual accuracy of the up-sampling.
For the experimental part, we consider the Esaote data set, which contains more than 10K US images at different resolutions, acquired from different anatomical districts (e.g., obstetric, cardiac). Given a high-resolution image (i.e., the target) acquired by the probe, we build the corresponding low-resolution image by keeping one line out of every two (0.5X) or four (0.25X). This approach is consistent with the acquisition of the US image, where the probe can acquire the full, half, or a quarter of the maximum number of beamlines, depending on the activation of the piezoelectric crystals. We up-sample the low-resolution images through Cubic Convolution at 2X (applied to the 0.5X low-resolution images) or 4X (applied to the 0.25X low-resolution images). Then, we use the pairs of up-sampled and target high-resolution images to analyse the proposed framework, through the training and the prediction of the learning-based network, with a specialisation in anatomical districts.
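A minimal sketch of this pair construction is given below, assuming images stored as 2D NumPy arrays of shape (d, l) with the beamlines as columns; OpenCV's INTER_CUBIC interpolation stands in for the cubic-convolution up-sampling (the exact kernel of the paper's implementation is not specified here).

```python
# Sketch of the training-pair construction: down-sample the beamlines,
# then up-sample back to the target size with a cubic interpolation.
import cv2
import numpy as np

def make_pair(hr, factor):
    """Return (up-sampled network input, high-resolution target), factor in {2, 4}."""
    lr = hr[:, ::factor]                 # keep one beamline out of every `factor`
    d, l = hr.shape
    up = cv2.resize(lr, (l, d), interpolation=cv2.INTER_CUBIC)  # dsize is (width, height)
    return up.astype(np.float32) / 255.0, hr.astype(np.float32) / 255.0
```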
We generate a separate training data set of 1.5K images for each anatomical district and for the two up-sampling resolutions of 2X and 4X. Then, the same images are denoised through a low-rank denoising algorithm [8] to build a training data set of 1.5K denoised images for each anatomical district. In total, we train 12 networks (i.e., 3 anatomical districts, 2 up-sampling factors, 2 image types (raw/denoised)), each with 1.5K training images. In addition, for each anatomical district, up-sampling factor, and image type, we generate a validation data set of 400 images and a test data set of 200 images, using each image in only one of the three data sets.
Our approach delegates the interpolation of the missing lines to the up-sampling method, while the learning model deals with the prediction of the target values from the interpolated values of the up-sampling method. Given two images \(\textbf{A}\) and \(\textbf{B}\), both of size \(m \times n\), as quantitative metrics we consider the peak signal-to-noise ratio \(PSNR(\textbf{A},\textbf{B})=10\log _{10}\frac{ (\max (\textbf{A}))^{2} }{MSE(\textbf{A},\textbf{B})}\), where the mean squared error is \(MSE(\textbf{A},\textbf{B})=\frac{1}{m\times n} \sum _{i=1}^{m} \sum _{j=1}^{n} (\textbf{A}_{ij} - \textbf{B}_{ij})^{2}\), and the structural similarity index measure
$$\begin{aligned} \begin{array}{l} SSIM(\textbf{A},\textbf{B}) = l(\textbf{A},\textbf{B}) \times c(\textbf{A},\textbf{B}) \times s(\textbf{A},\textbf{B}),\\ l(\textbf{A},\textbf{B})=\frac{2\mu _{\textbf{A}}\mu _{\textbf{B}}+C_{1}}{\mu _{\textbf{A}}^{2} + \mu _{\textbf{B}}^{2} + C_{1} }, \end{array} \end{aligned}$$
$$\begin{aligned} \begin{array}{l} c(\textbf{A},\textbf{B}) = \frac{2\sigma _{\textbf{A}}\sigma _{\textbf{B}}+C_{2}}{\sigma _{\textbf{A}}^{2} + \sigma _{\textbf{B}}^{2} + C_{2}},\qquad s(\textbf{A},\textbf{B}) = \frac{\sigma _{\textbf{AB}} + C_{3}}{ \sigma _{\textbf{A}} \sigma _{\textbf{B}} + C_{3}}, \end{array} \end{aligned}$$
where \(\mu _{(\cdot )}\) is the mean, \(\sigma _{(\cdot )}\) is the standard deviation, \(\sigma _{\textbf{AB}}\) is the covariance between \(\textbf{A}\) and \(\textbf{B}\), and the positive constants \(C_{1}\), \(C_{2}\), and \(C_{3}\) avoid a null denominator. We also consider the mean absolute error \(MAE(\textbf{A},\textbf{B})=\frac{1}{m\times n}\sum _{i=1}^{m} \sum _{j=1}^{n}|\textbf{A}_{ij} - \textbf{B}_{ij}|\) and the pointwise absolute error image \(| \textbf{A}- \textbf{B}|\) for the comparison of the high-resolution target with both the up-sampled image and the prediction of the network. We also compare the histograms of the absolute value of the prediction error, to analyse the number of pixels whose error is lower than a given threshold.
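As a reference, a minimal NumPy implementation of these metrics could read as follows; the 0-255 pixel range and the use of \(\max (\textbf{A})\) as the PSNR peak are assumptions of the sketch.

```python
# Minimal NumPy implementations of the metrics above, assuming float images
# on the 0-255 scale; max(A) is used as the PSNR peak value.
import numpy as np

def mse(a, b):
    return np.mean((a - b) ** 2)

def mae(a, b):
    return np.mean(np.abs(a - b))

def psnr(a, b):
    return 10.0 * np.log10(a.max() ** 2 / mse(a, b))

def error_histogram(a, b, bin_width=5, n_bins=10):
    """Histogram of |A - B|: the first bin counts pixels with error below 5."""
    edges = np.arange(0, bin_width * (n_bins + 1), bin_width)
    return np.histogram(np.abs(a - b), bins=edges)[0]
```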
Deep learning network
We select WDSR [45], an architecture that exploits residual blocks, since it improves the prediction of images where the difference between the input and the target is small, and we propose a customised version of this network: custom-WDSR. In particular, our network architecture is a variant of WDSR-A, where the expansion of the features before the rectified linear unit (ReLU) activation allows more information to pass through while preserving the non-linearity of the network. After the normalisation of the data, we apply a 2D convolution and a weight normalisation that improves the conditioning of the optimisation problem and thus the convergence. Then, we apply 8 residual blocks with wide activation, where each residual block is composed of two convolution layers with ReLU activation, followed by a final 2D convolution with a weight normalisation layer; finally, we combine the residual blocks and the convolution layers and apply the denormalisation. The kernel size depends on the up-sampling factor: \((3 \times 3)\) for 2X up-sampling and \((5 \times 5)\) for 4X up-sampling. The convolution layers do not need to interpolate the missing values, since this operation has already been performed by the up-sampling algorithm; for this reason, we implement neither the WDSR-B variant, which adds a linear low-rank convolution, nor pixel shuffling for the deconvolution operation. With this setting, the total number of trained parameters is 889K for the 2X network and 253K for the 4X network.
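A minimal Keras sketch of this architecture is given below; the expansion factor of 4 in the wide-activation blocks and the omission of the weight normalisation and of the data (de)normalisation layers are simplifying assumptions of the sketch, not settings taken from the paper.

```python
# A sketch of the custom-WDSR predictor: the input is the already up-sampled
# image, so the network keeps the spatial size and only refines the values.
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=10, kernel=3, expansion=4):
    """Wide activation: expand the features before ReLU, then project back."""
    h = layers.Conv2D(filters * expansion, kernel, padding="same", activation="relu")(x)
    h = layers.Conv2D(filters, kernel, padding="same")(h)
    return layers.Add()([x, h])

def custom_wdsr(kernel=3, n_blocks=8, filters=10):
    inp = tf.keras.Input(shape=(None, None, 1))        # up-sampled image at target size
    x = layers.Conv2D(filters, kernel, padding="same")(inp)
    skip = x
    for _ in range(n_blocks):
        x = residual_block(x, filters, kernel)
    x = layers.Conv2D(filters, kernel, padding="same")(x)
    x = layers.Add()([x, skip])                        # combine blocks and skip path
    out = layers.Conv2D(1, kernel, padding="same")(x)  # single-channel prediction
    return tf.keras.Model(inp, out)

model_2x = custom_wdsr(kernel=3)   # 3x3 kernels for 2X; use kernel=5 for 4X
```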
Given a target image \(y\) of size \(L \times D\) and its approximation \(\hat{y}\), our loss function is defined as
$$\begin{aligned} Loss(y,\hat{y}) = \sum \limits _{l=1}^{L}\sum \limits _{d=1}^{D} e_{ld}, \qquad e_{ld} = \left\{ \begin{array}{ll} \log \frac{\vert y_{ld}-\hat{y}_{ld} \vert + \epsilon }{k}, &{} \bmod (l,s)\ne 0, \\ 0, &{} \text {otherwise}, \end{array}\right. \end{aligned}$$
where s is the spacing of the lines acquired by the sensor (e.g., \(s=4\) when 4X up-sampling is applied), so that the acquired lines (i.e., \(\bmod (l,s)=0\)) do not contribute to the training loss; \(\epsilon =10^{-4}\) avoids a null argument of the logarithm, and \(k=5\) determines the curvature of the logarithmic loss function. The loss emphasises the pixels whose error is less than 5 on the 0-255 range, to improve the visual similarity between the prediction and the target image. We underline that the data are normalised in the range 0-1, so the value of k is set to \(5/255 \approx 0.019\). The value of \(\epsilon \) is selected sufficiently smaller than \(1/255\approx 4\cdot 10^{-3}\), which is the normalised value of the smallest possible difference between two pixels; we have experimentally set \(\epsilon =10^{-4}\). The kernel size of the convolution filter depends on the up-sampling factor: we apply a \(3\times 3\) filter for 2X up-sampling and a \(5\times 5\) filter for 4X up-sampling, so that at least two lines acquired by the probe are included in the convolution operator. Finally, we set the number of layers to 16 and the number of kernels to 10. The learning rate iteratively decreases down to \(10^{-6}\), and the number of epochs is set to 200. The input and output layers of the network have size \(\#batch \times L \times D\).
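A minimal TensorFlow sketch of this masked logarithmic loss could read as follows, assuming tensors of shape (batch, L, D, 1) with the line index on axis 1 and the acquired lines at indices that are multiples of s:

```python
# Sketch of the masked logarithmic loss: the mask zeroes out the lines
# acquired by the probe, so only the interpolated lines are penalised.
import tensorflow as tf

def masked_log_loss(s, eps=1e-4, k=5.0 / 255.0):
    def loss(y_true, y_pred):
        n_lines = tf.shape(y_true)[1]
        line_idx = tf.range(n_lines)
        # 1 on interpolated lines, 0 on the lines acquired by the probe
        mask = tf.cast(tf.math.floormod(line_idx, s) != 0, y_true.dtype)
        mask = tf.reshape(mask, (1, -1, 1, 1))
        err = tf.math.log((tf.abs(y_true - y_pred) + eps) / k)
        return tf.reduce_sum(err * mask)
    return loss
```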

4 Experimental results

We discuss the results of the proposed super-resolution of 2D US images (Section 4.1) and compare our results with previous work (Section 4.2); we present the results with 2D US videos (Section 4.3) and noisy images (Section 4.4), and discuss the execution time (Section 4.5).

4.1 Proposed super-resolution of US images

We train each learning-based network (custom-WDSR) with 1.5K images, where the input is the outcome of the selected up-sampling method (i.e., Cubic Convolution) and the target is the original high-resolution image. Note that the input and target images have the same resolution, as the reconstruction of the missing lines has already been performed by the Cubic Convolution. Figures 3, 4, and 5 show the results of the network prediction, compared with the input and the target images. Target images correspond to spatially high-resolution images; input images are the outcome of the up-sampling interpolation, which is applied to the spatially low-resolution images (i.e., the down-sampling along the lateral direction of the high-resolution images); prediction images represent the output of the neural network.
Our framework visually improves the results in terms of blurring and artefacts. This improvement is more evident in the magnification of the ear of the foetus (Fig. 3), the mitral valve (Fig. 4), and the mass edges (Fig. 5). Figure 6 shows the error images of the three anatomical districts with both 2X and 4X up-sampling factors, with the maximum error in the scale \(0-255\). The error is more evident on the contours of the anatomical structures; moreover, the abdominal district shows a smaller error than the cardiac and obstetric ones. We underline that the colour scale of each image is normalised to its maximum, to improve the visualisation of the error.
Figure 7(a-b-c, left) shows the box plots of the statistics of the PSNR on the three anatomical districts, comparing the target images with the prediction and with the Cubic Convolution, respectively. The metrics are computed on a data set of 200 images of the same district and with the same up-sampling factor. The PSNR median value improves by \(1.7\%\) on obstetric 2X raw images, \(6.1\%\) on cardiac 2X raw images, and \(4.4\%\) on abdominal 4X raw images.
Figure 7(a-b-c, right) shows the histograms of the absolute value of the error with respect to the target image, for the prediction and the Cubic Convolution results, respectively. The histograms show the number of pixels where the prediction error is lower than 5 (i.e., the first bin of the histogram), which corresponds to pixels visually very similar to the target. From the Cubic Convolution to the predicted images, this value increases by \(9.0\%\) on obstetric 4X raw images, \(5.2\%\) on cardiac 4X raw images, and \(6.2\%\) on abdominal 4X raw images.
Figure 8 shows the box plots of the SSIM (a-b-c, left) and MAE (a-b-c, right) metrics, computed as for the PSNR metric. These metrics also show that our method improves the results of the Cubic Convolution, both in terms of average value and variability. For example, the SSIM median value improves by \(2.5\%\) on obstetric 4X images and the MAE median value improves by \(4.7\%\) on cardiac 2X images.
The analysis of the absolute value of the difference between the input and the prediction of the network (Fig. 9) shows that our prediction alters the pixel values by at most 20 in absolute value, mainly on the edges of the anatomical structures; furthermore, the black uniform areas are less affected by the prediction. In terms of the distance between the input and the prediction, we do not observe a significant difference among anatomical districts or between 2X and 4X up-sampling.
We also verify the robustness of our method on images with different brightness. Characterising the brightness of an image as the average value of all its pixels, we test images with high and low brightness on different anatomical districts and up-sampling factors. Figures 10 and 11 show that the prediction of our trained network is robust to different values of image brightness, without lowering the output accuracy or generating artefacts. Comparing the input and the prediction of our network with the target image, we improve the PSNR value from 43.46 to 43.55 on high-brightness images of the abdominal district with 2X up-sampling, and from 31.01 to 31.48 on low-brightness images of the obstetric district with 4X up-sampling.

4.2 Comparison with previous work

We address both the comparison among the state-of-the-art algorithms used for the selection of the up-sampling method of our framework and the comparison of our results with previous work. Among the up-sampling SOTA methods, we test four methods belonging to different classes: Cubic Convolution [20], a kernel-based interpolating method; Enhanced Deep Residual Networks (EDSR) [24], a learning-based method trained on generic images; Enhanced Super-Resolution Generative Adversarial Network Plus (ESRGAN+) [35], a learning-based GAN method, specialised on US images with a dedicated training; Single Image Super Resolution (SISR) [32], an up-sampling method that exploits sparse representations. We evaluate the up-sampling results of the selected methods on different anatomical districts and resolutions: the obstetric district with 0.5X down-sampling (Fig. 12) and the abdominal district with 0.25X down-sampling (Fig. 13). Figure 14 shows the error images between the target and the SOTA super-resolution on both 2X and 4X up-sampling, with the maximum error value in the range \(0-255\): Cubic Convolution visually achieves the best results in terms of approximation error. Furthermore, our method improves the error images with respect to Cubic Convolution, improving the approximation of the target image, including the maximum error. All the error images of each up-sampling factor are represented with the same colour scale, to better visualise the differences among the methods.
Tables 1, 2, and 3 summarise the comparison with the PSNR, SSIM, and MAE metrics on a test data set of 200 images. Cubic Convolution has a mean PSNR value of 36.52 and 42.17 for 2X and 4X up-sampling, respectively. According to these results, we select Cubic Convolution as the best method for the up-sampling of US images: it interpolates the missing lines without generating artefacts. In comparison, our method improves the results of previous work (Figs. 12 and 13, Table 1), with a mean PSNR value of 37.00 and 44.35 for 2X and 4X super-resolution, respectively. Finally, we underline that 4X super-resolution on the abdominal district achieves better results than 2X super-resolution on the obstetric district, due to the different complexity and variety of each anatomical district data set.

4.3 Proposed super-resolution of US videos

Applying our approach to US videos with a low spatial resolution and a high frequency (e.g., for the cardiac district), we can generate a high-frequency 2D US video with an increased spatial resolution of each frame, thus overcoming the main limit of current US probes, whose spatial resolution decreases as the acquisition frequency increases. The relationship between the image resolution and the video frequency f is \(f=c/(2\cdot d \cdot l)\), where c is the speed of sound. The acquisition of low-resolution US images thus allows the physician to increase the acquisition frequency. The probe acquires a reduced number of lines: we refer to 0.5X and 0.25X low-resolution images as \(l/2 \times d\) and \(l/4 \times d\) resolution, respectively. We refer the reader to the uploaded video for the experimental tests on the spatial super-resolution of 2D US videos (see the URL below). In the video, the input signal is a 2D US video at full resolution \(L \times D \times T\), with L lines, D depth, and T frames. We down-sample each image at L/2 or L/4 and apply our framework for the spatial super-resolution, to reconstruct the full-resolution 2D video. Video URL: https://www.dropbox.com/s/p42pzxxvgf9gacl/SuperResolution-US.mp4?dl=0.
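As a worked example of the trade-off expressed by \(f=c/(2\cdot d \cdot l)\), the following snippet uses typical values (assumed here, not taken from the paper): halving the number of acquired lines doubles the attainable frame rate.

```python
# Worked example of f = c / (2 * d * l): fewer acquired lines -> higher frame rate.
c = 1540.0   # speed of sound in soft tissue, m/s
d = 0.15     # acquisition depth, m
for l in (128, 64, 32):  # full, 0.5X, and 0.25X lateral resolution
    print(f"{l} lines -> {c / (2 * d * l):.1f} frames/s")  # ~40, ~80, ~160
```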

4.4 Denoising and super-resolution

To evaluate the effect of denoising on the super-resolution of US images, we apply a learning-based low-rank denoising to the input raw images, which allows us to select a soft smoothing intensity. This approach generates denoised images that are visually similar to the raw images and, at the same time, more uniform. Then, we generate the down-sampled images (0.5X and 0.25X) and apply the Cubic Convolution up-sampling. These pairs of images (i.e., denoised at full resolution and up-sampled) are used to train the learning-based network (Section 3). With this approach, we verify the performance of both the up-sampling algorithm and our learning-based prediction when applied to denoised input images.
Table 1
Concerning Figs. 12 and 13, we report the PSNR metric computed between the target and the up-sampling methods, as the mean value over the 200 test images. Best results in bold

Test                 Obstetric 2X          Abdominal 4X
Cubic Convolution    36.52                 42.17
EDSR                 32.08                 34.91
SRGAN                33.70                 36.35
SISR                 34.75                 38.58
OURS                 \(\mathbf {37.00}\)   \(\mathbf {44.35}\)
Table 2
Concerning Figs. 12 and 13, we report the SSIM metric computed between the target and the up-sampling methods, as the mean value over the 200 test images. Best results in bold

Test                 Obstetric 2X          Abdominal 4X
Cubic Convolution    0.935                 0.904
EDSR                 0.878                 0.61
SRGAN                0.902                 0.632
SISR                 0.927                 0.773
OURS                 \(\mathbf {0.941}\)   \(\mathbf {0.906}\)
Table 3
Concerning Figs. 12 and 13, we report the MAE metric \([\cdot 10^{-2}]\) computed between the target and the up-sampling methods, as the mean value over the 200 test images. Best results in bold

Test                 Obstetric 2X          Abdominal 4X
Cubic Convolution    0.8                   1.19
EDSR                 1.08                  7.55
SRGAN                1.21                  4.13
SISR                 0.92                  3.31
OURS                 \(\mathbf {0.75}\)    \(\mathbf {1.16}\)
Figure 15 shows the results of the prediction of the network, compared with the input and the target denoised images of the obstetric district. Our framework visually improves the results in terms of blurring and artefacts. Figure 16 shows the error image of our prediction with respect to the target denoised image, for both 2X and 4X up-sampling. The error is mainly distributed on the edges of the anatomical structure. Furthermore, the maximum error of the 2X up-sampling is 6 in the range \(0-255\), showing that our method accurately predicts the target if a soft denoising is applied before the up-sampling.
Figure 17 (left) shows the box plots of the quantitative metrics, comparing the target images with the prediction and with the Cubic Convolution, respectively. The PSNR metric is computed on a data set of 200 images belonging to the same district and with the same up-sampling factor. Analysing the obstetric anatomical district and comparing with the corresponding raw images (Fig. 7 (a, left)), the denoising allows the network to significantly improve the results of both the up-sampling and the prediction. In particular, comparing the target images with the predicted images, the median PSNR value of obstetric 2X denoised images is 51.8, against a median PSNR value of 36.9 for obstetric 2X raw images.
Figure 17 (right) shows the histograms of the absolute value of the error with respect to the target, for the prediction and the Cubic Convolution, respectively. Our framework increases by \(1.7\%\) and \(14\%\) (2X and 4X, respectively) the number of pixels where the prediction error is lower than 5 (i.e., visually very similar to the target), an improvement with respect to the learning framework applied to raw images. According to Fig. 18, our method improves the accuracy of the Cubic Convolution; for example, the SSIM improves by \(1.3\%\) on cardiac 2X and the MAE improves by \(8.2\%\) on abdominal 4X.

4.5 Execution time and computational cost

We define an HPC implementation of the proposed framework on the CINECA-Marconi100 cluster, exploiting both CPUs (IBM POWER9 AC922) and GPUs (NVIDIA Volta V100). We design a parallel and distributed implementation in TensorFlow2, and we train multiple networks with large data sets for the target medical application. To test the training phase of the learning-based networks in the HPC environment, we exploit 8 nodes, each one composed of 32 cores and 4 accelerators, for a theoretical computational performance of 260 TFLOPS and 220 GB of memory per node. The parallel implementation of the deep learning framework and the high hardware performance reduce the computation time of the training phase by a factor of at least 100 with respect to a serial implementation on a standard workstation. Figure 19 shows the training loss and the validation PSNR: both metrics converge within 100 epochs. In particular, the validation PSNR increases from 41 to 58 after 100 epochs.
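A minimal sketch of a data-parallel TensorFlow2 training setup is given below; the use of MirroredStrategy, the Adam optimiser, and the learning rate are assumptions of the sketch (the paper does not detail the multi-node configuration, which across several nodes would rather rely on MultiWorkerMirroredStrategy).

```python
# Data-parallel training sketch: one replica per visible GPU,
# gradients averaged across replicas at each step.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = custom_wdsr(kernel=3)                  # from the architecture sketch above
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                  loss=masked_log_loss(s=2))       # from the loss sketch above
# model.fit(train_ds, validation_data=val_ds, epochs=200)
```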
The computational cost of the prediction depends on the resolution of the input image and on the architecture of the network: in particular, the computational cost of a convolution operation is \(\mathcal {O}((r/s_{r} \cdot c/s_{c})\cdot (f_{r} \cdot f_{c})\cdot f)\). In our application, the input images have variable resolutions, with a maximum value of \(r=c=600\); the kernel size on rows and columns is \(f_{r} = f_{c} = 3\) for 2X applications and \(f_{r} = f_{c} = 5\) for 4X applications; the stride on rows and columns is \(s_{r} = s_{c} = 1\); and we use 16 convolution operators and 10 kernel filters.
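Plugging these numbers into the cost model gives a rough multiply-accumulate count per image; treating each of the 16 convolutions as a 10-channel-to-10-channel layer is an assumption of this sketch.

```python
# Rough multiply-accumulate count per image from the cost model above; the
# 10-in/10-out channel count per convolution is an assumption of this sketch.
r = c = 600            # maximum image resolution
layers, filters = 16, 10
for k in (3, 5):       # kernel size: 3x3 for 2X, 5x5 for 4X
    macs = layers * (r * c) * (k * k) * filters * filters
    print(f"{k}x{k}: {macs / 1e9:.1f} GMAC")  # ~5.2 and ~14.4 GMAC
```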
We test the prediction on GPU-based hardware, which replicates the hardware of a US scanner currently in use. Given a set of US input images from different districts at different resolutions, the average execution time is 8 milliseconds. Finally, the denoising pre-processing can be performed in real-time through a learning-based method [7].

5 Discussion

SOTA super-resolution methods approximate the unknown values either with deep-learning models that take advantage of large data sets or with interpolating models that account for the neighbouring points through kernel functions. Learning-based models tend to generate artefacts, while interpolating algorithms are general-purpose models that may be less accurate on anatomical districts with complex geometries and may not be robust to noisy images. Our method combines the two approaches: first, we up-sample the low-resolution image through an interpolating method; then, we apply a learning-based network to improve the visual accuracy of the up-sampling on the specific anatomical district, without generating artefacts and improving the super-resolution results with respect to the SOTA methods. Furthermore, neural network models such as WDSR can be used in super-resolution problems not only for the interpolation but also for the specialisation and fine-tuning of the results. As the main requirement of our two-step approach, the up-sampling algorithm (i.e., the first step) must have a low execution time, to keep the entire pipeline real-time.
Through learning and high-performance computing, the proposed super-resolution is specialised to different anatomical districts by training multiple networks. Furthermore, we can improve the offline training with new data and with a-priori and/or additional information on the input data (e.g., anatomical district, image resolution, acquisition methodology/protocol). The training data set can be periodically updated with the up-sampled images, after the expert validation of the super-resolution results, or with new data, to further specialise the individual networks. The use of deep learning on large data sets overcomes the limitations of vision-based algorithms, which are general-purpose and do not fully encode the characteristics of the data. Each network is trained separately from scratch on each anatomical district and up-sampling factor. If only a small data set is available for a certain anatomical district, we can train a general-purpose network and then specialise dedicated networks with a fine-tuning stage on the specific anatomical district.
HPC is widely used for the training of learning models in US processing, for example, for the localisation of the common carotid artery transverse section through RCNN [19], the automatic segmentation of the carotid artery and internal jugular veins [17], fetal standard plane recognition [33], and the segmentation and classification of anatomical structures [31]. HPC and cloud computing also pose new challenges in terms of the reorganisation of the medical analysis pipeline, where the computational demand is shifted to centralised hardware resources with a real-time execution of the network’s prediction on local devices [6].

6 Conclusions

We introduce a novel deep learning framework for the super-resolution of US images, which improves the quality of the up-sampling of a selected state-of-the-art algorithm by training a neural network to match the target high-resolution image. Our method is tested on different anatomical districts (e.g., obstetric, cardiac, abdominal) and up-sampling factors (i.e., 2X, 4X), and it is general with respect to the up-sampling algorithm and the learning model, as long as they comply with the real-time prediction requirement. We analyse the results on 2D images and videos, on both raw and denoised signals, discussing the improvement due to the denoising in terms of up-sampling accuracy, at the cost of a small loss of details in the US signal. Our method specialises the trained networks to predict the high-resolution target through the design of the network architecture and the loss function, taking into account the anatomical district and the up-sampling factor and exploiting a large ultrasound data set.
In future work, we want to extend the framework to 3D US images and to perform, with the Esaote quality department and expert radiologists, a clinical validation of the method through more formalised qualitative surveys and evaluation methods [16, 34], with an interdisciplinary approach that involves engineering, medical science, physics, and computer science.

Acknowledgements

We thank the reviewers for their thorough review and constructive comments, which helped us to improve the technical part and presentation of the revised paper. This work has been partially supported by the European Commission, NextGenerationEU, Missione 4 Componente 2, “Dalla ricerca all’impresa”, Innovation Ecosystem RAISE “Robotics and AI for Socio-economic Empowerment”, ECS00000035. Tests on CINECA Cluster are supported by the ISCRA-C Project HP10CVHIXD.

Declarations

Conflicts of interest

The authors declare no competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Abdel-Nasser M, Melendez J, Moreno A, Omer OA, Puig D (2017) Breast tumor classification in ultrasound images using texture analysis and super-resolution methods. Eng Applic Artif Intell 59:84–92
2. Abdel-Nasser M, Omer OA (2016) Ultrasound image enhancement using a deep learning architecture. In: International conference on advanced intelligent systems and informatics, Springer, pp 639–649
3. Alessandrini M, Maggio S, Porée J, De Marchi L, Speciale N, Franceschini E, Bernard O, Basset O (2011) A restoration framework for ultrasonic tissue characterization. Transactions on Ultrasonics, Ferroelectrics, and Frequency Control 58(11):2344–2360
4. Basarab A, Liebgott H, Morestin F, Lyshchik A, Higashi T, Asato R, Delachartre P (2008) A method for vector displacement estimation with ultrasound imaging and its application for thyroid nodular disease. Med Image Anal 12(3):259–274
5. Brown KG, Ghosh D, Hoyt K (2020) Deep learning of spatiotemporal filtering for fast super-resolution ultrasound imaging. Transactions on Ultrasonics, Ferroelectrics, and Frequency Control 67(9):1820–1829
6. Caballero M, Gómez JA, Bantouna A (2019) Deep-learning and HPC to boost biomedical applications for health (DeepHealth). In: 2019 IEEE 32nd international symposium on computer-based medical systems (CBMS), IEEE, pp 150–155
7. Cammarasana S, Nicolardi P, Patanè G (2022) Real-time denoising of ultrasound images based on deep learning. Medical & Biological Engineering & Computing pp 1–16
8. Cammarasana S, Patanè G (2022) Learning-based low-rank denoising. Signal, Image and Video Processing pp 1–7
9. Choi W, Kim M, HakLee J, Kim J, BeomRa J (2018) Deep CNN-based ultrasound super-resolution for high-speed high-resolution B-mode imaging. In: International ultrasonics symposium, IEEE, pp 1–4
10. Clement G, Huttunen J, Hynynen K (2005) Superresolution ultrasound imaging using back-projected reconstruction. The Journal of the Acoustical Society of America 118(6):3953–3960
11. Cüneyitoğlu Özkul M, Mumcuoğlu ÜE, Sancak İT (2019) Single-image Bayesian restoration and multi-image super-resolution restoration for B-mode ultrasound using an accurate system model involving correlated nature of the speckle noise. Ultrasonic Imaging 41(6):368–386
12. Diamantis K, Greenaway AH, Anderson T, Jensen JA, Dalgarno PA, Sboros V (2017) Super-resolution axial localization of ultrasound scatter using multi-focal imaging. IEEE Trans Biomed Eng 65(8):1840–1851
13. Ding J, Zhao S, Tang F, Ning C (2021) Ultrasound image super-resolution with two-stage zero-shot CycleGAN. In: Journal of Physics: Conference Series, vol 2031. IOP Publishing, p 012015
14. Ellis MA, Viola F, Walker WF (2010) Super-resolution image reconstruction using diffuse source models. Ultrasound Medic Biol 36(6):967–977
15. Geman S, Geman D (1984) Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. Trans Pattern Anal Mach Intell 6:721–741
16. Ghulam QM, Kilaru S, Ou SS, Sillesen H (2020) Clinical validation of three-dimensional ultrasound for abdominal aortic aneurysm. J Vascular Surg 71(1):180–188
17. Groves LA, VanBerlo B, Veinberg N, Alboog A, Peters TM, Chen E (2020) Automatic segmentation of the carotid artery and internal jugular vein from 2D ultrasound images for 3D vascular reconstruction. Int J Comput Assisted Radio Surg 15(11):1835–1846
18. Isola P, Zhu JY, Zhou T, Efros AA (2017) Image-to-image translation with conditional adversarial networks. In: Conf on computer vision and pattern recognition, pp 1125–1134
19. Jain PK, Gupta S, Bhavsar A, Nigam A, Sharma N (2020) Localization of common carotid artery transverse section in B-mode ultrasound images using faster RCNN: a deep learning approach. Med Biological Eng Comput 58(3):471–482
20. Keys R (1981) Cubic convolution interpolation for digital image processing. Trans Acoustics Speech Signal Process 29(6):1153–1160
21. Khavari P, Asif A, Rivaz H (2018) Non-local super resolution in ultrasound imaging. In: International workshop on multimedia signal processing, IEEE, pp 1–6
22. Lavarello R, Kamalabadi F, O’Brien WD (2006) A regularized inverse approach to ultrasonic pulse-echo imaging. Trans Med Imaging 25(6):712–722
23. Ledig C, Theis L, Huszár F, Caballero J, Cunningham A, Acosta A, Aitken A, Tejani A, Totz J, Wang Z, et al (2017) Photo-realistic single image super-resolution using a generative adversarial network. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4681–4690
24. Lim B, Son S, Kim H, Nah S, Mu Lee K (2017) Enhanced deep residual networks for single image super-resolution. In: Proc of the IEEE Conf on computer vision and pattern recognition workshops, pp 136–144
25. Lingvall F (2004) A method of improving overall resolution in ultrasonic array imaging using spatio-temporal deconvolution. Ultrasonics 42(1–9):961–968
26. Liu H, Liu J, Hou S, Tao T, Han J (2021) Perception consistency ultrasound image super-resolution via self-supervised CycleGAN. Neural Computing and Applications pp 1–11
27. Lu J, Liu W (2018) Unsupervised super-resolution framework for medical ultrasound images using dilated convolutional neural networks. In: 3rd international conference on image, vision and computing, IEEE, pp 739–744
28. Morin R, Basarab A, Kouamé D (2012) Alternating direction method of multipliers framework for super-resolution in ultrasound imaging. In: International symposium on biomedical imaging, IEEE, pp 1595–1598
29. Morin R, Basarab A, Ploquin M, Kouamé D (2012) Post-processing multiple-frame super-resolution in ultrasound imaging. Medical imaging: ultrasonic imaging, tomography, and therapy, SPIE 8320:433–440
30. Ng MK, Weiss P, Yuan X (2010) Solving constrained total-variation image restoration and reconstruction problems via alternating direction methods. J Scient Comput 32(5):2710–2736
31. Patel TR, Bodduluri S, Anthony T, Monroe WS, Kandhare PG, Robinson JP, Nakhmani A, Zhang C, Bhatt SP, Bangalore PV (2019) Performance characterization of single and multi GPU training of U-Net architecture for medical image segmentation tasks. In: Proceedings of the practice and experience in advanced research computing on rise of the machines (Learning). ACM
32. Peleg T, Elad M (2014) A statistical prediction model based on sparse representations for single image super-resolution. Trans Image Process 23(6):2569–2582
33. Pu B, Li K, Li S, Zhu N (2021) Automatic fetal ultrasound standard plane recognition based on deep learning and IIoT. IEEE Trans Industrial Inform 17(11):7771–7780
34. Rahimi A, Khalil A, Faisal A, Lai KW (2022) CT-MRI dual information registration for the diagnosis of liver cancer: a pilot study using point-based registration. Current Medical Imaging 18(1):61–66
35. Rakotonirina NC, Rasoanaivo A (2020) ESRGAN+: further improving enhanced super-resolution generative adversarial network. In: IEEE Inter Conf on acoustics, speech and signal processing, IEEE, pp 3637–3641
36. Schoen S, Zhao Z, Alva A, Huang C, Chen S, Arvanitis C (2021) Morphological reconstruction improves microvessel mapping in super-resolution ultrasound. IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control 68(6):2141–2149
37. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
38. Taxt T, Jirik R (2004) Superresolution of ultrasound images using the first and second harmonic signal. Transactions on Ultrasonics, Ferroelectrics, and Frequency Control 51(2):163–175
39. Temiz H, Bilge HS (2020) Super resolution of B-mode ultrasound images with deep learning. Access 8:78808–78820
40. Van Sloun RJ, Solomon O, Bruce M, Khaing ZZ, Eldar YC, Mischi M (2019) Deep learning for super-resolution vascular ultrasound imaging. In: International conference on acoustics, speech and signal processing, IEEE, pp 1055–1059
41. Viola F, Ellis MA, Walker WF (2007) Time-domain optimized near-field estimator for ultrasound imaging: initial development and results. IEEE Trans Med Imaging 27(1):99–110
42. Wang X, Yu K, Wu S, Gu J, Liu Y, Dong C, Qiao Y, Change Loy C (2018) ESRGAN: enhanced super-resolution generative adversarial networks. In: Proceedings of the European Conf on computer vision
43. Yoon YH, Ye JC (2018) Deep learning for accelerated ultrasound imaging. In: International conference on acoustics, speech and signal processing, IEEE, pp 6673–6676
44. Yu C, Zhang C, Xie L (2012) An envelope signal based deconvolution algorithm for ultrasound imaging. Signal Process 92(3):793–800
45. Yu J, Fan Y, Huang T (2020) Wide activation for efficient image and video super-resolution. In: British machine vision conf
46. Zhao N, Basarab A, Kouamé D, Tourneret JY (2015) Joint Bayesian deconvolution and point spread function estimation for ultrasound imaging. In: International symposium on biomedical imaging, IEEE, pp 235–238
47. Zhao N, Wei Q, Basarab A, Kouamé D, Tourneret JY (2016) Single image super-resolution of medical ultrasound images using a fast algorithm. In: International symposium on biomedical imaging, IEEE, pp 473–476
Metadata
Title: Super-resolution of 2D ultrasound images and videos
Authors: Simone Cammarasana, Paolo Nicolardi, Giuseppe Patanè
Publication date: 17.05.2023
Publisher: Springer Berlin Heidelberg
Published in: Medical & Biological Engineering & Computing, Issue 10/2023
Print ISSN: 0140-0118
Electronic ISSN: 1741-0444
DOI: https://doi.org/10.1007/s11517-023-02818-x
