As described in the previous section, we validated HaraliCU by comparing the values of the features contrast, correlation, energy, and homogeneity with those extracted using MATLAB's built-in functions graycomatrix and graycoprops.
4.2 Computational results
The computational performance of the pipeline presented in this work was assessed by independently considering the two steps parallelized on the GPU: Haralick feature extraction and unsupervised SOM-based image pixel clustering.
HaraliCU The CUDA-based version of the Haralick feature extraction employed in our pipeline was tested against a CPU version coded in C++, which proved far more efficient than the MATLAB version, based on the graycomatrix and graycoprops functions, at extracting Haralick features from brain metastasis MR images [42]. As a matter of fact, by varying the grayscale range from \(2^4\) to \(2^9\) levels, the C++ version achieved speed-up values of around \(50\times \) and \(200\times \), respectively, over the MATLAB version.
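For reference, the four descriptors used in the validation are standard functions of the normalized GLCM \(p\). A minimal C++ sketch following the graycoprops definitions (an illustrative re-implementation, not the HaraliCU code) is:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// The four validated descriptors of a normalized GLCM p (levels x levels).
struct Features { double contrast, correlation, energy, homogeneity; };

// Computes contrast, correlation, energy, and homogeneity following the
// graycoprops definitions; correlation assumes a non-constant window
// (non-zero marginal standard deviations).
Features haralick(const std::vector<std::vector<double>>& p) {
    int n = static_cast<int>(p.size());
    double mu_i = 0, mu_j = 0;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) { mu_i += i * p[i][j]; mu_j += j * p[i][j]; }
    double sd_i = 0, sd_j = 0;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            sd_i += (i - mu_i) * (i - mu_i) * p[i][j];
            sd_j += (j - mu_j) * (j - mu_j) * p[i][j];
        }
    sd_i = std::sqrt(sd_i); sd_j = std::sqrt(sd_j);
    Features f{0, 0, 0, 0};
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            f.contrast    += (i - j) * (i - j) * p[i][j];
            f.correlation += (i - mu_i) * (j - mu_j) * p[i][j] / (sd_i * sd_j);
            f.energy      += p[i][j] * p[i][j];
            f.homogeneity += p[i][j] / (1.0 + std::abs(i - j));
        }
    return f;
}
```

For instance, the uniform \(2\times 2\) GLCM with all entries equal to 0.25 yields contrast 0.5, energy 0.25, homogeneity 0.75, and correlation 0.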
The GPU version of HaraliCU was executed on an NVIDIA GeForce GTX Titan X (3072 cores, clock 1.075 GHz, 12 GB of RAM), with CUDA Toolkit version 8 (driver 387.26), running on a workstation with Ubuntu 16.04 LTS equipped with an Intel Core i7-2600 CPU (clock 3.4 GHz) and 8 GB of RAM. The CPU version was run on the same workstation, relying upon the computational power of the Intel Core i7-2600 CPU. The CPU code was compiled with the GNU C++ compiler (version 5.4.0) and the GPU code with the CUDA Toolkit 8, using the optimization flag -O3 in both cases.
In order to collect statistically sound results and take into consideration the variability and heterogeneity typically characterizing medical images, we randomly selected 30 images from 3 different patients (10 per patient) affected by brain metastases and 30 images from 3 different patients affected by ovarian cancer. We tested both the CPU and GPU versions by considering various window sizes, that is, \(\omega \in \{3, 7, 11, 15, 19, 23, 27, 31\}\), as well as two different grayscale depths (i.e., \(2^8\) and \(2^{16}\) intensity levels). For each combination of \(\omega \) and grayscale depth, we also enabled and disabled the GLCM symmetry to evaluate how the symmetry affects the running time.
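How the symmetry flag influences the workload can be seen in a minimal serial GLCM-construction sketch (illustrative only, not HaraliCU's actual GPU implementation): with symmetry enabled, every co-occurring pair \((a, b)\) is also counted as \((b, a)\), roughly doubling the counting work per window.

```cpp
#include <cassert>
#include <vector>

// Build a GLCM for a grayscale window with pixel offset (dx, dy).
// If `symmetric` is true, each pair (a, b) is also counted as (b, a),
// which is why disabling the symmetry reduces the per-window work.
std::vector<std::vector<int>> glcm(const std::vector<std::vector<int>>& win,
                                   int levels, int dx, int dy, bool symmetric) {
    std::vector<std::vector<int>> m(levels, std::vector<int>(levels, 0));
    int rows = static_cast<int>(win.size());
    int cols = static_cast<int>(win[0].size());
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c) {
            int r2 = r + dy, c2 = c + dx;
            if (r2 < 0 || r2 >= rows || c2 < 0 || c2 >= cols) continue;
            ++m[win[r][c]][win[r2][c2]];        // count the pair (a, b)
            if (symmetric)
                ++m[win[r2][c2]][win[r][c]];    // also count (b, a)
        }
    return m;
}
```

With offset \((1, 0)\) and the \(2\times 2\) window \(\{\{0,1\},\{1,0\}\}\), the non-symmetric GLCM counts one occurrence each of \((0,1)\) and \((1,0)\), while the symmetric GLCM counts two of each.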
The speed-up achieved by HaraliCU considering only \(2^8\) intensity levels increases almost linearly up to \(\omega = 19\) (data not shown, see [42] for details); by disabling the GLCM symmetry and using \(\omega = 31\), we obtained the highest speed-ups of \(12.74\times \) and \(12.71\times \) on brain metastasis (\(256\times 256\) pixels) and ovarian cancer images (\(512\times 512\) pixels), respectively. When the full dynamics of the grayscale levels (i.e., \(2^{16}\)) is considered, HaraliCU outperforms the sequential counterpart, achieving speed-ups of up to \(15.80\times \) with \(\omega = 31\) on brain metastasis images and \(19.50\times \) with \(\omega = 23\) on ovarian cancer images. On the ovarian cancer images, the speed-up decreases when \(\omega \) exceeds 23 pixels, for two reasons. First, since one thread is launched per pixel, each thread must consider more neighboring pixels, which might have very different gray-level intensities; this increases the per-thread workload and, since GPU cores run at a lower clock frequency than CPU cores, reduces the speed-up. Second, the GPU resources become saturated, as the GLCM associated with each thread may grow because of the high full-dynamic range. In this specific situation, the total GLCM size might exceed the capacity of the global memory, so some threads must handle multiple pixels, computing the corresponding Haralick features sequentially.
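The memory argument can be made concrete with back-of-the-envelope arithmetic (a dense layout and a 4-byte entry size are assumptions for illustration; the actual GLCM size per thread depends on the gray levels effectively present in the window):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative arithmetic only: a dense GLCM with `levels` gray levels has
// levels * levels entries, so keeping one GLCM per pixel in global memory
// scales quadratically with the grayscale depth.
uint64_t dense_glcm_bytes(uint64_t levels, uint64_t pixels,
                          uint64_t bytes_per_entry) {
    return levels * levels * pixels * bytes_per_entry;
}
```

At \(2^8\) levels a dense GLCM takes 256 KiB per pixel, whereas at the full \(2^{16}\) dynamic range it would take 16 GiB per pixel, which is why only the effective gray-level range can be stored and why, past a certain point, threads must process pixels sequentially.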
CUDA-SOM The performance of CUDA-SOM was assessed by comparing it against a single-core C++ version specifically developed for this work, since the available R implementation of the SOM is limited to a network size of \(150\times 150\) neurons.
We first ran a batch of tests aimed at analyzing the impact of the number of samples and of the SOM size on the running time. We employed a machine equipped with 16 GB of RAM, a CPU Intel Core i7 4790k (clock 4.4 GHz), and an NVIDIA GeForce 1050ti (768 cores, clock 1.392 GHz, 4 GB of RAM).
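As background for the timings below: one on-line SOM training step searches for the best-matching unit (BMU) and then pulls every neuron towards the sample with a neighborhood weight, and these are the stages that a GPU implementation such as CUDA-SOM can spread across threads. A minimal serial sketch (a Gaussian neighborhood is assumed; learning-rate and radius decay are omitted) is:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Minimal on-line SOM: a rows x cols grid of neurons, each with `dim`
// weights stored row-major in `w`.
struct SOM {
    int rows, cols, dim;
    std::vector<double> w;  // rows * cols * dim weights

    // Best-matching unit: index of the neuron closest to sample x
    // (squared Euclidean distance).
    int bmu(const std::vector<double>& x) const {
        int best = 0;
        double bestD = 1e300;
        for (int n = 0; n < rows * cols; ++n) {
            double d = 0;
            for (int k = 0; k < dim; ++k) {
                double diff = w[n * dim + k] - x[k];
                d += diff * diff;
            }
            if (d < bestD) { bestD = d; best = n; }
        }
        return best;
    }

    // One training step: move every neuron towards x, weighted by a
    // Gaussian of its grid distance from the BMU.
    void update(const std::vector<double>& x, double lr, double sigma) {
        int b = bmu(x), br = b / cols, bc = b % cols;
        for (int n = 0; n < rows * cols; ++n) {
            int r = n / cols, c = n % cols;
            double dist2 = double(r - br) * (r - br) + double(c - bc) * (c - bc);
            double h = std::exp(-dist2 / (2.0 * sigma * sigma));
            for (int k = 0; k < dim; ++k)
                w[n * dim + k] += lr * h * (x[k] - w[n * dim + k]);
        }
    }
};
```

Both the BMU search (a reduction over neurons) and the update (independent per neuron) grow with the grid size, which is consistent with the GPU paying off only for sufficiently large SOMs.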
As reported in Table 1, the running time of the C++ version is lower for small SOMs (i.e., \(20\times 20\) neurons), while the GPU allows us to reduce the running time by up to \(5.75\times \) when a SOM of \(300\times 300\) neurons is trained with 120,000 samples. Additional tests (data not shown) confirmed this trend, with the speed-up further increasing to \(\sim 7\times \) for a SOM of \(400\times 400\) neurons.
Table 1
Running time required by the C++ and GPU versions of SOM, by varying the number of samples used to train the network and the number of neurons
Samples | Neurons | C++ time | GPU time | Speed-up |
30,000 | \(20 \times 20\) | 50 | 144 | \(0.34\times \) |
60,000 | \(20 \times 20\) | 101 | 288 | \(0.35\times \) |
120,000 | \(20 \times 20\) | 859 | 2479 | \(0.34\times \) |
30,000 | \(150 \times 150\) | 1424 | 486 | \(2.93\times \) |
60,000 | \(150 \times 150\) | 2928 | 980 | \(2.98\times \) |
120,000 | \(150 \times 150\) | 5736 | 1950 | \(2.94\times \) |
30,000 | \(300 \times 300\) | 10318 | 1798 | \(5.73\times \) |
60,000 | \(300 \times 300\) | 19978 | 3478 | \(5.74\times \) |
120,000 | \(300 \times 300\) | 39920 | 6940 | \(5.75\times \) |
As a second batch of tests, we compared the performance of different NVIDIA GPUs, i.e., Titan Z (\(2 \times 2880\) cores, clock 0.876 GHz, 6 GB of RAM), Titan X (GM200, 3072 cores, clock 1.075 GHz, 12 GB of RAM), GeForce 1050ti (768 cores, clock 1.392 GHz, 4 GB of RAM), and GeForce 1080ti (3584 cores, clock 1.582 GHz, 11 GB of RAM), when executing CUDA-SOM with different SOM sizes, considering 60,000 samples and 7 features.
Table 2 reports the speed-up values achieved by each GPU with respect to the C++ implementation. As expected, for small SOMs the CPU was more convenient than the GPUs; the GeForce 1080ti, which has the highest clock frequency among the tested GPUs, obtained the best results, achieving a \(10\times \) speed-up for the SOM with \(400\times 400\) neurons.
Table 2
Speed-up achieved by CUDA-SOM using different GPUs compared to the C++ implementation
Neurons | Titan Z | Titan X | GeForce 1050ti | GeForce 1080ti |
\(20 \times 20\) | 0.16 | 0.21 | 0.33 | 0.34 |
\(80 \times 80\) | 1.48 | 1.72 | 2.27 | 2.85 |
\(250 \times 250\) | 4.98 | 4.78 | 5.00 | 7.66 |
\(400 \times 400\) | 6.86 | 6.61 | 6.70 | 10.03 |
Considering the analysis performed on medical images, in the case of ovarian cancer CT the running time (including file loading) was 79 s with 100 iterations and 1020 s with 1000 iterations. To appreciate the advantage of CUDA-SOM, consider that the same SOM algorithm, implemented in C++ with OpenMP, requires 2956 s to complete 100 iterations: the reduction achieved by CUDA-SOM corresponds to a \(37\times \) speed-up.
CUDA-SOM was executed on an NVIDIA Tesla P100 (3584 cores, clock 1.329 GHz, 16 GB of RAM), with CUDA Toolkit version 8 (driver 440.95.01), on a computing node of the Cambridge Service for Data Driven Discovery (CSD3) running Scientific Linux 7. Each node is equipped with a single Intel Xeon E5-2650 v4 CPU (clock 2.2 GHz), 94 GB of RAM, and up to 4 NVIDIA Tesla P100 GPUs. The CPU version was run on the same node, relying upon the computational power of the Intel Xeon E5-2650 v4 CPU. The CPU code was compiled with the GNU C++ compiler (version 5.4.0) and the GPU code with the CUDA Toolkit 8.0, using the optimization flag -O3 in both cases.