1. Introduction
COVID-19 is a disease caused by the SARS-CoV-2 virus, declared a pandemic by the World Health Organisation on 11 March 2020. At the time of writing, COVID-19 has caused more than one hundred and eighty million confirmed cases and more than three million deaths, with a mortality rate of 2.1% [1]. As hospitals have been shown to have limited availability of adequate equipment, a rapid diagnosis was, and still is, essential to control the spread of the disease, increase the effectiveness of medical treatment and, consequently, the chances of survival without intensive care. The reverse transcription polymerase chain reaction (RT-PCR) method is the primary screening tool for COVID-19, in which SARS-CoV-2 ribonucleic acid (RNA) is detected within an upper respiratory tract sputum sample [2]. However, many countries are unable to provide sufficient testing; moreover, often only people with apparent symptoms are tested, and it takes hours to obtain an accurate result.
Therefore, there is a need for faster and more reliable screening techniques, such as imaging-based methods, that could complement the RT-PCR test to achieve greater diagnostic certainty or even replace it in countries where RT-PCR is not readily available. In some cases, chest X-ray (CXR) abnormalities are seen in patients who initially had a negative RT-PCR test, and several studies have shown that chest computed tomography (CT) has greater sensitivity for COVID-19 than RT-PCR and could be considered a primary diagnostic tool [3,4,5,6]. In response to the pandemic, researchers have rushed to develop models using artificial intelligence (AI), particularly machine learning, to support clinicians [7].
Computed tomography is a well-established medical imaging technique that allows non-invasive visualisation of the interior of an object [8,9,10,11,12,13] and is widely used in many applications, particularly for clinical purposes [14,15,16,17,18]. For this reason, clinical institutions have used CT as an effective and complementary screening tool alongside RT-PCR [5,6], with a sensitivity of up to 98% compared to 71% for RT-PCR [19,20]. In particular, several studies have shown that CT has excellent utility in detecting COVID-19 infections during routine CT examinations performed for reasons unrelated to COVID-19, such as monitoring of elective surgical procedures and neurological examinations [21]. Other scenarios where CT imaging has been exploited include patients with worsening respiratory complications and patients with negative RT-PCR test results who are suspected to be COVID-19 positive due to other factors. Early studies have shown that chest CT images of patients may contain some potential indicators of COVID-19 infection [2,5,6,22], but similar indicators may also appear in non-COVID-19 infections. This issue can make it challenging for radiologists to distinguish COVID-19 infections from non-COVID-19 infections using chest CT [23,24]. However, the duration of diagnosis is the main limitation of CT scan tests: even experienced radiologists need about 21.5 min to analyse the results of each case [25], and during an emergency, a large number of CT images must be evaluated in a very short time, thus increasing the probability of misclassification. For this reason, intelligent diagnosis systems that automatically classify chest CT images can help to speed up the process and rapidly confirm the test result.
In recent years, deep learning workflows have flourished since the introduction of the AlexNet convolutional neural network (CNN) in 2012 [26]. CNNs do not follow the typical image analysis workflow because they can extract features independently, without the need for feature descriptors or specific feature extraction techniques. They therefore differ from conventional machine learning methods in that they require little or no image preprocessing and can automatically infer an optimal data representation from raw images without prior feature selection, resulting in a more objective and less biased process. Furthermore, they have achieved excellent results in many domains, such as computer vision for medical analysis, with images coming from magnetic resonance imaging (MRI) [27], microscopy [28], CT [29], ultrasound [30], X-ray [31], and mammography [32]. They have been successfully applied to a variety of problems, such as classification and segmentation [33,34,35,36]. Deep learning-based methods have also made significant progress in the analysis of lung diseases, a scenario comparable to COVID-19 [37,38,39]. However, lung CT images of COVID-19 and non-COVID-19 patients can be particularly difficult to classify, especially when damage due to pneumonia of different causes is present simultaneously. The main findings in chest CT scans of COVID-19-positive patients are traces of ground-glass opacity (GGO) [40]. Two CT scans, one COVID-19 and one non-COVID-19, are shown in
Figure 1.
The overall objective of this study is to investigate the behaviour of the main existing off-the-shelf CNNs for the classification of patients’ CT images. This work is a preliminary investigation for the future development of a tool that confirms the viral test result or provides more details about the ongoing infection, also considering that, according to the Centers for Disease Control and Prevention (CDC), even if a chest CT or X-ray suggests COVID-19, the viral test is the only specific method of diagnosis [42]. Specifically, we propose a comprehensive investigation of the problem of COVID-19 classification from chest CT images from different perspectives:
- 1.
We present a comparative study of several off-the-shelf CNN architectures in order to select a suitable deep learning model to perform a three-class classification on the public COVIDx CT-2A dataset, specifically divided into COVID-19, pneumonia and healthy cases;
- 2.
On the same dataset, we performed a patient-oriented experiment by grouping all the CT images of each patient, with the aim of providing a per-patient diagnosis;
- 3.
We investigated the robustness of the methods by performing two cross-dataset experiments and evaluating the performance of CNNs previously trained on COVIDx CT-2A. In particular, we performed a two-class classification between COVID-19 and healthy cases, on the COVID-CT dataset, without fine-tuning;
- 4.
We repeated the experiment just described by fine-tuning the most promising CNNs, demonstrating that it is still problematic to integrate automatic methods into the clinical diagnosis of COVID-19.
We demonstrate both that off-the-shelf deep learning architectures can be utilised to classify CT images of COVID-19-affected patients and that, as our cross-dataset experiments show, transfer learning capabilities are still far from offering a concrete contribution in real-world scenarios without additional techniques. The experiments are not intended to provide an exhaustive comparison of the performance of these methods; rather, we wanted to select the most suitable one for our classification of CT images without, for the time being, investigating possible parametric improvements. The purpose is to create a concrete baseline with the potential to be modified and developed further. Moreover, several works in the context of COVID-19 diagnostics have considered small or private datasets or lacked rigorous experimental methods, potentially leading to over-fitting and overestimation of performance [7,43]. For this reason, we:
- 1.
Carefully selected the two datasets on which to conduct the experiments described. In fact, Roberts et al. [7] have recently shown that most of the datasets used in the literature for the diagnosis or prognosis of COVID-19 suffer from duplication and quality problems;
- 2.
Selected COVIDx CT-2A, a public reference dataset specifically proposed for COVID-19 detection from CT imaging, to avoid the high risk of bias due to source problems and to datasets created from unsupervised public online repositories. It is already provided with training, validation, and testing splits.
We verified the robustness of the solution on both the public COVIDx CT-2A and COVID-CT datasets. Our proposed approach achieves promising results on COVID-19 identification, although it does not show satisfactory performance on cross-dataset experiments.
The rest of the article is organised as follows. The following paragraph presents a review of deep learning approaches for COVID-19 detection.
Section 2 describes the datasets used in our experiments and presents the metrics adopted to evaluate the experimental results illustrated in
Section 3. In
Section 4, we analyse and discuss the experimental results and give a comparison with the state of the art. Finally, conclusions and future directions are drawn in
Section 5.
Related Work
Here, we briefly describe some works that have addressed tasks related to COVID-19. Although the research is still evolving, the automatic classification of COVID-19 has gained wide attention from researchers around the world [7,42,44,45]. In this context, we can broadly distinguish the proposed methods into those based on 2D and those based on 3D images. Among the most recent works [46,47], 3D approaches can be useful to avoid losing the interstitial information of the lungs. However, several works have exploited 2D images, showing that representative features of COVID-19 lesions can be extracted for disease detection [48,49,50,51,52,53,54,55,56]. They are all CNN-based and used CT [48,49,50,51,52,53,54,55] or CXR [43,50,56] images. We particularly focused this study on deep learning-based classification methods for COVID-19 detection.
Among the CT-based methods, Jin et al. [48] proposed a deep learning-based system for COVID-19 diagnosis, performing lung segmentation, COVID-19 diagnosis, and localisation of COVID-19-infected slices. In contrast, Hu et al. [51] proposed a weakly supervised multiscale deep learning framework for COVID-19 detection, inspired by the VGG architecture [57], which assimilates different scales of lesion information from chest CT data. Polsinelli et al. [52] implemented a lightweight CNN based on the SqueezeNet model [58] for efficient discrimination of COVID-19 CT images from other community-acquired pneumonia or healthy CT images. Biswas et al. [53] applied a transfer learning strategy to three pretrained models, VGG-16 [57], ResNet50 [59], and Xception [60], combined them with an ensemble stacking strategy, and tested the method on chest CT images. Zhao et al. [55] adopted ResNet-v2, a modified version of ResNet [59]. Moreover, they replaced batch normalisation with group normalisation and applied weight standardisation to all convolutional layers. Lastly, they used weights pretrained on CIFAR-10 [61], ILSVRC-2012 [62], and ImageNet-21k [63] for initialisation.
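To illustrate the normalisation scheme just mentioned, the following sketch (in PyTorch, which is not necessarily the framework used by those authors) pairs a weight-standardised convolution with group normalisation; the epsilon value and layer sizes are assumptions for the example, not taken from the cited work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WSConv2d(nn.Conv2d):
    """Conv2d with Weight Standardisation: the kernel weights are
    standardised to zero mean and unit variance per output channel
    before every forward pass."""
    def forward(self, x):
        w = self.weight
        mean = w.mean(dim=(1, 2, 3), keepdim=True)
        std = w.std(dim=(1, 2, 3), keepdim=True) + 1e-5  # eps assumed
        return F.conv2d(x, (w - mean) / std, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

# Group normalisation used in place of batch normalisation
block = nn.Sequential(
    WSConv2d(3, 8, kernel_size=3, padding=1),
    nn.GroupNorm(num_groups=4, num_channels=8),
    nn.ReLU(),
)
out = block(torch.randn(1, 3, 16, 16))
```

Unlike batch normalisation, neither component depends on the batch statistics, which is what makes the combination attractive for small per-device batches.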
On the subject of CXR-based works, Minaee et al. [50] employed four pretrained models (ResNet18 [59], ResNet50 [59], SqueezeNet [58], and DenseNet-121 [64]) on CXR data and analysed their performance for COVID-19 detection. On the other hand, Signoroni et al. [43] developed BS-Net, a multi-block deep learning-based architecture designed for the assessment of pneumonia severity on CXRs. More recently, Oyelade et al. [56] proposed CovFrameNet, a novel deep learning-based framework based on a substantial image pre-processing step and a CNN architecture for detecting the presence of COVID-19 on CXRs.
Thanks to the powerful discriminative ability of CNNs, several authors have proposed CNN-based frameworks for the diagnosis or prognosis of COVID-19, even though CNNs typically require large-scale datasets to perform a correct classification. However, most of the existing CT scan datasets for COVID-19 contain at most hundreds of CT images [65,66,67]. Therefore, we exploited COVIDx CT-2A [68], composed of 194,922 CT images and described in
Section 2.1.1, to propose a baseline classification approach, and we evaluated it on the external dataset COVID-CT, described in
Section 2.1.2, to assess the generalisability of the proposal. In general, we aim to avoid the following drawbacks:
- 1.
Using small-scale datasets;
- 2.
Using non-robust or multiple unsupervised source datasets;
- 3.
Testing the method without external validation.
Regarding the works that employed the datasets used in our study, Zhao et al. [41] worked on COVID-CT, while Gunraj et al. [54] worked on COVIDx CT-2A. The former is based on a transfer learning approach with the DenseNet network, while the latter proposed COVID-Net CT [54], a deep convolutional neural network tailored for the detection of COVID-19 cases from chest CT images.
This work differs from those described above because:
- (i)
We propose an extensive comparison between different off-the-shelf CNN architectures, in order to identify the most suitable one for the task, using a large, public dataset;
- (ii)
We avoid the high risk of errors due to datasets created from unsupervised online public repositories by using two public reference datasets to validate our approach;
- (iii)
We introduce a preliminary solution based on learning by sampling, showing how CNNs need further improvements to generalise the detection of COVID-19 in heterogeneous datasets.
3. Results
We now describe the experiments conducted in this work. In detail, in
Section 3.1 we first describe the experimental setup adopted for the classification tasks. Then, in
Section 3.2 we report the results of the experiments performed on both datasets.
3.1. Experimental Setup
The images to be classified are lung CTs. They are organised into classes, as described below. Considering this work as a baseline for further investigation, the images are not subject to any preprocessing or augmentation process. In order to make the experiments reproducible, we kept the dataset splits provided by the authors and did not apply any randomisation strategy. We employed two different training strategies:
- (i)
From scratch;
- (ii)
Fine-tuning the previously trained networks.
The tests were carried out on several popular CNNs to find the best architecture for our purpose. The tested networks are AlexNet [26], the Residual Networks [59] ResNet18, ResNet50, and ResNet101, GoogLeNet [76], ShuffleNet [77], MobileNetV2 [78], InceptionV3 [79], and the VGG networks [57] VGG16 and VGG19.
The experiments were performed using the hyperparameter settings described in
Table 1 for all networks to assess potential performance variations. In particular, after empirical evaluation, we adopted Adam, which performed better than the other solvers. In addition, the maximum number of epochs was set to 20 due to the large number of images.
Since COVIDx CT-2A is the largest dataset, we employed it for model training. Its images were divided by the authors according to the following percentages: 70%, 20%, and 10% for training, validation, and testing, respectively. As for the COVID-CT dataset, we used it in two ways: first, it was taken as a whole as a test set; second, it was divided in the same way as COVIDx CT-2A to be used for a fine-tuning strategy.
3.2. CT Image Classification via Deep Learning
Several types of experiments were designed in this work in order to assess the feasibility of the deep learning approach and its robustness. In particular, on the COVIDx CT-2A dataset, we performed:
- 1.
A three-class classification with all the tested networks;
- 2.
A patient-oriented classification using the resulting models.
On the other hand, on the COVID-CT dataset, we realised:
- 1.
A two-class classification on the entire dataset using the four best-performing networks from the previous experiments;
- 2.
A two-class classification using the same four networks, fine-tuning them on this dataset.
3.2.1. Three-Class Classification on COVIDx CT-2A
In this experiment, we trained each network used in this work, using the split proposed by the authors, in order to obtain a baseline result.
Table 2 shows the results obtained with each architecture employed, while
Figure 4 shows the relationship between the MAvG metric and the three classes included in the dataset.
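For reference, the MAvG metric of Figure 4 is commonly defined as the macro average geometric mean of the per-class accuracies; the following is a minimal sketch under that assumption.

```python
from math import prod

def mavg(per_class_accuracies):
    """Macro Average Geometric mean (MAvG): the geometric mean of the
    accuracy obtained on each class. A single weak class drags the
    score down far more than it would an arithmetic mean."""
    k = len(per_class_accuracies)
    return prod(per_class_accuracies) ** (1 / k)

# Hypothetical per-class accuracies for COVID-19, pneumonia, normal
score = mavg([0.97, 0.99, 0.95])
```

This makes MAvG well suited to imbalanced multi-class problems, since a model cannot compensate for a poorly recognised class with the other two.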
3.2.2. Patient-Oriented Classification on COVIDx CT-2A
For this experiment, the models obtained from the experiments described in
Section 3.2.1 were used. We proceeded as follows: following the subdivision provided by the authors, the test set (consisting of 25,658 images) was grouped into one set of images for each of the 426 patients it contains. We ensured that each patient only had images belonging to one class, because otherwise the test would be invalid. Classifying each patient’s images individually would simply reproduce the results of the three-class classification; with this in mind, we decided to use per-patient class accuracy as the decision criterion: if more than 50% of a patient’s images were assigned to the patient’s class, the patient was considered correctly classified; otherwise, the patient was counted as misclassified. In this way, it was possible to see how each model behaved with each class, and finally, the average accuracy was calculated to describe the overall accuracy of each network, as shown in
Table 3.
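The patient-level decision rule can be sketched as follows (plain Python; the class labels and helper name are illustrative, not taken from the experiment code):

```python
def patient_correct(slice_predictions, true_class):
    """A patient counts as correctly classified when more than 50%
    of the slice-level predictions match the patient's single true
    class; otherwise the patient is counted as misclassified."""
    if not slice_predictions:
        raise ValueError("patient has no slices")
    hits = sum(1 for p in slice_predictions if p == true_class)
    return hits / len(slice_predictions) > 0.5

# Hypothetical COVID-19 patient with 8 CT slices, 6 predicted correctly
preds = ["covid"] * 6 + ["pneumonia", "normal"]
assert patient_correct(preds, "covid")        # 6/8 = 75% > 50%
assert not patient_correct(preds, "pneumonia")
```

Per-class accuracy is then the fraction of correctly classified patients in each class, and the average accuracy reported is the mean over the three classes.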
3.2.3. Two-Class Classification on COVID-CT
For this experiment, we proceeded in two steps: initially, we used the entire COVID-CT dataset as a test set for the top four models obtained from the
Section 3.2.1 experiments. Subsequently, we decided to apply a fine-tuning strategy to the same models. In particular, we chose VGG19, given its results in both previous experiments; MobileNetV2, one of the shallowest networks, with good results in classifying patients of the normal and pneumonia classes; and, finally, VGG16 and ResNet18, the two networks with the best results after VGG19. The dataset was then divided into training, validation, and testing sets, according to the percentages provided by the authors.
Table 4 shows the results on the whole dataset, while
Table 5 shows the results after the fine-tuning strategy.
5. Conclusions
The objective of this work was to propose a classification methodology for the diagnosis of COVID-19 through deep learning techniques applied on CT images. To achieve this goal, an extensive comparative study of the main existing CNN architectures was carried out.
The tests carried out on the two datasets showed very different results. Those obtained with the COVIDx CT-2A dataset are excellent for all the models used; in particular, VGG19 stands out for its high specificity, precision, and recall values, which no other network achieved. Nevertheless, networks such as VGG16 and ResNet18 also achieved more than satisfactory results. As far as the other networks are concerned, GoogLeNet and ResNet50 seem the least suitable, as they consistently deviated considerably from the average values obtained. In addition, the results obtained with VGG19 are comparable with those of the current state-of-the-art networks working on COVIDx CT-2A.
The patient-oriented classification also produced outstanding results, with high accuracy values for the COVID-19 class and, in some cases, 100% accuracy for the pneumonia class. The best network remains, in any case, VGG19, having the highest average accuracy and, therefore, misclassifying fewer patients than the other networks. The analysis of the misclassified patients suggests that an ad hoc network, built on top of the existing CNNs, is probably necessary to improve the results.
On the COVID-CT dataset, however, the results do not match the previous ones; on the contrary, there was a drop in performance of almost 50%. Only fine-tuning was able to remedy this, increasing the obtained values by 20%; nevertheless, this does not compensate for the difference in performance. The problem could be mainly due to the quality of the images in the COVID-CT dataset, which are often compromised or of very poor quality.
This work highlighted some limitations. First of all, the cross-dataset experiments showed that existing CNNs, even after a fine-tuning procedure, suffer considerably in limited-dataset scenarios. Second, the patient-oriented experiments showed that some networks misclassified some COVID-19 patients as ordinary pneumonia cases, while others did not. This clearly motivates further investigation of the models and, possibly, modifications to them. Third, the absence of defined standards for the acquisition of these images and the difficulty of building reliable COVID-19 datasets from heterogeneous sources, especially during the early months of the pandemic [7], can be considered both a limitation and a future direction, as it clearly appears that the distinctive COVID-19 features need to be further studied.
The indications emerging from this work are that:
- (i)
In addition to fine-tuning, some preprocessing steps oriented to the enhancement of CT images could be helpful for the networks to produce more discriminative features; and
- (ii)
Considering the results of the patient-oriented experiments, a hybrid approach, even involving ad hoc handcrafted features, could improve the results.
As future directions, we aim to discover other valuable features from CT images to recognise COVID-19, extending the investigation to handcrafted features and even combining them with deep features. In addition, we also want to consider assessing the severity of COVID-19.
We will conduct further experiments to identify key features in CT images and facilitate screening by medical doctors. We want to stress again that this work is still at the stage of theoretical research, and the models have not been validated in real clinical routines. Our contribution is to offer a baseline on public benchmark datasets to be extended with new investigations.
Therefore, we would like to:
- 1.
Modify VGG19 to investigate the best accuracy density (accuracy divided by the number of parameters) and the best inference time;
- 2.
Optimise the hyperparameters, for example, with Bayesian optimisation;
- 3.
Use class activation maps (CAMs) to understand which parts of the image are relevant in the misclassification cases produced by VGG19 but not by the other networks;
- 4.
Test our system in the clinical routine and communicate with doctors to understand how such a system can be integrated into the clinical routine.
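As a simple illustration of the accuracy-density criterion in point 1 (plain Python; the accuracy values are hypothetical, and the parameter counts are rough public figures of roughly 144M for VGG19 and 3.5M for MobileNetV2):

```python
def accuracy_density(accuracy, num_parameters):
    """Accuracy density as defined in point 1: accuracy divided by
    the number of parameters, so lighter networks are rewarded."""
    return accuracy / num_parameters

# Hypothetical accuracies; parameter counts are approximate
vgg19 = accuracy_density(0.98, 144_000_000)
mobilenet = accuracy_density(0.95, 3_500_000)
# A lighter network can win on density despite lower raw accuracy
assert mobilenet > vgg19
```

This is why a slimmed-down VGG19 variant is of interest: even a modest reduction in parameters can substantially improve the density and the inference time.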