2.1 Data
The data used for training and system testing were provided by the Respiratory Physiopathology Department at the Piero Palagi Hospital in Florence. The database comprises 528 anonymized records on 414 patients. There are fewer patients than records because some records correspond to follow-up examinations of the same patient performed on different dates. Each record contains the measurements of the physiological parameters acquired through the Respiratory Function Tests and the Diffusing Capacity of the Lung for Carbon Monoxide (DLCO) tests. The physiological parameters measured for each patient were: Forced Expired Volume in 1 s (FEV1), Forced Vital Capacity (FVC), Slow Vital Capacity (SVC), FEV1/FVC ratio, FEV1/SVC ratio, Forced Expired Flow at 25–75% (FEF 25–75), Peak Expiratory Flow (PEF), Vital Capacity (VC), Total Lung Capacity (TLC), Residual Volume (RV), Functional Residual Capacity (FRC), Expiratory Reserve Volume, DLCO, Alveolar Volume (VA) and DLCO/VA. All these parameters were measured before and after bronchodilation. Other data collected included each patient’s age, height, body weight and sex.
Based on the results of the respiratory function and DLCO tests, five expert pulmonary disease specialists assessed the extent of each patient’s respiratory deficits and whether or not their DLCO was reduced. The respiratory deficits were classified as mild, moderate or severe, and the patients’ DLCO as normal or reduced. During the medical examinations, the patients were also asked whether they had experienced any exacerbations, been admitted to hospital, or needed emergency care due to respiratory problems in the months prior to the examination. Based on this information, the risk of exacerbation was assessed and classified as low or high.
2.2 Predictive models
The data discussed above were processed and analysed using IBM SPSS Modeler 18.1 software [21]. This software is a powerful data-mining workbench that enables predictive models to be constructed without programming. The objective was to create a predictive model, based on the input parameters listed above, that would enable the classification of the three output parameters of interest: extent of respiratory deficit, DLCO and risk of exacerbation.
The first step of this phase was to try to replicate the results and performance of similar systems already reported in the literature that used the same input parameters. The two algorithms that achieved the best performance in terms of accuracy, sensitivity and specificity were the Neural Network and the Support Vector Machine (SVM) [22, 23]. Both are supervised learning algorithms: starting from a training dataset, they infer a mathematical function that links the system’s inputs and outputs. The training dataset comprises a series of examples, each consisting of a known input-output pair. When a new input is subsequently provided, differing from the inputs in the training dataset, the system is able to predict the corresponding output, which was not known beforehand. The system can therefore generalize the input-output relationships learned from the training dataset and provide outputs for new inputs not considered previously.
The Neural Network algorithm builds a network of interconnected nodes organized into layers, which behaves like a network of neurons in the brain. Like a biological neural network, an artificial neural network learns to perform specific tasks through examples it is given. Each node represents an artificial neuron. These are connected by synapses that can transmit a signal from one neuron to another. Each neuron has a state, generally represented by a real number between 0 and 1. Neurons and synapses can also be given a weighting, which varies as training progresses and depending on the examples provided. This weighting will either increase or decrease the strength of the signal being transmitted. The neurons can also have an activation threshold. Based on this parameter, the signal will be transmitted only if its intensity exceeds the threshold level.
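As an illustration of the mechanism described above, the behaviour of a single artificial neuron with weighted inputs and an activation threshold can be sketched in plain Python; the weights and threshold below are arbitrary illustrative values, not taken from any trained model:

```python
def neuron_output(inputs, weights, threshold):
    """Compute the weighted sum of the incoming signals; the neuron
    'fires' (returns 1.0) only if the signal exceeds the threshold."""
    signal = sum(x * w for x, w in zip(inputs, weights))
    return 1.0 if signal > threshold else 0.0

# Illustrative values: two inputs with hypothetical weights.
print(neuron_output([0.8, 0.3], [0.5, 0.9], threshold=0.5))  # 0.67 > 0.5, fires
print(neuron_output([0.1, 0.2], [0.5, 0.9], threshold=0.5))  # 0.23 < 0.5, silent
```

During training, the weights would be adjusted after each example, strengthening or weakening the transmitted signal as described above.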
SVM is a supervised automatic learning technique that treats examples as points in space. Inputs with different outputs are represented by points in space belonging to different regions divided by a clear gap, which is as large as possible. New inputs, which are not part of the training set, are mapped to one of the different regions of the space, based on how the system was trained.
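The idea of mapping new inputs to one of the regions separated by a learned boundary can be sketched as follows; the separating-hyperplane coefficients here are hypothetical stand-ins for what SVM training would actually produce:

```python
def svm_region(point, w, b):
    """Assign a point to one of the two regions separated by the
    learned hyperplane w·x + b = 0."""
    score = sum(wi * xi for wi, xi in zip(w, point)) + b
    return "class A" if score >= 0 else "class B"

# Hypothetical trained boundary: x + y - 1 = 0.
w, b = [1.0, 1.0], -1.0
print(svm_region([2.0, 2.0], w, b))  # positive side of the boundary
print(svm_region([0.0, 0.0], w, b))  # negative side of the boundary
```

Training an SVM amounts to choosing `w` and `b` so that the gap between the two classes of training points is as large as possible.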
Two predictive models were therefore created, the first based on a neural network and the second on a Support Vector Machine. Both were used to classify the output “extent of respiratory deficit”, and the performance of each model was calculated in terms of accuracy, sensitivity and specificity. The performance obtained was compared with that of systems already in the literature, to verify that the available data were consistent with the proposed objective.
The next step was to develop a new system based on a predictive model that outperformed the two models developed previously. The “Automatic Classifier” function of the IBM SPSS Modeler was used to select the most suitable classification algorithm for the data available. This enabled a rapid comparison of a wide variety of classification algorithms, including CART, Random Forest, QUEST, CHAID, Bayesian Network, Logistic Regression, C5.0, KNN and others. Based on this comparison, the most suitable algorithm to build the predictive model was C5.0, for the data available.
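The comparison performed by the “Automatic Classifier” function can be outlined as follows: train every candidate algorithm on the same training set and rank the candidates by test-set accuracy. The two toy classifiers below (a majority-class baseline and a 1-nearest-neighbour rule) are merely stand-ins for the real candidate algorithms listed above:

```python
def majority_classifier(train):
    # Always predict the most frequent label in the training set.
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

def one_nn_classifier(train):
    # Predict the label of the closest training example.
    def predict(x):
        nearest = min(train, key=lambda ex: sum((a - b) ** 2
                                                for a, b in zip(ex[0], x)))
        return nearest[1]
    return predict

def accuracy(predict, test):
    return sum(predict(x) == y for x, y in test) / len(test)

def best_model(train, test, builders):
    """Train every candidate and keep the name of the most accurate one."""
    scored = [(accuracy(build(train), test), name)
              for name, build in builders.items()]
    return max(scored)[1]

# Illustrative data: one numeric feature, two classes.
train = [([0.0], "low"), ([0.1], "low"), ([1.0], "high"), ([1.1], "high")]
test = [([0.05], "low"), ([1.05], "high")]
builders = {"majority": majority_classifier, "1-NN": one_nn_classifier}
print(best_model(train, test, builders))  # 1-NN separates the data; majority cannot
```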
The C5.0 algorithm is a supervised automatic learning technique that builds a decision tree for classifying an output parameter starting from a series of examples. Given a set of examples S and an output parameter (representing the category to which the various examples belong), the algorithm creates the decision tree by performing the following operations:
1) If all the examples in S have the same output value, i.e. they all belong to the same class, or if S is small (an internal parameter defines how small S can be), the tree is a single leaf labeled with the most frequent value of the output in S.
2) Otherwise, a test is defined on a single input parameter. The test can have two or more outcomes and becomes the root of the tree, with a different branch for each possible outcome; S is then divided into as many subsets as there are outcomes of the test.
3) The preceding steps are re-applied recursively to each subset of S.
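The recursive procedure above can be sketched in simplified form. The real C5.0 selects the test attribute by information gain and handles numeric thresholds, whereas this sketch simply splits on the first available categorical attribute; the feature values and labels below are illustrative, not taken from the actual dataset:

```python
def most_frequent(labels):
    return max(set(labels), key=labels.count)

def build_tree(examples, attributes, min_size=2):
    """examples: list of (features_dict, label) pairs.
    Returns either a leaf label or (attribute, {value: subtree, ...})."""
    labels = [y for _, y in examples]
    # Step 1: all examples in one class, or S too small -> a single leaf
    # labeled with the most frequent output value in S.
    if len(set(labels)) == 1 or len(examples) <= min_size or not attributes:
        return most_frequent(labels)
    # Step 2: a test on one input parameter becomes the root, with one
    # branch per observed value (C5.0 would pick the attribute by
    # information gain; here we just take the first one).
    attr = attributes[0]
    branches = {}
    for value in {x[attr] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[attr] == value]
        # Step 3: recurse on each subset of S.
        branches[value] = build_tree(subset, attributes[1:], min_size)
    return (attr, branches)

# Hypothetical discretized examples.
examples = [
    ({"FEV1": "low", "DLCO": "reduced"}, "severe"),
    ({"FEV1": "low", "DLCO": "normal"}, "moderate"),
    ({"FEV1": "high", "DLCO": "normal"}, "mild"),
    ({"FEV1": "high", "DLCO": "normal"}, "mild"),
]
tree = build_tree(examples, ["FEV1", "DLCO"], min_size=1)
print(tree)
```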
To create all the models considered, the operations described below were performed.
1) PARTITION OF THE ENTIRE AVAILABLE DATASET INTO A TRAINING SET AND A TEST SET
This operation is necessary to train and test the predictive model. To identify the optimal percentage of records to assign to each set, 19 different combinations of percentages were assessed. Since the partition for a given combination of percentages is random, five different partitions were executed for each combination, and for each partition the model’s accuracy in classifying the output in question on the test set was assessed. The average and variance of accuracy over the five partitions were then calculated. The optimal combination of percentages was taken to be the one with the highest average accuracy and the lowest variance of accuracy; a low variance ensures that the calculated classification accuracy depends minimally, if at all, on how the partition is executed. The entire dataset was then partitioned into training and test sets according to the optimal percentages identified.
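The split-selection procedure can be sketched as follows. Here `evaluate_accuracy` stands in for training the real model and scoring it on the test set, the candidate fractions are illustrative, and ties in average accuracy are broken by the lower variance (one way of reading the combined criterion above):

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def pick_best_split(records, train_fractions, evaluate_accuracy,
                    runs=5, seed=0):
    """For each candidate training fraction, execute `runs` random
    partitions, score each, and keep the fraction with the highest
    average accuracy (ties broken by the lowest variance)."""
    rng = random.Random(seed)
    results = []
    for frac in train_fractions:
        accs = []
        for _ in range(runs):
            shuffled = records[:]
            rng.shuffle(shuffled)
            cut = int(len(shuffled) * frac)
            train, test = shuffled[:cut], shuffled[cut:]
            accs.append(evaluate_accuracy(train, test))
        results.append((mean(accs), -variance(accs), frac))
    return max(results)[2]

# Stub evaluation: accuracy grows with training-set size (illustrative only).
records = list(range(100))
best = pick_best_split(records, [0.5, 0.6, 0.7, 0.8, 0.9],
                       lambda tr, te: len(tr) / len(records))
print(best)  # the stub favours the largest training fraction
```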
2) TRAINING THE PREDICTIVE MODEL AND CALCULATION OF PERFORMANCE
Once the dataset was optimally divided into training and test sets, the internal parameters of each algorithm were optimized to obtain the best possible performance in terms of accuracy, sensitivity and specificity. The model was then trained, using only the training set. Once training was completed, the model’s sensitivity, specificity and accuracy in classifying the output in question were calculated, using only the test set. Since the partitions into training and test sets are random, this last step was repeated ten times, and the sensitivity, specificity and accuracy of the model were calculated for each partition. The model’s final performance was calculated as the average of the performance over the ten partitions, limiting the dependence of the performance on how the random partitioning into training and test sets was executed.
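The per-partition metrics and their averaging can be sketched as follows, given predicted and actual binary labels for each test partition; the label names are illustrative (here "reduced" is treated as the positive class):

```python
def confusion_metrics(pairs, positive="reduced"):
    """pairs: list of (predicted, actual) labels for one test set.
    Returns (sensitivity, specificity, accuracy)."""
    tp = sum(p == positive and a == positive for p, a in pairs)
    tn = sum(p != positive and a != positive for p, a in pairs)
    fp = sum(p == positive and a != positive for p, a in pairs)
    fn = sum(p != positive and a == positive for p, a in pairs)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / len(pairs)
    return sensitivity, specificity, accuracy

def average_performance(partitions):
    """Average each metric over the (here, ten) random partitions."""
    metrics = [confusion_metrics(p) for p in partitions]
    return tuple(sum(m[i] for m in metrics) / len(metrics) for i in range(3))

# One illustrative partition: 3 of 4 predictions correct.
partition = [("reduced", "reduced"), ("normal", "normal"),
             ("reduced", "normal"), ("normal", "normal")]
print(confusion_metrics(partition))  # sensitivity 1.0, specificity 2/3, accuracy 0.75
```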