Introduction
- The solution is well suited to both binary-class and multi-class imbalance problems.
- The solution is based on algorithmic modification rather than data resampling at the processing phase.
- A new kernel selection function is proposed.
Related work
Author | Approach | Objective | Algorithm | Result | Scope |
---|---|---|---|---|---|
Liu et al. [37] | Data level approach | Proposed two under-sampling methods, BalanceCascade and EasyEnsemble | EasyEnsemble and BalanceCascade | Deals with data imbalance | Used for data-level approach and for resolving data imbalance issues |
Wang et al. [38] | Data level approach | An adaptive over-sampling approach has been proposed | Data density approach | Deals with data imbalance | Used for resolving data imbalance issue |
Geo et al. [39] | Data level approach | Binary class over-sampling has been proposed | Using probabilistic methods | Deals with data imbalance | Used for resolving data imbalance issue |
Batuwita and Palade [44] | Algorithmic level | Addressed data imbalance in the presence of noise | Fuzzy based SVM | Removing data imbalance | Classifier optimization |
Cano et al. [45] | Algorithmic level | Proposed a classifier for imbalanced data | Gravitation weight-based | Removing data imbalance | Classifier optimization |
| | Algorithmic level | Proposed boundary-based class boundary alignments | Improved SVM | Removing data imbalance | Classifier optimization |
Oh et al. [48] | Algorithmic level | Proposed an active sample selection technique for the data imbalance problem | Active sample selection | Resolves the data imbalance problem by improving performance | Increase the accuracy of the classifier |
Liu et al. [49] | Algorithmic level | Proposed a sample selection technique | SVM | Increased the performance of the classifier | Increase the accuracy of the classifier |
Fu and Lee [51] | Algorithmic level | Proposed a certainty-based active learning algorithm | Machine learning | Resolves the data imbalance and increases performance | Active learning approach |
Materials and methods
Data
Variable | Variable Abbreviation | Nature of data | Measuring unit | Period of data collection | Variable type | Data source |
---|---|---|---|---|---|---|
Particulate Matter10 | PM10 | Real-Time | ug/m3 | 01 Jan 2019 to 01 Oct 2020 | Pollutant | CPCB |
Sulfur Dioxide | SO2 | Real-Time | ug/m3 | 01 Jan 2019 to 01 Oct 2020 | Pollutant | CPCB |
Particulate Matter2.5 | PM2.5 | Real-Time | ug/m3 | 01 Jan 2019 to 01 Oct 2020 | Pollutant | CPCB |
Ozone | O3 | Real-Time | ug/m3 | 01 Jan 2019 to 01 Oct 2020 | Pollutant | CPCB |
Nitrogen Oxide | NOx | Real-Time | ppb | 01 Jan 2019 to 01 Oct 2020 | Pollutant | CPCB |
Nitrogen Dioxide | NO2 | Real-Time | ug/m3 | 01 Jan 2019 to 01 Oct 2020 | Pollutant | CPCB |
Nitrogen Monoxide | NO | Real-Time | ug/m3 | 01 Jan 2019 to 01 Oct 2020 | Pollutant | CPCB |
Ammonia | NH3 | Real-Time | ug/m3 | 01 Jan 2019 to 01 Oct 2020 | Pollutant | CPCB |
Carbon Monoxide | CO | Real-Time | ug/m3 | 01 Jan 2019 to 01 Oct 2020 | Pollutant | CPCB |
Air Quality Index | AQI | Real-Time | ug/m3 | 01 Jan 2019 to 01 Oct 2020 | Pollution Level | CPCB |
Variable | Mean | Unit | Std. Dev | Prescribed range (min) | Prescribed range (max) | Actual range (min) | Actual range (max) |
---|---|---|---|---|---|---|---|
PM10 | 208.869 | ug/m3 | 154.392 | 0.00 | 100 | 0.14 | 1000 |
SO2 | 106.398 | ug/m3 | 99.803 | 0.00 | 80 | 0.7 | 989.58 |
PM2.5 | 30.339 | ug/m3 | 55.716 | 0.00 | 60 | 0.01 | 499.1 |
O3 | 51.994 | ug/m3 | 60.044 | 0.00 | 18 | 0.01 | 500 |
NOx | 43.873 | ppb | 33.533 | 0.00 | 200 | 0.01 | 485.85 |
NO2 | 35.515 | ug/m3 | 20.61 | 0.00 | 200 | 0.01 | 494.11 |
NO | 14.821 | ug/m3 | 11.381 | 0.00 | 200 | 0.01 | 194.9 |
NH3 | 1.362 | ug/m3 | 1.082 | 0.00 | 200 | 0.01 | 40.25 |
CO | 41.407 | ug/m3 | 59.011 | 0.00 | 4 | 0.01 | 997 |
AQI | 217.321 | ug/m3 | 152.63 | 0.00 | 100 | 8.85 | 1000 |
Dataset | CPCB (Central Pollution Control Board India) |
---|---|
Number of samples | 270,596 |
Number of Attributes | 10 |
Number of Classes | 6 |
Samples in each class | |
Class 1 | 13,452 |
Class 2 | 47,910 |
Class 3 | 93,167 |
Class 4 | 55,045 |
Class 5 | 30,421 |
Class 6 | 30,601 |
Imbalance ratio | 6.92 |
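The ratio in the last row of the dataset table can be reproduced from the per-class sample counts. The sketch below assumes the ratio is the size of the largest class divided by the size of the smallest class; that definition is an assumption, chosen because it agrees with the reported value, and the names `class_counts` and `imbalance_ratio` are illustrative only.

```python
# Per-class sample counts as listed in the dataset table.
class_counts = {
    "Class 1": 13_452,
    "Class 2": 47_910,
    "Class 3": 93_167,
    "Class 4": 55_045,
    "Class 5": 30_421,
    "Class 6": 30_601,
}


def imbalance_ratio(counts):
    """Largest class size divided by smallest class size (assumed definition)."""
    sizes = list(counts.values())
    return max(sizes) / min(sizes)


total = sum(class_counts.values())
ratio = imbalance_ratio(class_counts)
print(f"total samples: {total}")
print(f"imbalance ratio: {ratio:.4f}")
```

The class counts sum to the 270,596 samples listed in the table, and the computed ratio is about 6.926, which agrees with the tabulated 6.92 if that value was truncated rather than rounded.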
Proposed methodology
Basic support vector machine algorithm (SVM)
Kernel function selection
Chi-square test
Computing the weighting factor
Computing the parameter \({\boldsymbol{z}}_{\boldsymbol{j}}\)
Description of the proposed algorithm
Statistical analysis
Accuracy
Precision
Recall
F1-score
True negative rate (TNR)
Negative predictive value (NPV)
False negative rate (FNR)
False positive rate (FPR)
False discovery rate (FDR)
False omission rate (FOR)
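The ten measures listed above have standard definitions in terms of the confusion-matrix counts: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). The sketch below states those textbook definitions for a single class. For the six-class AQI problem they would be applied per class in a one-vs-rest fashion and then averaged; the averaging scheme is not stated in this section, so it is left out here. The function name `binary_metrics` and the example counts are hypothetical.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix metrics for one class.

    Assumes every denominator is non-zero; no guarding is done here.
    """
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),          # positive predictive value
        "recall": tp / (tp + fn),             # true positive rate
        "f1": 2 * tp / (2 * tp + fp + fn),    # harmonic mean of precision and recall
        "tnr": tn / (tn + fp),                # true negative rate (specificity)
        "npv": tn / (tn + fn),                # negative predictive value
        "fnr": fn / (fn + tp),                # equals 1 - recall
        "fpr": fp / (fp + tn),                # equals 1 - TNR
        "fdr": fp / (fp + tp),                # equals 1 - precision
        "for": fn / (fn + tn),                # equals 1 - NPV
    }


# Hypothetical counts, for illustration only (not taken from the study).
example = binary_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

The complementary pairs (recall and FNR, TNR and FPR, precision and FDR, NPV and FOR) each sum to one for a single class, which is a quick sanity check on any reported set of values.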
Results
Model comparison
Performance evaluation of classification algorithms
Classification results for the real-time sensor-generated air quality index (AQI) dataset
Classifier name | Precision | Recall | F1-Score | TNR | NPV | FNR | FPR | FDR | FOR | Accuracy (%) |
---|---|---|---|---|---|---|---|---|---|---|
AdaBoost classifier | 0.48 | 0.60 | 0.46 | 0.59 | 0.59 | 0.41 | 0.08 | 0.41 | 0.41 | 59.72 |
MLP classifier | 0.96 | 0.97 | 0.96 | 0.95 | 0.95 | 0.03 | 0.01 | 0.03 | 0.05 | 95.71 |
Gaussian NB | 0.81 | 0.81 | 0.81 | 0.80 | 0.80 | 0.19 | 0.03 | 0.19 | 0.2 | 80.87 |
SVM classifier | 0.97 | 0.97 | 0.97 | 0.96 | 0.96 | 0.03 | 0.01 | 0.03 | 0.04 | 96.92 |
Proposed algorithm | 1.00 | 1.00 | 1.00 | 0.99 | 0.99 | 0.002 | 0.001 | 0.002 | 0.01 | 99.66 |
Model used | Accuracy (%) |
---|---|
Existing literature methods | |
Cost-sensitive Boosting [41] | 90.52 |
Cost-sensitive SVM [43] | 95.01 |
Fuzzy based SVM [44] | 97.19 |
Improved SVM [46] | 97.51 |
Improved SVM [49] | 96.90 |
Proposed Model | |
Scalable kernel based SVM | 99.66 |
Data collected from | ADB | MLP | GNB | SVM | Proposed algorithm |
---|---|---|---|---|---|
A1 | 67.00 | 86.40 | 83.13 | 94.81 | 99.67 |
A2 | 53.14 | 74.77 | 82.88 | 94.47 | 99.51 |
A3 | 68.92 | 75.45 | 81.06 | 95.12 | 99.65 |
A4 | 66.10 | 81.39 | 84.80 | 95.07 | 99.67 |
A5 | 84.78 | 90.24 | 82.71 | 94.96 | 99.52 |
A6 | 67.44 | 83.42 | 86.00 | 94.29 | 99.56 |
A7 | 68.54 | 90.26 | 80.92 | 93.27 | 99.59 |
A8 | 67.67 | 90.98 | 81.44 | 95.83 | 99.95 |
A9 | 68.12 | 84.92 | 85.05 | 95.68 | 99.73 |
A10 | 73.94 | 91.50 | 85.13 | 97.53 | 99.86 |
A11 | 60.74 | 86.13 | 83.20 | 95.41 | 99.67 |
A12 | 85.69 | 90.33 | 82.55 | 96.49 | 99.79 |
A13 | 67.83 | 87.26 | 78.62 | 96.50 | 99.81 |
A14 | 97.58 | 65.50 | 82.86 | 95.20 | 99.77 |
A15 | 63.94 | 85.63 | 83.23 | 93.68 | 99.25 |
A16 | 66.93 | 85.48 | 82.03 | 96.55 | 99.41 |
A17 | 63.91 | 77.76 | 81.46 | 95.94 | 99.64 |
A18 | 73.02 | 91.95 | 81.62 | 96.54 | 99.95 |
A19 | 68.22 | 89.95 | 84.67 | 97.40 | 99.45 |
A20 | 70.82 | 87.94 | 80.63 | 94.51 | 100 |
A21 | 69.35 | 85.57 | 84.79 | 95.99 | 99.78 |
A22 | 75.54 | 74.08 | 82.70 | 95.60 | 99.95 |
A23 | 81.24 | 91.02 | 82.18 | 94.81 | 99.81 |
A24 | 73.75 | 90.84 | 84.18 | 94.94 | 99.72 |
A25 | 69.88 | 77.36 | 83.29 | 96.56 | 99.85 |
A26 | 92.20 | 83.16 | 79.97 | 94.83 | 99.53 |
A27 | 66.77 | 79.01 | 82.08 | 96.50 | 99.82 |
A28 | 65.78 | 84.26 | 83.95 | 95.08 | 99.86 |
A29 | 72.48 | 92.01 | 83.52 | 96.87 | 99.87 |
A30 | 65.24 | 87.00 | 84.39 | 94.90 | 99.5 |
A31 | 64.33 | 91.36 | 80.58 | 94.63 | 99.71 |
A32 | 69.91 | 85.12 | 80.55 | 92.09 | 99.67 |
A33 | 88.91 | 92.78 | 83.11 | 96.48 | 99.78 |
A34 | 72.87 | 91.05 | 80.34 | 95.72 | 99.76 |
A35 | 63.84 | 91.09 | 83.50 | 95.12 | 99.91 |
A36 | 82.97 | 85.59 | 85.44 | 96.92 | 99.79 |
A37 | 79.14 | 71.17 | 84.12 | 96.63 | 99.76 |
Total accuracy | 71.85 | 85.13 | 82.78 | 95.48 | 99.72 |
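The "Total accuracy" row appears to be the unweighted mean of the per-station accuracies (A1 to A37). This reading is inferred from the numbers and is not stated in the table itself. The sketch below checks it for the ADB column only; the variable name `adb_accuracy` is illustrative.

```python
from statistics import mean

# ADB accuracy (%) for stations A1..A37, copied from the table above.
adb_accuracy = [
    67.00, 53.14, 68.92, 66.10, 84.78, 67.44, 68.54, 67.67, 68.12, 73.94,
    60.74, 85.69, 67.83, 97.58, 63.94, 66.93, 63.91, 73.02, 68.22, 70.82,
    69.35, 75.54, 81.24, 73.75, 69.88, 92.20, 66.77, 65.78, 72.48, 65.24,
    64.33, 69.91, 88.91, 72.87, 63.84, 82.97, 79.14,
]

# Unweighted mean over the 37 stations (assumed meaning of "Total accuracy").
adb_mean = mean(adb_accuracy)
print(f"stations: {len(adb_accuracy)}")
print(f"mean ADB accuracy: {adb_mean:.2f}")
```

For the ADB column the mean comes to 71.85, matching the tabulated value. The same check can be repeated for the other columns by substituting their values.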
Discussion
Performance evaluation of classification algorithms
Effect on healthcare
Pollutants | | AQI |
---|---|---|
Effect on health | Short term | 1. Serious cardiovascular illness 2. Serious respiratory illness 3. Added strain on the lungs and heart 4. Damage to respiratory system cells |
Effect on health | Long term | 1. Faster aging of the lungs 2. Reduced lung capacity 3. Reduced lung function 4. Bronchitis 5. Asthma 6. Possibly cancer 7. Emphysema 8. Shorter life span |
Severe health problems for | | 1. People with heart disease 2. People with congestive heart failure 3. People with coronary artery syndrome 4. People with asthma 5. People with emphysema 6. People with COPD (chronic obstructive pulmonary disease) 7. Pregnant women 8. Outdoor laborers 9. Elderly people and children under 14 years of age 10. Athletes who exercise vigorously outdoors |