1 Introduction
- We conduct a comparative evaluation of four state-of-the-art Android malware detectors (plus variants) using the same experimental setup to identify the best-performing approach. The studied detectors are: DREBIN, MaMaDroid (two variants of the approach), RevealDroid, and MalScan (six variants of the approach).
- We examine the similarities/differences in the malware detected by state-of-the-art approaches.
- We investigate the impact of merging feature sets from state-of-the-art Android malware approaches on the detection performance.
- We investigate the impact of combining predictions from state-of-the-art malware detectors using 16 combination methods.
- The performance of state-of-the-art Android malware detectors is highly dependent on the experimental dataset. None of the studied approaches has reported the best detection performance on all the evaluation settings.
- Some families of malware are detected very accurately by some state-of-the-art approaches, but almost completely escape detection by other approaches.
- Combining features and predictions from state-of-the-art malware detectors (i.e., using Bagging and Ensemble Selection) is a promising way to leverage the capabilities of the best detectors and maintain a stable detection rate on all the evaluation settings.
2 Study design
2.1 Research questions
-
RQ1: Is there a state-of-the-art malware detector that outperforms all others across all datasets?
-
RQ2: To what extent do state-of-the-art approaches detect similar/different malware?
-
RQ3: Does merging the feature sets from state-of-the-art approaches lead to a high-performing malware detector in all the settings?
-
RQ4: Does combining predictions from state-of-the-art approaches lead to a high-performing malware detector in all the settings?
-
RQ5: Does combining feature sets or predictions from state-of-the-art approaches lead to classifiers that significantly outperform the original detectors?
2.2 Dataset
| | Subsets | Malicious apps | Benign apps | Total |
|---|---|---|---|---|
| Literature dataset | DREBIN | 5363 | 111592 | 116955 |
| | MaMaDroid | 30895 | 7756 | 38651 |
| | RevealDroid | 18924 | 22480 | 41404 |
| | MalScan | 12943 | 14038 | 26981 |
| | Total∗∗ | 43819 | 153616 | 197435 |
| AndroZoo dataset | 2019 | 59256 | 122966 | 182222 |
| | 2020 | 18746 | 64831 | 83577 |
| | Total | 78002 | 187797 | 265799 |
2.3 Experimental setup
- Temporally-consistent: the classifiers are trained on old apps and tested on new apps (i.e., the dataset is split based on the apps' creation dates).
- Temporally-inconsistent: the classification experiment does not take into account the creation time of the apps (i.e., the dataset is shuffled before the split into training, validation, and test sets).
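The two protocols can be sketched as follows. This is an illustrative reconstruction, not the authors' released pipeline; it assumes each app record carries a creation date:

```python
import random

def temporally_consistent_split(apps, train=0.8, valid=0.1):
    """Split by creation date: oldest apps for training, newest for testing."""
    apps = sorted(apps, key=lambda a: a["date"])
    n = len(apps)
    i, j = int(n * train), int(n * (train + valid))
    return apps[:i], apps[i:j], apps[j:]

def temporally_inconsistent_split(apps, train=0.8, valid=0.1, seed=0):
    """Ignore creation dates: shuffle, then split into train/validation/test."""
    apps = list(apps)
    random.Random(seed).shuffle(apps)
    n = len(apps)
    i, j = int(n * train), int(n * (train + valid))
    return apps[:i], apps[i:j], apps[j:]
```

With the temporally-consistent split, every training app is guaranteed to predate every test app, which is exactly the property the shuffled split gives up.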
For the Temporally-inconsistent settings, we repeat each experiment ten times after randomly shuffling and splitting the datasets. As for the Temporally-consistent settings, we also repeat the experiments ten times by randomly selecting 90% of the apps from the training, validation, and test splits (i.e., the training and the test sets contain the oldest and the newest apps, respectively). For both settings, we report the average detection performance.

2.4 Study subjects: literature detectors
A tremendous number of malware detection papers have been published in the literature, but our study focuses on the approaches with the most significant contributions in the field. Thus, our study subjects are selected among papers published in 16 top venues in Software Engineering, Security, and Machine Learning: EMSE, TIFS, TOSEM, TSE, FSE, ASE, ICSE, NDSS, S&P, Usenix Security, CCS, AsiaCCS, SIGKDD, NIPS, ICML, and IJCAI.

To accurately and fairly assess the detection performance of the studied approaches, they need to be reproducible. Specifically, our evaluation results can be attributed to the original approaches only if the reproducibility of these detectors has been verified and confirmed. The reproduction study from which we select our approaches (Daoudi et al. 2021b) considered ten years of Android malware detection papers from major venues; however, only four approaches were successfully reproduced. Our study subjects are the only state-of-the-art malware detectors whose reproducibility has been validated in the literature.
| | Feature set | ML algorithm |
|---|---|---|
| DREBIN | Hardware components, requested permissions, app components, filtered intents, restricted API calls, used permissions, suspicious API calls, and network addresses | LinearSVC |
| MaMaDroid | Markov Chain representation of the abstracted API calls | Random Forest |
| RevealDroid | Android API usage, reflective, and native call features | LinearSVC |
| MalScan | Centrality analysis on the social network representation of the call graph | KNN |
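All four detectors feed app-level features to a standard classifier. As a concrete illustration of the first row, DREBIN embeds each app's extracted feature strings into a binary vector before training a linear classifier; a minimal sketch of that embedding step (the feature names below are invented for illustration):

```python
def build_vocabulary(train_feature_sets):
    """Index every distinct feature string observed in the training apps."""
    vocab = sorted({f for feats in train_feature_sets for f in feats})
    return {f: i for i, f in enumerate(vocab)}

def embed(feats, vocab):
    """Binary embedding: component i is 1 iff the app exhibits feature i.
    Features never seen during training are dropped at test time."""
    vec = [0] * len(vocab)
    for f in feats:
        if f in vocab:
            vec[vocab[f]] = 1
    return vec
```

The resulting vectors would then be passed to a linear classifier such as LinearSVC.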
2.4.1 DREBIN (Arp et al. 2014)
2.4.2 MaMaDroid (Mariconti et al. 2017)
2.4.3 RevealDroid (Garcia et al. 2018)
2.4.4 MalScan (Wu et al. 2019)
3 Study results
We report the results on the whole Literature dataset, the whole AndroZoo dataset, and their subsets.

3.1 RQ1: Is there a state-of-the-art malware detector that outperforms all others across all datasets?
The studied approaches were originally evaluated on the Literature dataset in a temporally-inconsistent experiment. Table 3 describes our experimental settings.
| | Temporally-inconsistent | Temporally-consistent |
|---|---|---|
| Literature dataset | LitTempInconsist | LitTempConsist |
| AndroZoo dataset | AndTempInconsist | AndTempConsist |
With five Literature datasets and three AndroZoo datasets (i.e., the whole datasets and their subsets), the total number of our experimental settings reaches 16. In the remainder of this paper, we use "dataset" and "setting" interchangeably.

| | | Temporally inconsistent | | | | | | | | Temporally consistent | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | Literature dataset | | | | | AndroZoo dataset | | | Literature dataset | | | | | AndroZoo dataset | | |
| | | Wh | Dr | Rv | Mm | Ml | Wh | 19 | 20 | Wh | Dr | Rv | Mm | Ml | Wh | 19 | 20 |
RQ1 | DREBIN | 0.92 | 0.94 | 0.97 | 0.98 | 0.96 | 0.96 | 0.96 | 0.97 | 0.44 | 0.87 | 0.94 | 0.92 | 0.86 | 0.85 | 0.94 | 0.82 |
Reveal | 0.68 | 0.44 | 0.9 | 0.94 | 0.89 | 0.95 | 0.95 | 0.95 | 0.38 | 0.41 | 0.81 | 0.83 | 0.62 | 0.89 | 0.94 | 0.85 | |
MaMaF | 0.48 | 0.31 | 0.9 | 0.94 | 0.82 | 0.95 | 0.95 | 0.97 | 0.19 | 0.32 | 0.88 | 0.89 | 0.64 | 0.92 | 0.98 | 0.86 | |
MaMaP | 0.71 | 0.48 | 0.94 | 0.95 | 0.95 | 0.96 | 0.97 | 0.98 | 0.22 | 0.14 | 0.94 | 0.85 | 0.74 | 0.92 | 0.98 | 0.87 | |
MalD | 0.88 | 0.87 | 0.94 | 0.95 | 0.95 | 0.96 | 0.96 | 0.97 | 0.33 | 0.7 | 0.86 | 0.79 | 0.87 | 0.93 | 0.96 | 0.87 | |
MalH | 0.89 | 0.89 | 0.95 | 0.96 | 0.96 | 0.97 | 0.96 | 0.97 | 0.33 | 0.7 | 0.89 | 0.83 | 0.88 | 0.93 | 0.96 | 0.87 | |
MalK | 0.89 | 0.9 | 0.95 | 0.96 | 0.95 | 0.96 | 0.96 | 0.97 | 0.36 | 0.69 | 0.89 | 0.81 | 0.87 | 0.9 | 0.95 | 0.86 | |
MalCl | 0.89 | 0.88 | 0.95 | 0.96 | 0.96 | 0.97 | 0.96 | 0.97 | 0.4 | 0.77 | 0.89 | 0.84 | 0.88 | 0.93 | 0.96 | 0.87 | |
MalA | 0.89 | 0.89 | 0.95 | 0.96 | 0.96 | 0.96 | 0.95 | 0.97 | 0.34 | 0.71 | 0.89 | 0.83 | 0.87 | 0.91 | 0.96 | 0.87 | |
MalCo | 0.89 | 0.89 | 0.95 | 0.96 | 0.96 | 0.96 | 0.96 | 0.97 | 0.34 | 0.7 | 0.89 | 0.83 | 0.87 | 0.91 | 0.96 | 0.87 | |
RQ3 | LinearSVC | 0.78 | 0.83 | 0.94 | 0.93 | 0.89 | 0.87 | 0.85 | 0.87 | 0.48 | 0.73 | 0.87 | 0.81 | 0.71 | 0.81 | 0.86 | 0.76 |
RF | 0.91 | 0.92 | 0.97 | 0.97 | 0.97 | 0.98 | 0.98 | 0.98 | 0.38 | 0.85 | 0.79 | 0.77 | 0.86 | 0.93 | 0.98 | 0.87 | |
KNN | 0.86 | 0.83 | 0.92 | 0.95 | 0.91 | 0.94 | 0.94 | 0.93 | 0.3 | 0.76 | 0.72 | 0.74 | 0.82 | 0.85 | 0.88 | 0.74 | |
AdaBoost | 0.86 | 0.86 | 0.97 | 0.96 | 0.95 | 0.96 | 0.95 | 0.97 | 0.45 | 0.91 | 0.95 | 0.9 | 0.81 | 0.93 | 0.97 | 0.86 | |
Bagging | 0.94 | 0.96 | 0.98 | 0.97 | 0.97 | 0.98 | 0.98 | 0.99 | 0.49 | 0.85 | 0.94 | 0.86 | 0.85 | 0.94 | 0.98 | 0.87 | |
GradBoosting | 0.9 | 0.92 | 0.98 | 0.97 | 0.97 | 0.97 | 0.97 | 0.98 | 0.48 | 0.82 | 0.96 | 0.9 | 0.88 | 0.93 | 0.98 | 0.87 | |
RQ4 | MajorVote | 0.92 | 0.92 | 0.96 | 0.97 | 0.97 | 0.97 | 0.97 | 0.98 | 0.33 | 0.85 | 0.91 | 0.85 | 0.87 | 0.93 | 0.97 | 0.87 |
AvgProba | 0.91 | 0.92 | 0.96 | 0.96 | 0.97 | 0.97 | 0.97 | 0.97 | 0.32 | 0.83 | 0.91 | 0.85 | 0.89 | 0.94 | 0.97 | 0.87 | |
AccWProba | 0.91 | 0.92 | 0.96 | 0.96 | 0.97 | 0.97 | 0.97 | 0.97 | 0.32 | 0.84 | 0.91 | 0.85 | 0.89 | 0.94 | 0.97 | 0.87 | |
F1WProba | 0.91 | 0.91 | 0.96 | 0.96 | 0.97 | 0.97 | 0.97 | 0.97 | 0.33 | 0.81 | 0.91 | 0.85 | 0.89 | 0.94 | 0.97 | 0.87 | |
MinProba | 0.3 | 0.14 | 0.92 | 0.94 | 0.82 | 0.93 | 0.93 | 0.93 | 0.12 | 0.0 | 0.63 | 0.62 | 0.38 | 0.83 | 0.96 | 0.79 | |
MaxProba | 0.83 | 0.81 | 0.86 | 0.93 | 0.86 | 0.94 | 0.94 | 0.95 | 0.4 | 0.54 | 0.95 | 0.93 | 0.85 | 0.89 | 0.89 | 0.86 | |
ProdProba | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
StaPredSVM | 0.93 | 0.93 | 0.97 | 0.97 | 0.96 | 0.96 | 0.96 | 0.97 | 0.33 | 0.82 | 0.95 | 0.87 | 0.85 | 0.85 | 0.94 | 0.83 | |
StaProbSVM | 0.94 | 0.95 | 0.97 | 0.97 | 0.98 | 0.97 | 0.97 | 0.98 | 0.39 | 0.84 | 0.92 | 0.84 | 0.9 | 0.89 | 0.95 | 0.85 | |
StaPredRF | 0.92 | 0.93 | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 | 0.34 | 0.79 | 0.94 | 0.88 | 0.85 | 0.85 | 0.94 | 0.83 | |
StaProbRF | 0.94 | 0.95 | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 | 0.98 | 0.38 | 0.8 | 0.94 | 0.89 | 0.89 | 0.9 | 0.95 | 0.85 | |
StaPredKNN | 0.92 | 0.92 | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 | 0.39 | 0.79 | 0.92 | 0.87 | 0.86 | 0.86 | 0.94 | 0.82 | |
StaProbKNN | 0.92 | 0.94 | 0.96 | 0.97 | 0.97 | 0.97 | 0.97 | 0.98 | 0.42 | 0.8 | 0.92 | 0.83 | 0.9 | 0.9 | 0.94 | 0.87 | |
StaPredMLP | 0.93 | 0.93 | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 | 0.38 | 0.8 | 0.94 | 0.88 | 0.85 | 0.86 | 0.94 | 0.83 | |
StaProbMLP | 0.94 | 0.95 | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 | 0.98 | 0.39 | 0.81 | 0.95 | 0.85 | 0.89 | 0.87 | 0.95 | 0.85 | |
EnsemSelect | 0.94 | 0.95 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.43 | 0.89 | 0.96 | 0.91 | 0.89 | 0.94 | 0.98 | 0.88 | |
Wh: Whole dataset, Dr: DREBIN dataset, Rv: RevealDroid dataset, Mm: MaMaDroid dataset, Ml: MalScan dataset, 19: 2019 dataset, 20: 2020 dataset
The detection performance generally drops in the temporally-consistent setting. For the other datasets (i.e., the AndroZoo datasets and the Literature subsets), the detection performance in the temporally-consistent experiment is also generally lower than the performance reported in the temporally-inconsistent experiment. The detection performance on the whole LitTempConsist is much lower due to the composition of this dataset. The AndroZoo dataset contains apps that span two years (i.e., apps from 2019 and 2020). As for the whole Literature dataset, it contains Android apps spanning eight years (i.e., apps from 2010 to 2018), which makes this dataset considerably more difficult for all the classifiers.

On the Literature datasets, DREBIN yielded the highest F1 score in nine out of ten experiments. DREBIN's feature set seems to be more suitable for detecting the apps created before and until 2018, as demonstrated by the temporally-inconsistent and the temporally-consistent experiments, respectively. DREBIN reports the best performance not only on the whole Literature dataset but also on its subsets created in the sub-years of 2010-2018. As for the AndroZoo datasets, no approach has reported the highest detection performance in all the experiments. Consequently, no specific feature set from the evaluated state-of-the-art approaches consistently helps to detect the highest number of malware created between 2019 and 2020.

3.2 RQ2: To what extent do state-of-the-art approaches detect similar/different malware?
We analyse the malware families present in the whole Literature dataset and the whole AndroZoo dataset. Overall, 642 and 204 unique malware families are present in the Literature dataset and the AndroZoo dataset, respectively. We examine the per-family detection rates on the whole Literature dataset and the whole AndroZoo dataset in both the Temporally-consistent and Temporally-inconsistent settings. We select four top families from each setting and present them in Table 5. We also report the results for the top 20 families in each setting in Tables 11 and 12 in the Appendix.
| Setting | Family | # | DREBIN | Reveal | MaMaF | MaMaP | MalD | MalH | MalK | MalCl | MalA | MalCo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LitTempInconsist | dowgin | 352 | 98.6 | 68.8 | 63.1 | 86.9 | 93.2 | 94.0 | 94.3 | 94.0 | 94.0 | 94.0 |
| | airpush | 214 | 86.4 | 57.0 | 6.1 | 72.4 | 87.4 | 91.6 | 90.2 | 91.1 | 91.6 | 90.7 |
| | adwo | 195 | 82.6 | 72.8 | 11.8 | 62.6 | 85.6 | 86.2 | 86.7 | 85.6 | 86.2 | 86.2 |
| | youmi | 111 | 82.9 | 66.7 | 38.7 | 51.4 | 82.0 | 82.9 | 85.6 | 82.9 | 82.9 | 82.9 |
| LitTempConsist | jiagu | 408 | 77.2 | 64.0 | 0.0 | 0.2 | 0.7 | 10.3 | 0.7 | 0.7 | 19.9 | 19.9 |
| | dnotua | 303 | 5.0 | 5.0 | 0.3 | 11.2 | 94.4 | 2.6 | 94.1 | 94.1 | 2.6 | 2.6 |
| | smsreg | 136 | 94.1 | 77.9 | 36.0 | 52.2 | 57.4 | 66.9 | 63.2 | 63.2 | 69.1 | 69.1 |
| | secapk | 122 | 40.2 | 92.6 | 4.9 | 4.9 | 41.8 | 43.4 | 43.4 | 43.4 | 43.4 | 43.4 |
| AndTempInconsist | secneo | 69 | 84.1 | 98.6 | 0.0 | 98.6 | 79.7 | 58.0 | 79.7 | 79.7 | 78.3 | 78.3 |
| | ewind | 8 | 100.0 | 62.5 | 75.0 | 100.0 | 75.0 | 75.0 | 87.5 | 75.0 | 75.0 | 75.0 |
| | datacollector | 7 | 100.0 | 85.7 | 0.0 | 85.7 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| | kuguo | 6 | 83.3 | 83.3 | 0.0 | 66.7 | 66.7 | 66.7 | 66.7 | 66.7 | 66.7 | 66.7 |
| AndTempConsist | hiddad | 32 | 0.0 | 3.1 | 0.0 | 3.1 | 15.6 | 21.9 | 15.6 | 21.9 | 21.9 | 21.9 |
| | joker | 11 | 27.3 | 0.0 | 0.0 | 0.0 | 63.6 | 63.6 | 63.6 | 72.7 | 63.6 | 63.6 |
| | emagsoftware | 9 | 66.7 | 33.3 | 0.0 | 0.0 | 11.1 | 11.1 | 11.1 | 11.1 | 11.1 | 11.1 |
| | autoins | 7 | 100.0 | 100.0 | 0.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
Several approaches detect the same proportion of apps from the youmi family in the LitTempInconsist setting. Similarly, RevealDroid and MaMaDroid Package detect the same proportion of apps from the secneo family in the AndTempInconsist setting. Besides, compared to the other techniques, some approaches seem more efficient at detecting specific families. For example, the secapk family is effectively detected by RevealDroid in the LitTempConsist setting. In AndTempConsist, DREBIN is the approach that detects the highest proportion of malware from the emagsoftware family.

3.3 RQ3: Does merging the feature sets from state-of-the-art approaches lead to a high-performing malware detector in all the settings?
- AdaBoost (Freund and Schapire 1997), which fits a series of base classifiers (e.g., Decision Trees) on the dataset such that each classifier focuses more on the incorrect predictions made by the previous classifier. This method assigns higher weights to the incorrectly predicted samples in order to enhance their prediction by the subsequent classifiers.
- Bagging (Breiman 1996), which trains a series of base classifiers on random subsets of the dataset and aggregates their predictions.
- GradientBoosting (Friedman 2001), which fits a series of base classifiers on the dataset in order to improve the prediction performance. Each classifier is trained to minimise the prediction errors of the previous classifier using the gradient descent algorithm.
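To make the Bagging aggregation concrete, here is a self-contained toy sketch (not the study's implementation, which applies standard library classifiers to the merged feature set): each base learner is a one-dimensional decision stump trained on a bootstrap sample, and predictions are aggregated by majority vote.

```python
import random
from collections import Counter

def train_stump(data):
    """Base learner: choose the threshold on a single maliciousness score
    that minimises training errors (predict malware when score > threshold)."""
    best_t, best_err = 0.0, float("inf")
    for t, _ in data:
        err = sum((s > t) != y for s, y in data)
        if err < best_err:
            best_t, best_err = t, err
    return best_t

def bagging_fit(data, n_estimators=25, seed=0):
    """Bagging: train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [train_stump([rng.choice(data) for _ in data]) for _ in range(n_estimators)]

def bagging_predict(model, score):
    """Aggregate the stumps' votes by majority."""
    votes = Counter(score > t for t in model)
    return votes[True] >= votes[False]
```

The bootstrap resampling decorrelates the base learners, which is what makes the aggregated vote more stable than any individual learner.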
Bagging reports the highest F1 score in 11 out of 16 experiments. GradientBoosting and AdaBoost achieve the best detection score in four and one experiments, respectively. In the five experiments where Bagging has not reported the highest detection scores, the difference in F1 score between this detector and the best approaches reaches a maximum of six percentage points. The detection performance of Bagging differs from that of the other classifiers, including those that have outperformed it on five datasets. Overall, our results show that none of the classifiers trained on the merged feature set has reported the highest detection performance on all the datasets.
3.4 RQ4: Does combining predictions from state-of-the-art approaches lead to a high-performing malware detector in all the settings?
- Majority Voting, where an app is considered malware if it is detected by the majority of the classifiers (i.e., in our case, at least 6 out of the 10 classifiers); otherwise it is predicted as benign.
- Average Probability, which represents the average of the probability scores given by the ten classifiers for the malware class. An app is predicted as malware if this average probability is over 0.5.
- Accuracy Weighted Probability, where the probabilities of each classifier are weighted according to its Accuracy metric. An app is predicted as malware if the weighted probability for the malware class is higher than the weighted probability for the benign class.
- F1 Weighted Probability, where the probabilities of each classifier are weighted according to its F1 metric. An app is predicted as malware if the weighted probability for the malware class is higher than the weighted probability for the benign class.
- Min Probability, which represents the minimum of the probability scores given by the ten classifiers. An app is predicted as malware if this minimum probability is over 0.5.
- Max Probability, which represents the maximum of the probability scores given by the ten classifiers. An app is predicted as malware if this maximum probability is over 0.5.
- Product Probability, which represents the product of the probability scores given by the ten classifiers for the malware class. An app is predicted as malware if this product probability is over 0.5.
- Stacking Prediction (Wolpert 1992), where the predictions of each classifier are used to train a binary meta-classifier. We evaluated the Stacking method using four meta-classifiers: SVM, RF, KNN, and a Multi-Layer Perceptron (MLP) with three hidden layers of 32, 64, and 128 neurons, respectively. The final predictions of this method are given by the meta-classifier.
- Stacking Probability, where the prediction probabilities of each classifier are used to train a binary meta-classifier. Similarly, we evaluated Stacking Probability using four meta-classifiers: SVM, RF, KNN, and a Multi-Layer Perceptron (MLP) with the same architecture as in Stacking Prediction.
- Ensemble Selection, where the probabilities of each classifier are weighted according to its overall performance on specific metrics (e.g., F1 score, Recall). Since such performance must be determined beforehand, we use a validation dataset that serves to iteratively infer the weights for each classifier (Caruana et al. 2004).
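Most of the non-trained combination rules above are one-liners. A sketch, where the probabilities are for the malware class and the ten classifiers are simulated by plain lists:

```python
def majority_vote(preds):
    """Malware iff a strict majority of the binary votes are positive
    (with ten classifiers: at least 6 out of 10)."""
    return sum(preds) >= len(preds) // 2 + 1

def average_probability(probs):
    """Mean of the malware-class probabilities, thresholded at 0.5."""
    return sum(probs) / len(probs) > 0.5

def weighted_probability(probs, weights):
    """Weight each classifier's probability by a quality metric (Accuracy or
    F1 in the paper) and compare the malware mass against the benign mass."""
    malware = sum(w * p for w, p in zip(weights, probs))
    benign = sum(w * (1 - p) for w, p in zip(weights, probs))
    return malware > benign

def min_probability(probs):
    return min(probs) > 0.5

def max_probability(probs):
    return max(probs) > 0.5

def product_probability(probs):
    prod = 1.0
    for p in probs:
        prod *= p
    return prod > 0.5
```

Note how Product Probability collapses: ten confident scores of 0.9 multiply to roughly 0.35, below the 0.5 threshold, which is consistent with its all-zero F1 row in Table 4.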
Min Probability sees the most important variation of 84 percentage points (0.12 on the whole LitTempConsist versus 0.96 on the 2019 AndTempConsist). The difficulty of the whole LitTempConsist is also confirmed when combining the predictions, since the highest F1 score reported on that dataset is 0.43. Ensemble Selection is the best technique in 12 experiments. Stacking Probability with SVM achieves the highest F1 score in three experiments. As for Max Probability, it outperformed the others on one dataset. When Ensemble Selection is not the highest-performing classifier, the difference in F1 score between Ensemble Selection and the best method is at most two percentage points. The detection performance of Ensemble Selection is not similar to that of all the evaluated classifiers, including Max Probability and Stacking Probability with SVM, which have outperformed it on four datasets. Our results show that none of the Ensemble Learning classifiers has yielded the highest detection performance on all the datasets.
3.5 RQ5: Does combining feature sets or predictions from state-of-the-art approaches lead to classifiers that significantly outperform the original detectors?
No single approach dominates on the AndroZoo datasets. In Section 3.3, we have assessed the added value of the merged feature set using six classifiers. Our results showed that Bagging achieved the highest F1 score in 11 out of 16 experiments. On the DREBIN LitTempConsist dataset, AdaBoost outperformed Bagging by six percentage points. With the combination-of-predictions experiments, we observed the same pattern: no Ensemble Learning method reported the highest F1 score in all the settings. For example, Ensemble Selection achieved the best detection scores in 12 experiments, but other methods outperformed it in four evaluation experiments. However, the difference in F1 score between Ensemble Selection and the best approaches in these four experiments is at most two percentage points.

Based on these results, for RQ3 we select Bagging as the best classifier trained with the merged feature set. As for RQ4, Ensemble Selection is considered the best method to combine the predictions. We refer to Table 4 to compare the detection performance of these two methods with that of the best state-of-the-art classifiers on each dataset.

Bagging has increased the detection performance in nine experiments. The increase in F1 score is at most two percentage points, except on the whole LitTempConsist where it reaches five percentage points. This classifier has also decreased the F1 score in four experiments, by one, two, six, and one percentage points, respectively. In the remaining three experiments, Bagging has reported the same detection performance as the best state-of-the-art approaches. As for Ensemble Selection, it has increased the detection performance by at most two percentage points in 11 experiments. This method has also reported the same detection performance as the best approaches in three experiments and decreased the F1 score by one percentage point in two experiments.

Overall, neither Bagging nor Ensemble Selection has remarkably increased the detection performance of state-of-the-art malware detectors. While it has enhanced the F1 score by five percentage points on one dataset, Bagging has also decreased the F1 score by six percentage points on one dataset. As for Ensemble Selection, despite improving the F1 score in 11 experiments, this improvement is at most two percentage points. Nevertheless, Ensemble Selection has generally maintained the highest detection performance of state-of-the-art malware detectors independently of the dataset, since it has kept the smallest performance gap with the best classifiers on all the datasets.

We conduct a statistical test to compare the detection performance of the state-of-the-art classifiers with that of Bagging and Ensemble Selection. Since the p-value of the test is 5.42e-179, we conduct the Nemenyi test and report our results in sub-figure (d) of Fig. 1. The test shows that the detection performance of Bagging and Ensemble Selection differs from that of the state-of-the-art classifiers. Moreover, the p-value of the test that compares Bagging and Ensemble Selection is greater than 0.5, which means that we fail to reject the null hypothesis. Our results suggest that there is insufficient evidence to affirm that the detection performance of these two classifiers is different.

Overall, Bagging and Ensemble Selection have generally maintained the highest detection performance of the state-of-the-art approaches independently of the datasets.
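The omnibus test is not named in this excerpt; a common choice preceding a Nemenyi post-hoc comparison is the Friedman test over per-dataset ranks, so the sketch below should be read as an illustration under that assumption. It computes each classifier's average rank across datasets (rank 1 = best F1, ties averaged) and the Friedman chi-square statistic:

```python
def average_ranks(scores_per_dataset):
    """scores_per_dataset: one list of classifier scores per dataset.
    Rank the classifiers inside each dataset (rank 1 = best score, ties get
    the average rank) and average the ranks over all datasets."""
    k = len(scores_per_dataset[0])
    totals = [0.0] * k
    for scores in scores_per_dataset:
        order = sorted(range(k), key=lambda j: -scores[j])
        i = 0
        while i < k:
            j = i
            while j + 1 < k and scores[order[j + 1]] == scores[order[i]]:
                j += 1                      # extend the run of tied scores
            avg = (i + j) / 2 + 1           # average rank over the tied run
            for m in range(i, j + 1):
                totals[order[m]] += avg
            i = j + 1
    n = len(scores_per_dataset)
    return [t / n for t in totals]

def friedman_statistic(scores_per_dataset):
    """Friedman chi-square over N datasets and k classifiers:
    chi2 = 12N / (k(k+1)) * (sum_j R_j^2 - k(k+1)^2 / 4)."""
    n, k = len(scores_per_dataset), len(scores_per_dataset[0])
    r = average_ranks(scores_per_dataset)
    return 12.0 * n / (k * (k + 1)) * (sum(x * x for x in r) - k * (k + 1) ** 2 / 4)
```

A large statistic (hence a tiny p-value, as reported above) justifies the pairwise Nemenyi comparison on these average ranks.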
4 Discussion
Our evaluation includes temporally-consistent settings (in contrast with typical random sampling), in order to assess malware classifiers' ability to cope with emerging malware. Overall, considering all experimental scenarios, the results show that none of the studied approaches stands out across all settings.

4.1 Ensuring high detection performance across datasets

Bagging and Ensemble Selection have reported promising results: the yielded classifier generally achieves, in all scenarios, a detection score that is as good as the best score reported by the individual approaches. Therefore, these combination methods ensure that the highest detection performance is stabilised independently of the dataset.

4.2 Hypothetical reasons behind the failure of ensemble learning to outperform the state of the art

At best, the Bagging and Ensemble Selection methods have increased the highest F1 score reported by the base learners by five and two percentage points, respectively. One could expect that combining strong detectors would improve performance, including when they are evaluated in a temporally-consistent manner. Yet, our experiments show that combining feature sets or predictions from these state-of-the-art classifiers does not lead to the hoped improvement.

| LitTempInconsist | LitTempConsist | AndTempInconsist | AndTempConsist |
---|---|---|---|---|
Best Approach | DREBIN | DREBIN | MalCl | MalD |
DREBIN | – | – | 148 | 625 |
Reveal | 289 | 1127 | 190 | 659 |
MaMaF | 356 | 1388 | 248 | 709 |
MaMaP | 289 | 1262 | 192 | 688 |
MalD | 201 | 512 | 237 | – |
MalH | 200 | 1075 | 234 | 648 |
MalK | 193 | 653 | 227 | 699 |
MalCl | 200 | 548 | – | 679 |
MalA | 200 | 1063 | 225 | 661 |
MalCo | 199 | 1063 | 229 | 658 |
Temporally-consistent experiments are challenging: the detection performance is generally lower than in temporally-inconsistent experiments. Moreover, we have seen in Section 3.5 that Ensemble Selection has decreased the detection performance of the original classifiers in two experiments, which are both temporally-consistent. This confirms the difficulty of the temporally-consistent setting.

The lower detection performance in the temporally-consistent experiments can be explained by the evolution of Android malware and the emergence of new malware families. Indeed, Android malware is evolving fast, and new families can exhibit previously unknown behaviours. In the temporally-consistent experiments, the test dataset is likely to contain malware belonging to families that were unseen in the training dataset. In the temporally-inconsistent experiments, this situation is possible, but less likely due to the randomness of the split. Given that the training is supposed to characterise maliciousness, if the training set is not representative of the different families, the model will not generalise to the samples in the test set, which leads to poor detection performance.

4.3 Threats to validity
5 Related work
5.1 Assessment of existing work
5.2 Ensemble learning for android malware detection
6 Conclusion
The Bagging and Ensemble Selection methods are promising and can generally maintain the best detection scores independently of the dataset. To further facilitate future studies, we make available to the research community the extracted features (for 462k apps) following the approaches of the ten detector variants.