1 Introduction
- We found that while there is a large potential for differential testing, due to different implementations of the same algorithms, it is difficult to harness this potential and find feasible combinations of algorithms that should behave the same. This is due to the diversity of hyperparameters between the implementations.
- We observe many deviations between implementations that should yield similar results. The number of deviations indicates that there is a large amount of noise in the results, which makes it impossible for us to pick up a reliable signal to identify possible bugs. Experts who want to test specific algorithms could possibly use a lenient test oracle that only considers whether the classifications are significantly different, but it is unclear if such an oracle is sufficiently powerful to detect bugs.
2 Terminology and Notations
3 Related Work
4 Case Study
4.1 Subjects
Framework | Version | Language | Description |
---|---|---|---|
Scikit-learn | 1.0.1 | Python | Scikit-learn (Pedregosa et al. 2011) is a popular machine learning library for Python and one of the reasons for the growing popularity of Python as a language for data analysis. |
Weka | 3.8.5 | Java | Weka is a popular machine learning library for Java that had its first official release in 1999. The library is very mature, has been used by researchers and practitioners for decades, and is, e.g., part of the Pentaho business intelligence software. |
Spark MLlib | 3.0.0 | Scala | |
Caret | 6.0-90 | R | Caret (Kuhn M 2018) provides common interfaces for many learning algorithms implemented in R. Through this, Caret provides a harmonized Application Programming Interface (API) that allows using many learning algorithms that are implemented with different APIs as part of different R packages. Thus, Caret greatly reduces the difficulty of trying out different algorithms to solve a problem. |
4.2 Methodology
- UNIFORM: 200 randomly generated instances, with ten features that are uniform in the interval [0,1] and two classes, such that 50% of the data are in each class. The separation between the classes is based on a 10-dimensional rectangle, i.e., a clear separation between the classes is possible.
- RANDOM: 200 randomly generated instances, where the features are uniform within the interval [0,1], with two classes that are randomly assigned. Thus, no reasonable classification is possible.
- BC: The breast cancer data from the UCI repository has 569 instances with 30 numeric features for cancer prediction. The features are normalized to the interval [0,1].
- WINE: The wine data from the UCI repository has 178 instances with 11 numeric features for wine quality prediction. The features are normalized to the interval [0,1].
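The two synthetic data sets can be reproduced along these lines. This is a minimal sketch: the random seed and the concrete 10-dimensional rectangle are our assumptions, as the paper does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed is an assumption; the paper does not state one

# UNIFORM: 200 instances, ten features uniform in [0, 1]; the classes are
# separated by a 10-dimensional rectangle. We assume the simplest such
# rectangle, a split of the first feature at 0.5, giving ~50% per class.
X_uniform = rng.uniform(0.0, 1.0, size=(200, 10))
y_uniform = (X_uniform[:, 0] < 0.5).astype(int)

# RANDOM: uniform features with randomly assigned classes, so no reasonable
# classification is possible.
X_random = rng.uniform(0.0, 1.0, size=(200, 10))
y_random = rng.integers(0, 2, size=200)

# BC and WINE: real data sets whose features are scaled to [0, 1]
# (per-feature min-max normalization).
def minmax(X):
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
```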
4.3 Phase 1: Overlap of Algorithms
- Naive Bayes, Decision Trees, Random Forests, Support Vector Machines, Multilayer Perceptrons, and Logistic Regression can be found in all frameworks;
- a trivial classifier, k-Nearest Neighbor, and Gradient Boosting can be found in at least three of the frameworks; and
- Perceptron, Stochastic Gradient Descent, Gaussian Processes, LDA, QDA, Nearest Centroid, Extra Trees, and Logistic Model Trees are present in two frameworks.

Additionally, there are 47 algorithms that are only implemented by a single framework (see Table 5 in the appendix).
Scikit-learn | Weka | Spark MLlib | Caret |
---|---|---|---|
BernoulliNB, ComplementNB, MultinomialNB, GaussianNB, CategoricalNB | NaiveBayesMultinomial, NaiveBayes | NaiveBayes | awnb, naive_bayes, nb, manb |
DecisionTreeClassifier | SimpleCart, J48, DecisionStump, HoeffdingTree | DecisionTreeClassifier | C5.0Tree, C5.0Rules, rpartCost, C5.0Cost, rpartScore, rpart, rpart1SE, rpart2, C5.0, ctree, ctree2, bartMachine |
RandomForestClassifier | RandomForest | RandomForestClassifier | wsrf, RRF, RRFglobal, rfRules, ordinalRF, ranger, Rborist, rf, ORFlog, ORFpls, ORFridge, ORFsvm, cforest |
LinearSVC, NuSVC, SVC | SMO | LinearSVC | svmSpectrumString, svmRadial, svmRadialCost, svmRadialSigma, svmPoly, svmLinear, svmLinear2, svmExpoString, svmRadialWeights, svmBoundrangeString, svmLinearWeights, lssvmRadial, lssvmPoly, lssvmLinear, svmLinear3, svmLinearWeights2 |
MLPClassifier | MultilayerPerceptron | MultilayerPerceptronClassifier | mlpKerasDecayCost, mlpKerasDecay, mlpKerasDropoutCost, mlpKerasDropout, mlpSGD, mlpML, mlpWeightDecayML, mlp, mlpWeightDecay, monmlp |
LogisticRegression, RidgeClassifier | Logistic, SimpleLogistic | LogisticRegression | regLogistic, plr, polr |
DummyClassifier | ZeroR | | null |
KNeighborsClassifier | IBk | | snn, kknn, knn |
GradientBoostingClassifier | | GBTClassifier | gbm_h2o, gbm |
Perceptron | VotedPerceptron | | |
SGDClassifier | SGD | | |
GaussianProcessClassifier | | | gaussprRadial, gaussprPoly, gaussprLinear |
LinearDiscriminantAnalysis | | | slda, sparseLDA, sda, rrlda, Linda, rlda, rda, PenalizedLDA, pda, pda2, Mlda, loclda, stepLDA, lda, lda2, RFlda, dda |
NearestCentroid | | | pam |
QuadraticDiscriminantAnalysis | | | qda, QdaCov, stepQDA |
ExtraTreeClassifier | | | extraTrees |
| LMT | | LMT |
Caret even provides multiple implementations of the same algorithm itself, e.g., naive_bayes and nb. Consequently, there is even a possibility for differential testing within the Caret package, without using other frameworks.
4.4 Phase 2: Feasible Subset
Group | Scikit-learn | Weka | Spark MLlib | Caret |
---|---|---|---|---|
GNB | GaussianNB | NaiveBayes | NaiveBayes | naive_bayes, nb |
KDENB | | NaiveBayes | | naive_bayes, nb |
MNB | MultinomialNB | NaiveBayesMultinomial | NaiveBayes | |
RF1 | RandomForestClassifier | RandomForest | | ranger, Rborist |
RF2 | RandomForestClassifier | RandomForest | | ranger, Rborist, rf |
LSVM | SVC | SMO | LinearSVC | svmLinear, svmLinear2, svmLinear3 |
PSVM | SVC | SMO | | svmPoly |
RBFSVM | SVC | SMO | | svmRadial |
MLP | MLPClassifier | MultilayerPerceptron | MultilayerPerceptronClassifier | mlp*, mlpSGD* |
DUMMY | DummyClassifier | ZeroR | | null |
KNN | KNeighborsClassifier | IBk | | knn |
LR | LogisticRegression | Logistic | LogisticRegression | regLogistic, plr, polr |
RIDGE | LogisticRegression, RidgeClassifier | Logistic | LogisticRegression | regLogistic, plr |
LASSO | LogisticRegression | | LogisticRegression | regLogistic |
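For the Scikit-learn side of these groups, a differential test run boils down to fitting each implementation and collecting both outputs that are later compared: the predicted classes and the scores. The hyperparameter values below are illustrative placeholders, not the paper's exact configurations; aligning such values across frameworks is precisely the difficulty described above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Illustrative hyperparameter choices (assumptions, not the study's settings).
models = {
    "GNB": GaussianNB(),
    "KNN": KNeighborsClassifier(n_neighbors=5, weights="uniform"),
    "LR": LogisticRegression(C=1e6, max_iter=1000),  # large C ~ (almost) unregularized
    "RIDGE": LogisticRegression(penalty="l2", C=1.0, max_iter=1000),
    "LASSO": LogisticRegression(penalty="l1", C=1.0, solver="liblinear"),
}

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 10))   # RANDOM-style data
y = rng.integers(0, 2, size=200)

results = {}
for name, clf in models.items():
    clf.fit(X, y)
    # Classes and scores are the two outputs compared between frameworks.
    results[name] = (clf.predict(X), clf.predict_proba(X)[:, 1])
```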
4.5 Phase 3: Test Execution
Group | Summary of test results |
---|---|
GNB | Classes are equal or almost equal, with small differences for Weka. Scores are almost always different, except for the Caret-naive_bayes and Caret-nb, and the Spark MLlib and Scikit-learn implementations, which are equal. No differences are significant. |
KDENB | Small but insignificant differences in classes and scores between the Caret-naive_bayes and Caret-nb. Large differences to Weka, that are significant for both classes and scores on about half of the data sets. |
MNB | Classes and scores are equal, with one exception: the Scikit-learn implementation sometimes classifies one instance differently. These differences occur when the score is almost exactly 0.5 and falls slightly below it for one framework and slightly above it for the other, e.g., 0.499 for Scikit-learn and 0.501 for Weka. |
RF1 | Results are never equal, with larger differences on test than on training data. The differences between classes are not significant, except in two cases on the RANDOM data, where the classes are significantly different. The differences between the scores are almost always significant. |
RF2 | Results are never equal, with larger differences on test than on training data. The differences of the classes are not significant, except in two cases on the RANDOM data. The score differences between Weka and Caret-Rborist are only significant once, on the WINE data. The scores of Scikit-learn are significantly different most of the time. |
LSVM | Results are never equal, with mostly small and insignificant differences. The exception is Spark MLlib, which has a large difference to the other implementations that are significant on the RANDOM and WINE data. |
PSVM | Scikit-learn and Weka are equal. Caret yields different results, but these differences are only significant on the UNIFORM data. |
RBFSVM | Scikit-learn and Weka are equal. Caret yields different results and these differences are almost always significant. |
MLP | On the RANDOM and UNIFORM data, the classes are almost equal between all implementations with no significant difference. On the WINE and BC data there are large and significant differences, where Weka disagrees with the other frameworks. The differences between scores are always large and significant. |
DUMMY | The classes are always equal, the scores depend on the implementation of the trivial model: Caret and Weka have the same approach, Scikit-learn always disagrees. |
KNN | The classifications are equal or almost equal between all implementations. The scores of Scikit-learn and Caret are also equal or almost equal, with the exception of the WINE data. Here, Caret is equal to Weka instead. On the other data sets, Weka has a large and significant difference from the other implementations. |
LR | The classes are equal or almost equal. The scores are almost always equal, except on the BC data, where we observe significant differences between all implementations. |
RIDGE | The classes and scores of Weka, Scikit-learn-LogisticRegression, and Caret-plr are equal. The other implementations have small and insignificant differences for the classes and large and significant deviations for the scores. |
LASSO | The classifications of Caret and Scikit-learn are equal or almost equal. Spark MLlib has large differences, but they are only significant on the RANDOM data. The scores between all implementations are different. The differences between Caret and Scikit-learn are mostly not significant, Spark MLlib is always significantly different. |
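The MNB deviations show how a score close to the decision threshold can flip a class label between frameworks: two scores that differ only by numeric noise end up on opposite sides of 0.5. The 0.499/0.501 values are taken from the MNB row above; the thresholding itself is a sketch of the usual decision rule.

```python
# Scores for the same instance from two implementations (MNB example above).
score_sklearn = 0.499
score_weka = 0.501

# Both implementations predict the positive class when the score reaches 0.5,
# so a score difference of only 0.002 flips the predicted class.
cls_sklearn = int(score_sklearn >= 0.5)
cls_weka = int(score_weka >= 0.5)

print(cls_sklearn, cls_weka)  # -> 0 1
```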
- Perhaps unsurprisingly, non-deterministic algorithms like random forests or MLPs can only be compared using statistical significance. This seems to work well for the classifications using the χ2 test. However, our results indicate that the scores should not be compared, as the differences are very often significant.
- Perhaps surprisingly, we often observe different classifications even with deterministic algorithms, to the point where we cannot state which algorithms are likely correct. For example, the small differences for GNB make it impossible, without a detailed triage of all involved algorithms, to infer whether any of the implementations contains a bug that causes the small deviations, or whether this is natural due to implementation choices. Most of the time, these differences are not statistically significant, which further raises the question of whether investigating these issues would be worth the effort.
- Scores often depend on the implementation. Even for a “simple” algorithm like LR, we observe significant differences. Thus, we cannot recommend using differential testing for scores, as there is a very large risk of false positives.
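A lenient oracle for classifications, as suggested above, could compare the class distributions of two implementations with a χ2 test on a contingency table. The sketch below is one plausible construction using scipy; the contingency-table layout and the significance level are our assumptions, not necessarily the paper's exact procedure.

```python
import numpy as np
from scipy.stats import chi2_contingency

def classifications_differ(pred_a, pred_b, alpha=0.05):
    """Lenient oracle: report a deviation only if the class distributions
    of two implementations differ significantly (chi-squared test)."""
    labels = np.union1d(pred_a, pred_b)
    counts = np.array([
        [int(np.sum(pred_a == c)) for c in labels],
        [int(np.sum(pred_b == c)) for c in labels],
    ])
    _, p_value, _, _ = chi2_contingency(counts)
    return bool(p_value < alpha)

# One flipped instance out of 200 is noise, not a significant deviation:
a = np.array([0] * 100 + [1] * 100)
b = np.array([0] * 101 + [1] * 99)
print(classifications_differ(a, b))  # -> False
```

Such an oracle would suppress the harmless near-0.5 flips observed for MNB while still flagging the large disagreements seen for, e.g., MLP on the WINE and BC data.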
5 Discussion
- The definition of the algorithms in the literature may not be sufficiently detailed to guarantee that everything is implemented in exactly the same way. While we believe that this could sometimes be a reason, it cannot explain all differences we observe. For example, LSVM has a clear and well-known mathematical definition that should lead to the same optimal result, except in a few corner cases. Nevertheless, no two implementations yield exactly the same results.
- It is possible that the API documentation is incomplete and there are hidden hyperparameter values that are hard-coded in the implementations but not visible through the documentation. However, in this case, we should see clearer patterns in the deviations, e.g., one algorithm deviating from the others on all data sets. While we sometimes observe this, there are relatively few cases where only one implementation disagrees with all others. Usually, many implementations within one group disagree with each other.
- Optimizations for numerical stability, run time, or memory efficiency could lead to subtle changes in the algorithms, which could explain the deviations between the implementations. The pattern of deviations should be the same as before, i.e., one algorithm disagreeing with the others, which we do not observe very often.
- The different programming languages, including the available libraries, could also have an impact. While, e.g., the different pairs for GNB indicate that this may be such an effect, we also observe many cases indicating that this is not an issue. For example, the different R implementations that Caret offers often disagree with each other as well, even though they are, presumably, based on the same native R features. Hence, while this could be a factor, we believe that it only plays a minor role.