Introduction
Literature review
Background
Hadoop
MapReduce
MapReduce framework
MapReduce workflow
A simple word count example
Hadoop Distributed File System (HDFS)
Research design and methodology
Problems with MapReduce for Iterations
Meta-learning Algorithm
- Class-attribute-combiner: a meta-level training instance includes the training features, the predictions from each of the base classifiers, and the true class of the test instance;
- Binary-class-combiner: for this rule, the meta-level instance consists of all the base classifiers' predictions for all the classes, plus the true class of the test instance;
- Class-combiner: similar to the binary-class-combiner, except that this rule keeps only the predictions for the predicted class from all the base classifiers, plus the true class.
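The three rules can be sketched in Python as follows. This is a minimal illustration of how each rule assembles a meta-level instance; the helper names and argument layout are assumptions for exposition, not the paper's code.

```python
# Minimal sketches of the three meta-level instance-construction rules.
# `features` is the instance's attribute vector, `preds` holds each base
# classifier's predicted label, `probs` holds each base classifier's
# class-probability vector, `true_class` is the instance's label.
# All names are illustrative, not from the paper.

def class_attribute_combiner(features, preds, true_class):
    # training features + every base classifier's prediction + true class
    return list(features) + list(preds) + [true_class]

def binary_class_combiner(probs, true_class):
    # all base classifiers' predictions for all classes + true class
    flat = [p for classifier_probs in probs for p in classifier_probs]
    return flat + [true_class]

def class_combiner(preds, true_class):
    # only the predicted-class outputs of the base classifiers + true class
    return list(preds) + [true_class]
```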
Meta-MapReduce (MMR)
- Generate meta-level training data: the validation data are split across P mappers. Each mapper tests all the base classifiers (line 4). Instead of generating predicted labels, the predicted probabilities \(PD^{m}\) for each class from all the base classifiers are collected and put together (line 6). The meta-level training instances \(MI^{p}\) are then formed from all the base classifiers' prediction distributions PD (as attributes) and the true labels of the validation data \(l(D_{k^{p}}^{p})\) (as labels) (line 8). Next, each map function outputs the \(\langle key, value\rangle\) pairs \(\left \langle l(D_{k^{p}}^{p}),MI^{p}\right \rangle \), where the key is the true labels of the validation data \(l(D_{k^{p}}^{p})\) and the value is the meta-instances \(MI^{p}\) (line 9). At this point the map tasks are finished, and the complete set of meta-instances MI is assembled from the outputs of all the mappers (line 13).
- Train meta-learner: let NC be the total number of classes. For each class nc, new meta-instances \(filter_{nc}(MI)\) are generated by selecting from MI the prediction probabilities corresponding to that class (as attributes) and setting the class label to 1 if the instance belongs to the class, and 0 otherwise. The meta-classifier is then trained on these new instances, and the trained model for each class is denoted \(h_{nc}\) (line 16). Each class thus gets its own meta-classifier, so the number of meta-classifiers equals the number of classes. Finally, all the trained meta-classifiers are ordered by class and saved (line 18).
- Generate meta-level test data: this part mirrors the generation of meta-level training data in the validation process; the only difference is that the input data are the test data. This part is shown in lines 2-14.
- Test \((h_{1},\cdots,h_{NC})\): for each instance \(MI^{*}[i]\) in the meta-instances \(MI^{*}\), the probability \(prob_{nc}\) for each class is obtained by applying the corresponding meta-classifier \(h_{nc}\) to a new meta-instance \(filter_{nc}(MI^{*}[i])\) (line 17). This instance is generated in the same way as \(filter_{nc}(MI)\) in the validation process. The probabilities for all classes are then summed (line 18) and normalized (line 21). The predicted label \(l_{i}\) is the class ID with the highest normalized probability (line 23). Finally, the predicted labels for all the test instances are returned (line 25).
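The per-class scoring and voting in the test step can be sketched as follows. This is a minimal sketch assuming the meta-instance stores each base classifier's class-probability vector contiguously (base-classifier-major layout) and that each trained meta-classifier \(h_{nc}\) is a callable returning a score; both are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

# Sketch of the MMR test step: each per-class meta-classifier scores its
# filtered meta-instance, the per-class scores are normalized, and the
# arg-max class is returned. `meta_classifiers` maps class id -> callable;
# this interface is assumed for illustration.

def mmr_predict(meta_instance, meta_classifiers, n_classes, n_base):
    # filter_nc: keep only the probabilities that the base classifiers
    # assigned to class nc (base-classifier-major layout assumed).
    def filter_nc(nc):
        return [meta_instance[b * n_classes + nc] for b in range(n_base)]

    # score each class with its own meta-classifier h_nc
    probs = np.array([meta_classifiers[nc](filter_nc(nc))
                      for nc in range(n_classes)], dtype=float)
    probs = probs / probs.sum()   # normalize across classes
    return int(np.argmax(probs))  # predicted class id
```

For example, with toy per-class classifiers that simply sum their inputs, `mmr_predict([0.1, 0.7, 0.2, 0.2, 0.6, 0.2], [sum, sum, sum], 3, 2)` selects class 1, the class with the highest total base-classifier probability.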
Results and discussion
Experiment Settings
Data | No. of instances | No. of attributes | Size on disk |
---|---|---|---|
yeast | 892 | 8 | 34 KB |
wineRed | 1599 | 12 | 84 KB |
wineWhite | 4898 | 12 | 263 KB |
pendigits | 7494 | 16 | 360 KB |
spambase | 4601 | 57 | 687 KB |
musk | 6598 | 167 | 4.2 MB |
telescope | 19020 | 11 | 1.4 MB |
kdd | 148517 | 42 | 21.2 MB |
isolet | 7797 | 618 | 30.7 MB |
org | 2059 | 9731 | 38.4 MB |
census | 299285 | 42 | 129 MB |
S1 | 100000 | 400 | 210 MB |
S2 | 200000 | 400 | 420 MB |
Performance Results
- Accuracy Performance: Table 2 shows the error rates of PML as the number of computing nodes varies from 1 to 20. Compared to the single-mapper case, our PML algorithm achieves lower or equivalent error rates on 9 of the 11 datasets. Although the error rates increased on "yeast" and "musk", the behavior of "yeast" is mainly due to it having the smallest data size: when the data are split among more mappers, each mapper receives a smaller partition, which leads to less accurate base models and ultimately higher error rates in the final ensemble model.

Table 2 Error rates for different numbers of computing nodes

Data | 1 node | 5 nodes | 10 nodes | 15 nodes | 20 nodes |
---|---|---|---|---|---|
yeast | 0.3071 | 0.3150 | 0.3520 | 0.3565 | 0.3497 |
wineRed | 0.2351 | 0.2270 | 0.2313 | 0.2332 | 0.2457 |
wineWhite | 0.2276 | 0.2274 | 0.2213 | 0.2274 | 0.2274 |
pendigits | 0.0737 | 0.0693 | 0.0644 | 0.0648 | 0.0625 |
spambase | 0.0680 | 0.0575 | 0.0554 | 0.0575 | 0.0547 |
musk | 0.0180 | 0.0270 | 0.0265 | 0.0359 | 0.0380 |
telescope | 0.1608 | 0.1565 | 0.1534 | 0.1507 | 0.1487 |
kdd | 0.0342 | 0.0360 | 0.0300 | 0.0279 | 0.0290 |
isolet | 0.1530 | 0.1367 | 0.1394 | 0.1450 | 0.1535 |
org | 0.1578 | 0.1457 | 0.1432 | 0.1636 | 0.1690 |
census | 0.0496 | 0.0492 | 0.0491 | 0.0490 | 0.0482 |
- Comparison with AdaBoost.PL: We also compared the error rates of our MMR against the parallelized AdaBoost algorithm AdaBoost.PL [10]. The comparison uses 8 of the datasets in [10], as we could not find the datasets "swsequence" and "biogrid" that they also tested. The results are shown in Table 3. Our PML algorithm achieves the lowest error rates on 7 of the 8 datasets. The reduction in error rates compared to AdaBoost.PL stems from meta-learning being an ensemble scheme that improves the performance of its base learners (here, AdaBoost). The higher error rates on the "yeast" dataset arise because the dataset is too small to build accurate base models with large split numbers (numbers of mappers), the same reason noted in the accuracy performance experiments. As we do not have the source code for AdaBoost.PL, we could not run a statistical significance test on the performance results.

Table 3 Error rates comparison between AdaBoost.PL and MMR

Data | AdaBoost.PL (10 nodes) | MMR (10 nodes) | AdaBoost.PL (20 nodes) | MMR (20 nodes) |
---|---|---|---|---|
yeast | 0.3464 | 0.3520 | 0.3375 | 0.3497 |
wineRed | 0.2464 | 0.2313 | 0.2564 | 0.2457 |
wineWhite | 0.2313 | 0.2213 | 0.2309 | 0.2274 |
pendigits | 0.0699 | 0.0644 | 0.0642 | 0.0625 |
spambase | 0.0576 | 0.0554 | 0.0617 | 0.0547 |
musk | 0.0565 | 0.0265 | 0.0612 | 0.0380 |
telescope | 0.1599 | 0.1534 | 0.1595 | 0.1487 |
isolet | 0.1601 | 0.1394 | 0.1661 | 0.1535 |
- Speedup: to demonstrate the effectiveness of the MapReduce framework of MMR, we tested the speedup [34] performance on the datasets "kdd", "isolet", "org", "census", "S1" and "S2". Speedup is calculated as the ratio of the training time on a single computing node to that on a number of computing nodes processing in parallel (this number is varied over 5, 10, 15 and 20). The detailed speedup results for the different datasets are shown in Table 4 and plotted in Fig. 5. As can be seen from the figure, the larger the dataset, the higher the speedup the program achieves. The reason is that communication cost between computing nodes dominates for small datasets, while computation cost dominates for large datasets.

Table 4 Speedup results for different numbers of computing nodes

Data | 5 nodes | 10 nodes | 15 nodes | 20 nodes |
---|---|---|---|---|
kdd | 5.5277 | 9.6036 | 10.8576 | 11.2606 |
isolet | 2.6697 | 3.4059 | 3.5796 | 3.9684 |
org | 4.0975 | 6.3585 | 7.9139 | 8.7913 |
census | 4.9508 | 8.4430 | 11.8850 | 13.2451 |
S1 | 8.3170 | 14.6584 | 16.0796 | 18.1314 |
S2 | 4.4755 | 12.5927 | 18.0082 | 18.9423 |
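The speedup metric itself is just a ratio of training times; a minimal sketch, with illustrative timings rather than measurements from the paper:

```python
# Speedup as defined above: training time on a single node divided by the
# training time on p nodes. Ideal (linear) speedup on p nodes is p;
# communication overhead keeps real speedup below that, especially on
# small datasets. The timings here are illustrative only.

def speedup(t_single_node, t_p_nodes):
    return t_single_node / t_p_nodes

# e.g., if one node needs 1000 s and 20 nodes need 55 s:
print(round(speedup(1000.0, 55.0), 2))  # prints 18.18
```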