Published in: Education and Information Technologies 5/2020

28.04.2020

RETRACTED ARTICLE: Impact of the learning set’s size

Authors: Adil Korchi, Mohamed Dardor, El Houssine Mabrouk


Abstract

Learning techniques have proven their capacity to handle large amounts of data. Most statistical learning approaches use learning sets of a fixed size and build static models. However, in some situations, such as incremental or active learning, the learning process must work with only a small amount of data. In that case, algorithms capable of producing models from only a few examples become necessary. In the literature, classifiers are generally evaluated according to criteria such as their classification performance and their ability to sort data, but this taxonomy of classifiers can change considerably when one considers their capabilities in the presence of only a few examples. From our point of view, few studies have been carried out on this issue. This paper therefore studies a wide range of learning algorithms and data sets in order to assess the power of each chosen algorithm as it manipulates data. The study also brings out the problem of choosing an algorithm to process small or large amounts of data. To address it, we show that some algorithms are able to generate models with little data; in this case we seek the smallest amount of data that allows the best learning to be achieved. We also show that some algorithms are capable of making good predictions with little data, which is necessary in order to keep the labeling procedure as inexpensive as possible. To make this concrete, we first discuss the learning speed and typology of the tested algorithms, that is, the ability of a classifier to obtain an "interesting" solution to a classification problem using a minimum of training examples, and we review several families of classification models based on parameter learning. We then test all the classifiers mentioned above, both linear and non-linear.
Next, we study the behavior of these algorithms as a function of the learning set's size through an experimental protocol in which various data sets from the classification field are split, manipulated and evaluated, in order to report the results that emerge from the protocol. Finally, we discuss the obtained results in a global analysis section and conclude with recommendations.
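The experimental protocol outlined above, training each classifier on nested subsets of increasing size and measuring performance on a held-out test set, can be sketched roughly as follows. The dataset, the classifier choices, and the subset sizes here are illustrative assumptions for the sketch, not the ones used in the study:

```python
# Hypothetical sketch of a learning-curve protocol: fit several
# classifiers on nested training subsets of increasing size and
# record held-out accuracy for each, one curve per algorithm.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Illustrative dataset; the study uses several benchmark data sets.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Illustrative mix of linear and non-linear classifiers.
classifiers = {
    "logistic": LogisticRegression(max_iter=1000),
    "naive_bayes": GaussianNB(),
    "tree": DecisionTreeClassifier(random_state=0),
}

sizes = [10, 20, 40, 80, len(X_train)]  # nested training-set sizes
curves = {}
for name, clf in classifiers.items():
    accs = []
    for n in sizes:
        clf.fit(X_train[:n], y_train[:n])  # first n shuffled examples
        accs.append(accuracy_score(y_test, clf.predict(X_test)))
    curves[name] = accs
    print(name, [round(a, 3) for a in accs])
```

Comparing the curves shows how quickly each algorithm approaches its best accuracy, and hence the smallest labeled set with which it can still learn well.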


Metadata
Title
RETRACTED ARTICLE: Impact of the learning set's size
Authors
Adil Korchi
Mohamed Dardor
El Houssine Mabrouk
Publication date
28.04.2020
Publisher
Springer US
Published in
Education and Information Technologies / Issue 5/2020
Print ISSN: 1360-2357
Electronic ISSN: 1573-7608
DOI
https://doi.org/10.1007/s10639-020-10165-9
