Skip to main content

2017 | OriginalPaper | Buchkapitel

An Iterated Greedy Algorithm for Improving the Generation of Synthetic Patterns in Imbalanced Learning

verfasst von : Francisco Javier Maestre-García, Carlos García-Martínez, María Pérez-Ortiz, Pedro Antonio Gutiérrez

Erschienen in: Advances in Computational Intelligence

Verlag: Springer International Publishing

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Real-world classification datasets often present a skewed distribution of patterns, where one or more classes are under-represented with respect to the rest. One of the most successful approaches for alleviating this problem is the generation of synthetic minority samples by convex combination of available ones. Within this framework, adaptive synthetic (ADASYN) sampling is a relatively new method which imposes weights on minority examples according to their learning complexity, in such a way that difficult examples are more prone to be over-sampled. This paper proposes an improvement of the ADASYN method, where the learning complexity of these patterns is also used to decide which sample of the neighbourhood is selected. Moreover, to avoid suboptimal results when performing the random convex combination, this paper explores the application of an iterative greedy algorithm which refines the synthetic patterns by repeatedly replacing a part of them. For the experiments, six binary datasets and four over-sampling methods are considered. The results show that the new version of ADASYN leads to more robust results and that the application of the iterative greedy metaheuristic significantly improves the quality of the generated patterns, presenting a positive effect on the final classification model.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literatur
1.
Zurück zum Zitat Alcalá, J., Fernández, A., Luengo, J., Derrac, J., García, S., Sánchez, L., Herrera, F.: Keel data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. J. Multiple-Valued Log. Soft Comput. 17(2–3), 255–287 (2010) Alcalá, J., Fernández, A., Luengo, J., Derrac, J., García, S., Sánchez, L., Herrera, F.: Keel data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. J. Multiple-Valued Log. Soft Comput. 17(2–3), 255–287 (2010)
2.
Zurück zum Zitat Bunkhumpornpat, C., Sinapiromsaran, K., Lursinsap, C.: Safe-level-SMOTE: safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In: Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, T.-B. (eds.) PAKDD 2009. LNCS (LNAI), vol. 5476, pp. 475–482. Springer, Heidelberg (2009). doi:10.1007/978-3-642-01307-2_43 CrossRef Bunkhumpornpat, C., Sinapiromsaran, K., Lursinsap, C.: Safe-level-SMOTE: safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In: Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, T.-B. (eds.) PAKDD 2009. LNCS (LNAI), vol. 5476, pp. 475–482. Springer, Heidelberg (2009). doi:10.​1007/​978-3-642-01307-2_​43 CrossRef
3.
Zurück zum Zitat Chan, P.K., Fan, W., Prodromidis, A.L., Stolfo, S.J.: Distributed data mining in credit card fraud detection. IEEE Intell. Syst. Appl. 14(6), 67–74 (1999)CrossRef Chan, P.K., Fan, W., Prodromidis, A.L., Stolfo, S.J.: Distributed data mining in credit card fraud detection. IEEE Intell. Syst. Appl. 14(6), 67–74 (1999)CrossRef
4.
Zurück zum Zitat Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)MATH Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)MATH
5.
Zurück zum Zitat Cruz, R., Fernandes, K., Cardoso, J.S., Costa, J.F.P.: Tackling class imbalance with ranking. In: 2016 International Joint Conference on Neural Networks (IJCNN), pp. 2182–2187. IEEE (2016) Cruz, R., Fernandes, K., Cardoso, J.S., Costa, J.F.P.: Tackling class imbalance with ranking. In: 2016 International Joint Conference on Neural Networks (IJCNN), pp. 2182–2187. IEEE (2016)
6.
Zurück zum Zitat Domingos, P.: Metacost: a general method for making classifiers cost-sensitive. In: Proceedings of 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 155–164. ACM (1999) Domingos, P.: Metacost: a general method for making classifiers cost-sensitive. In: Proceedings of 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 155–164. ACM (1999)
7.
Zurück zum Zitat Fernández-Caballero, J.C., Martínez-Estudillo, F.J., Hervás-Martínez, C., Gutiérrez, P.A.: Sensitivity versus accuracy in multiclass problems using memetic pareto evolutionary neural networks. IEEE Trans. Neural Netw. 21(5), 750–770 (2010)CrossRef Fernández-Caballero, J.C., Martínez-Estudillo, F.J., Hervás-Martínez, C., Gutiérrez, P.A.: Sensitivity versus accuracy in multiclass problems using memetic pareto evolutionary neural networks. IEEE Trans. Neural Netw. 21(5), 750–770 (2010)CrossRef
8.
Zurück zum Zitat Galar, M., Fernández, A., Barrenechea, E., Bustince, H., Herrera, F.: A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 42(4), 463–484 (2012)CrossRef Galar, M., Fernández, A., Barrenechea, E., Bustince, H., Herrera, F.: A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 42(4), 463–484 (2012)CrossRef
9.
Zurück zum Zitat García-Martínez, C., Lozano, M., Rodriguez, F.J.: Arbitrary function optimization. No free lunch and real-world problems. Soft. Comput. 16(12), 2115–2133 (2012)CrossRef García-Martínez, C., Lozano, M., Rodriguez, F.J.: Arbitrary function optimization. No free lunch and real-world problems. Soft. Comput. 16(12), 2115–2133 (2012)CrossRef
10.
Zurück zum Zitat García-Martínez, C., Rodriguez, F.J., Lozano, M.: Tabu-enhanced iterated greedy algorithm: a case study in the quadratic multiple knapsack problem. Eur. J. Oper. Res. 232, 454–463 (2014)MathSciNetCrossRefMATH García-Martínez, C., Rodriguez, F.J., Lozano, M.: Tabu-enhanced iterated greedy algorithm: a case study in the quadratic multiple knapsack problem. Eur. J. Oper. Res. 232, 454–463 (2014)MathSciNetCrossRefMATH
11.
Zurück zum Zitat Garcia-Pedrajas, N., Pérez-Rodríguez, J., de Haro-García, A.: OligoIS: scalable instance selection for class-imbalanced data sets. IEEE Trans. Cybern. 43(1), 332–346 (2013)CrossRef Garcia-Pedrajas, N., Pérez-Rodríguez, J., de Haro-García, A.: OligoIS: scalable instance selection for class-imbalanced data sets. IEEE Trans. Cybern. 43(1), 332–346 (2013)CrossRef
12.
Zurück zum Zitat Ghazikhani, A., Yazdi, H.S., Monsefi, R.: Class imbalance handling using wrapper-based random oversampling. In: 20th Iranian Conference on Electrical Engineering (ICEE 2012), pp. 611–616. IEEE (2012) Ghazikhani, A., Yazdi, H.S., Monsefi, R.: Class imbalance handling using wrapper-based random oversampling. In: 20th Iranian Conference on Electrical Engineering (ICEE 2012), pp. 611–616. IEEE (2012)
13.
Zurück zum Zitat Han, H., Wang, W.-Y., Mao, B.-H.: Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In: Huang, D.-S., Zhang, X.-P., Huang, G.-B. (eds.) ICIC 2005. LNCS, vol. 3644, pp. 878–887. Springer, Heidelberg (2005). doi:10.1007/11538059_91 CrossRef Han, H., Wang, W.-Y., Mao, B.-H.: Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In: Huang, D.-S., Zhang, X.-P., Huang, G.-B. (eds.) ICIC 2005. LNCS, vol. 3644, pp. 878–887. Springer, Heidelberg (2005). doi:10.​1007/​11538059_​91 CrossRef
14.
Zurück zum Zitat Hanley, J.A., McNeil, B.J.: The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143(1), 29–36 (1982)CrossRef Hanley, J.A., McNeil, B.J.: The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143(1), 29–36 (1982)CrossRef
15.
Zurück zum Zitat He, H., Bai, Y., Garcia, E.A., Li, S.: ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In: International Joint Conference on Neural Networks (IJCNN), pp. 1322–1328 (2008) He, H., Bai, Y., Garcia, E.A., Li, S.: ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In: International Joint Conference on Neural Networks (IJCNN), pp. 1322–1328 (2008)
16.
Zurück zum Zitat He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21(9), 1263–1284 (2009)CrossRef He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21(9), 1263–1284 (2009)CrossRef
17.
Zurück zum Zitat Japkowicz, N., Stephen, S.: The class imbalance problem: a systematic study. Intell. Data Anal. 6(5), 429–449 (2002)MATH Japkowicz, N., Stephen, S.: The class imbalance problem: a systematic study. Intell. Data Anal. 6(5), 429–449 (2002)MATH
18.
Zurück zum Zitat Jo, T., Japkowicz, N.: Class imbalances versus small disjuncts. ACM SIGKDD Explor. Newsl. 6(1), 40–49 (2004)CrossRef Jo, T., Japkowicz, N.: Class imbalances versus small disjuncts. ACM SIGKDD Explor. Newsl. 6(1), 40–49 (2004)CrossRef
19.
Zurück zum Zitat Kubat, M., Matwin, S.: Addressing the curse of imbalanced training sets: one-sided selection. In: Proceedings of 14th International Conference on Machine Learning, pp. 179–186. Morgan Kaufmann (1997) Kubat, M., Matwin, S.: Addressing the curse of imbalanced training sets: one-sided selection. In: Proceedings of 14th International Conference on Machine Learning, pp. 179–186. Morgan Kaufmann (1997)
21.
Zurück zum Zitat Lim, P., Goh, C.K., Tan, K.C.: Evolutionary cluster-based synthetic oversampling ensemble (eco-ensemble) for imbalance learning. IEEE Trans. Cybern. 99, 1–12 (2016)CrossRef Lim, P., Goh, C.K., Tan, K.C.: Evolutionary cluster-based synthetic oversampling ensemble (eco-ensemble) for imbalance learning. IEEE Trans. Cybern. 99, 1–12 (2016)CrossRef
22.
Zurück zum Zitat Liu, X.Y., Wu, J., Zhou, Z.H.: Exploratory undersampling for class-imbalance learning. IEEE Trans. Syst. Man Cybern. Part B 39(2), 539–550 (2009)CrossRef Liu, X.Y., Wu, J., Zhou, Z.H.: Exploratory undersampling for class-imbalance learning. IEEE Trans. Syst. Man Cybern. Part B 39(2), 539–550 (2009)CrossRef
23.
Zurück zum Zitat Luengo, J., Fernández, A., García, S., Herrera, F.: Addressing data complexity for imbalanced data sets: analysis of smote-based oversampling and evolutionary undersampling. Soft. Comput. 15(10), 1909–1936 (2011)CrossRef Luengo, J., Fernández, A., García, S., Herrera, F.: Addressing data complexity for imbalanced data sets: analysis of smote-based oversampling and evolutionary undersampling. Soft. Comput. 15(10), 1909–1936 (2011)CrossRef
24.
Zurück zum Zitat Maciejewski, T., Stefanowski, J.: Local neighbourhood extension of smote for mining imbalanced data. In: 2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM), pp. 104–111. IEEE (2011) Maciejewski, T., Stefanowski, J.: Local neighbourhood extension of smote for mining imbalanced data. In: 2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM), pp. 104–111. IEEE (2011)
25.
Zurück zum Zitat Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12(October), 2825–2830 (2011)MathSciNetMATH Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12(October), 2825–2830 (2011)MathSciNetMATH
26.
Zurück zum Zitat Pérez-Ortiz, M., Gutiérrez, P.A., Tino, P., Hervás-Martínez, C.: Oversampling the minority class in the feature space. IEEE Trans. Neural Netw. Learn. Syst. 27(9), 1947–1961 (2016)MathSciNetCrossRef Pérez-Ortiz, M., Gutiérrez, P.A., Tino, P., Hervás-Martínez, C.: Oversampling the minority class in the feature space. IEEE Trans. Neural Netw. Learn. Syst. 27(9), 1947–1961 (2016)MathSciNetCrossRef
27.
Zurück zum Zitat Ruiz, R., Stützle, T.: A simple and effective iterated greedy algorithm for the permutation flowshop scheduling problem. Eur. J. Oper. Res. 177, 2033–2049 (2007)CrossRefMATH Ruiz, R., Stützle, T.: A simple and effective iterated greedy algorithm for the permutation flowshop scheduling problem. Eur. J. Oper. Res. 177, 2033–2049 (2007)CrossRefMATH
28.
Zurück zum Zitat Seiffert, C., Khoshgoftaar, T.M., Van Hulse, J., Napolitano, A.: Rusboost: a hybrid approach to alleviating class imbalance. IEEE Trans. Syst. Man Cybern.-Part A: Syst. Hum. 40(1), 185–197 (2010)CrossRef Seiffert, C., Khoshgoftaar, T.M., Van Hulse, J., Napolitano, A.: Rusboost: a hybrid approach to alleviating class imbalance. IEEE Trans. Syst. Man Cybern.-Part A: Syst. Hum. 40(1), 185–197 (2010)CrossRef
29.
Zurück zum Zitat Thai-Nghe, N., Gantner, Z., Schmidt-Thieme, L.: Cost-sensitive learning methods for imbalanced data. In: International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE (2010) Thai-Nghe, N., Gantner, Z., Schmidt-Thieme, L.: Cost-sensitive learning methods for imbalanced data. In: International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE (2010)
30.
Zurück zum Zitat Wang, S., Minku, L.L., Yao, X.: Resampling-based ensemble methods for online class imbalance learning. IEEE Trans. Knowl. Data Eng. 27(5), 1356–1368 (2015)CrossRef Wang, S., Minku, L.L., Yao, X.: Resampling-based ensemble methods for online class imbalance learning. IEEE Trans. Knowl. Data Eng. 27(5), 1356–1368 (2015)CrossRef
31.
Zurück zum Zitat Wilcoxon, F.: Individual comparisons by ranking methods. Biom. Bull. 1(6), 80–83 (1945)CrossRef Wilcoxon, F.: Individual comparisons by ranking methods. Biom. Bull. 1(6), 80–83 (1945)CrossRef
32.
Zurück zum Zitat Wong, G.Y., Leung, F.H., Ling, S.H.: A novel evolutionary preprocessing method based on over-sampling and under-sampling for imbalanced datasets. In: Industrial Electronics Society, IECON 2013–39th Annual Conference of the IEEE, pp. 2354–2359. IEEE (2013) Wong, G.Y., Leung, F.H., Ling, S.H.: A novel evolutionary preprocessing method based on over-sampling and under-sampling for imbalanced datasets. In: Industrial Electronics Society, IECON 2013–39th Annual Conference of the IEEE, pp. 2354–2359. IEEE (2013)
Metadaten
Titel
An Iterated Greedy Algorithm for Improving the Generation of Synthetic Patterns in Imbalanced Learning
verfasst von
Francisco Javier Maestre-García
Carlos García-Martínez
María Pérez-Ortiz
Pedro Antonio Gutiérrez
Copyright-Jahr
2017
DOI
https://doi.org/10.1007/978-3-319-59147-6_44