
06-02-2019 | Regular Paper

Instance selection improves geometric mean accuracy: a study on imbalanced data classification

Authors: Ludmila I. Kuncheva, Álvar Arnaiz-González, José-Francisco Díez-Pastor, Iain A. D. Gunn

Published in: Progress in Artificial Intelligence | Issue 2/2019


Abstract

A natural way of handling imbalanced data is to attempt to equalise the class frequencies and train the classifier of choice on balanced data. For two-class imbalanced problems, the classification success is typically measured by the geometric mean (GM) of the true positive and true negative rates. Here we prove that GM can be improved upon by instance selection, and give the theoretical conditions for such an improvement. We demonstrate that GM is non-monotonic with respect to the number of retained instances, which discourages systematic instance selection. We also show that balancing the distribution frequencies is inferior to a direct maximisation of GM. To verify our theoretical findings, we carried out an experimental study of 12 instance selection methods for imbalanced data, using 66 standard benchmark data sets. The results reveal possible room for new instance selection methods for imbalanced data.
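
For completeness, the GM measure referred to in the abstract is the standard geometric mean of the two class-wise accuracies. Written out with TP, FN, TN and FP denoting the confusion-matrix counts (this is the usual definition, not a formula reproduced from the paper body):

\[
\mathrm{GM} \;=\; \sqrt{\mathrm{TPR}\cdot \mathrm{TNR}} \;=\; \sqrt{\frac{TP}{TP+FN}\cdot \frac{TN}{TN+FP}}
\]

A measure of this form heavily penalises classifiers that sacrifice the minority class: if either rate drops to zero, so does GM, which is why undersampling and instance selection methods are evaluated against it.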


Footnotes
1
We will use the terms “example”, “instance”, “object” and “prototype” interchangeably, meaning a data point in the feature space of interest, e.g. \(\mathbf{x} \in \mathbb{R}^n\).
 
2
We find it curious that no methods in this category have yet been developed to maximise GM.
 
4
We noticed that, while the original OSS is defined by Kubat and Matwin in [30] as CNN followed by TL, Batista et al. [5] later defined it in the reverse order and also independently proposed an equivalent of Kubat’s OSS. This misunderstanding has spread in subsequent works. We have kept the original name OSS for CNN+TL, as used in [30], and use TL+CNN for Batista et al.’s method [5].
 
5
The random selection was performed using the SpreadSubsample supervised instance filter.
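
SpreadSubsample is the supervised instance filter shipped with Weka. Assuming that implementation, a minimal Java sketch of random undersampling to a balanced two-class sample; the file name, spread value and seed are illustrative assumptions, not the paper’s exact configuration:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.instance.SpreadSubsample;

public class RandomUndersample {
    public static void main(String[] args) throws Exception {
        // Load an ARFF data set (file name is illustrative).
        Instances data = new DataSource("imbalanced.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // class is the last attribute

        // A maximum class-distribution spread of 1 randomly discards
        // majority-class instances until both classes have equal frequency.
        SpreadSubsample filter = new SpreadSubsample();
        filter.setDistributionSpread(1.0);
        filter.setRandomSeed(1);
        filter.setInputFormat(data);

        Instances balanced = Filter.useFilter(data, filter);
        System.out.println("Before: " + data.numInstances()
                + " instances, after: " + balanced.numInstances());
    }
}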
 
6
Available in the KEEL GitHub repository: https://github.com/SCI2SUGR/KEEL.
 
Literature
1. Akbani, R., Kwek, S., Japkowicz, N.: Applying support vector machines to imbalanced datasets. In: Boulicaut, J.F., Esposito, F., Giannotti, F., Pedreschi, D. (eds.) Machine Learning: ECML 2004: 15th European Conference on Machine Learning, Pisa, Italy, 20–24 September, 2004. Proceedings, pp. 39–50. Springer Berlin Heidelberg, Berlin, Heidelberg (2004). https://doi.org/10.1007/978-3-540-30115-8_7
2. Alcala-Fdez, J., Fernández, A., Luengo, J., Derrac, J., García, S., Sánchez, L., Herrera, F.: KEEL data-mining software tool: data set repository and integration of algorithms and experimental analysis framework. J. Mult. Valued Logic Soft Comput. 17(2–3), 255–287 (2011)
3. Barandela, R., Sánchez, J., García, V., Rangel, E.: Strategies for learning in class imbalance problems. Pattern Recognit. 36(3), 849–851 (2003)
4. Barandela, R., Valdovinos, R., Sánchez, J.: New applications of ensembles of classifiers. Pattern Anal. Appl. 6(3), 245–256 (2003)
7. Chawla, N., Japkowicz, N., Kotcz, A.: Editorial: special issue on learning from imbalanced data sets. ACM SIGKDD Explor. Newsl. 6(1), 1–6 (2004)
8. Cieslak, D.A., Chawla, N.V., Striegel, A.: Combating imbalance in network intrusion datasets. In: GrC, pp. 732–737 (2006)
10. Cover, T., Hart, P.: Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 13(1), 21–27 (1967)
11. Dal Pozzolo, A., Caelen, O., Le Borgne, Y.A., Waterschoot, S., Bontempi, G.: Learned lessons in credit card fraud detection from a practitioner perspective. Expert Syst. Appl. 41(10), 4915–4928 (2014)
12. Dasarathy, B.V.: Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. IEEE Computer Society Press, Los Alamitos, California (1990)
13. Dietterich, T.: Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput. 10(7), 1895–1923 (1998)
14. Díez-Pastor, J.F., Rodríguez, J.J., García-Osorio, C., Kuncheva, L.I.: Random balance: ensembles of variable priors classifiers for imbalanced data. Knowl. Based Syst. 85, 96–111 (2015)
15. Díez-Pastor, J.F., Rodríguez, J.J., García-Osorio, C.I., Kuncheva, L.I.: Diversity techniques improve the performance of the best imbalance learning ensembles. Inf. Sci. 325, 98–117 (2015)
16. Drown, D.J., Khoshgoftaar, T.M., Seliya, N.: Evolutionary sampling and software quality modeling of high-assurance systems. IEEE Trans. Syst. Man Cybern. Part A Syst. Hum. 39(5), 1097–1107 (2009)
17. Eskildsen, S.F., Coupé, P., Fonov, V., Collins, D.L.: Detecting Alzheimer’s disease by morphological MRI using hippocampal grading and cortical thickness. In: Proceedings of the MICCAI Workshop Challenge on Computer-Aided Diagnosis of Dementia Based on Structural MRI Data, pp. 38–47 (2014)
19. Fix, E., Hodges, J.L.: Discriminatory analysis: non parametric discrimination: small sample performance. Technical report project 21-49-004 (11), USAF School of Aviation Medicine, Randolph Field, Texas (1952)
21. Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., Herrera, F.: A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 42(4), 463–484 (2012)
22. Galar, M., Fernández, A., Barrenechea, E., Herrera, F.: EUSBoost: enhancing ensembles for highly imbalanced data-sets by evolutionary undersampling. Pattern Recognit. 46(12), 3460–3471 (2013)
24. García-Pedrajas, N., Pérez-Rodríguez, J., García-Pedrajas, M.D., Ortiz-Boyer, D., Fyfe, C.: Class imbalance methods for translation initiation site recognition in DNA sequences. Knowl. Based Syst. 25(1), 22–34 (2012)
26. Hart, P.: The condensed nearest neighbor rule (corresp.). IEEE Trans. Inf. Theory 14(3), 515–516 (1968)
29. Krawczyk, B., Galar, M., Jeleń, Ł., Herrera, F.: Evolutionary undersampling boosting for imbalanced classification of breast cancer malignancy. Appl. Soft Comput. 38, 714–726 (2016)
30. Kubat, M., Matwin, S.: Addressing the curse of imbalanced training sets: one-sided selection. In: Proceedings of the Fourteenth International Conference on Machine Learning, pp. 179–186. Morgan Kaufmann (1997)
32. Laurikkala, J.: Improving identification of difficult small classes by balancing class distribution. In: Quaglini, S., Barahona, P., Andreassen, S. (eds.) Artificial Intelligence in Medicine: 8th Conference on Artificial Intelligence in Medicine in Europe, AIME 2001, Cascais, Portugal, 1–4 July, 2001, Proceedings, pp. 63–66. Springer Berlin Heidelberg, Berlin, Heidelberg (2001). https://doi.org/10.1007/3-540-48229-6_9
33. López, V., Triguero, I., Carmona, C.J., García, S., Herrera, F.: Addressing imbalanced classification with instance generation techniques: IPADE-ID. Neurocomputing 126, 15–28 (2014)
34. Phua, C., Alahakoon, D., Lee, V.: Minority report in fraud detection: classification of skewed data. ACM SIGKDD Explor. Newsl. 6(1), 50–59 (2004)
36. Saeys, Y., Inza, I., Larrañaga, P.: A review of feature selection techniques in bioinformatics. Bioinformatics 23(19), 2507–2517 (2007)
38. Sanz, J.A., Bernardo, D., Herrera, F., Bustince, H., Hagras, H.: A compact evolutionary interval-valued fuzzy rule-based classification system for the modeling and prediction of real-world financial applications with imbalanced data. IEEE Trans. Fuzzy Syst. 23(4), 973–990 (2015)
39. Seiffert, C., Khoshgoftaar, T., Van Hulse, J., Napolitano, A.: RUSBoost: a hybrid approach to alleviating class imbalance. IEEE Trans. Syst. Man Cybern. Part A Syst. Hum. 40(1), 185–197 (2010)
40. Seiffert, C., Khoshgoftaar, T.M., Van Hulse, J.: Improving software-quality predictions with data sampling and boosting. IEEE Trans. Syst. Man Cybern. Part A Syst. Hum. 39(6), 1283–1294 (2009)
41. Sun, Z., Song, Q., Zhu, X.: Using coding-based ensemble learning to improve software defect prediction. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 42(6), 1806–1817 (2012)
43. Tesfahun, A., Bhaskari, D.L.: Intrusion detection using random forests classifier with SMOTE and feature reduction. In: 2013 International Conference on Cloud & Ubiquitous Computing & Emerging Technologies (CUBE), pp. 127–132. IEEE (2013)
45. Triguero, I., Derrac, J., García, S., Herrera, F.: A taxonomy and experimental study on prototype generation for nearest neighbor classification. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 42(1), 86–100 (2012)
46. Visa, S., Ralescu, A.: Issues in mining imbalanced data sets—a review paper. In: Proceedings of the Sixteenth Midwest Artificial Intelligence and Cognitive Science Conference, vol. 2005, pp. 67–73 (2005)
49. Witten, I.H., Frank, E., Hall, M.A.: Data Mining: Practical Machine Learning Tools and Techniques, 3rd edn. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (2011)
50. Wu, X., Kumar, V., Quinlan, J.R., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G.J., Ng, A., Liu, B., Yu, P.S., Zhou, Z.H., Steinbach, M., Hand, D.J., Steinberg, D.: Top 10 algorithms in data mining. Knowl. Inf. Syst. 14(1), 1–37 (2008)
52. Zhang, J., Mani, I.: kNN approach to unbalanced data distributions: a case study involving information extraction. In: Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), Workshop on Learning from Imbalanced Data Sets (2003)
53. Zheng, B., Myint, S.W., Thenkabail, P.S., Aggarwal, R.M.: A support vector machine to identify irrigated crop types using time-series Landsat NDVI data. Int. J. Appl. Earth Obs. Geoinf. 34, 103–112 (2015)
Metadata
Title
Instance selection improves geometric mean accuracy: a study on imbalanced data classification
Authors
Ludmila I. Kuncheva
Álvar Arnaiz-González
José-Francisco Díez-Pastor
Iain A. D. Gunn
Publication date
06-02-2019
Publisher
Springer Berlin Heidelberg
Published in
Progress in Artificial Intelligence / Issue 2/2019
Print ISSN: 2192-6352
Electronic ISSN: 2192-6360
DOI
https://doi.org/10.1007/s13748-019-00172-4
