International Journal of Machine Learning and Cybernetics

08-05-2019 | Original Article

A Gaussian mixture model based combined resampling algorithm for classification of imbalanced credit data sets

Authors: Xu Han, Runbang Cui, Yanfei Lan, Yanzhe Kang, Jiang Deng, Ning Jia

Published in: International Journal of Machine Learning and Cybernetics | Issue 12/2019

Abstract

Credit scoring is a binary classification problem. Credit data sets are also often imbalanced: one class contains few samples and the other contains many. As a result, when only a traditional classifier is used on such data, classification performance suffers. To improve the classification of credit data sets, a Gaussian mixture model based combined resampling algorithm is proposed. The approach first uses a sampling factor to determine the number of samples of the majority class and the minority class. Gaussian mixture clustering is then used to undersample the majority class, and the synthetic minority oversampling technique (SMOTE) is applied to the remaining samples, that is, the minority class, so that the class imbalance is eliminated. The proposed method is compared with several resampling methods commonly used in the analysis of imbalanced credit data sets. The experimental results show that it consistently improves classification performance as measured by F-measure, AUC, G-mean, and other metrics. The method is also strongly robust on credit data sets.
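The abstract names its ingredients without giving details, so the following is only a rough Python sketch of two of them: SMOTE-style oversampling of the minority class, and the F-measure and G-mean used for evaluation. The function names, the default `k=5`, and the `seed` argument are illustrative assumptions and are not taken from the paper. The sampling factor and the Gaussian mixture undersampling step are not shown, because the abstract does not specify how they are computed.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """SMOTE-style oversampling (illustrative sketch, not the paper's code).

    Each synthetic point lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(p, base))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # relative position between base and neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall for the positive class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity and specificity."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)


# Hypothetical data: four minority points, oversampled with six new ones.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority, n_new=6, k=2)

# Hypothetical confusion counts: 50 bad and 950 good applicants, with a
# classifier that labels everyone "good". Accuracy would be 0.95, yet both
# imbalance-aware measures are 0.
always_good_f = f_measure(tp=0, fp=0, fn=50)
always_good_g = g_mean(tp=0, fn=50, tn=950, fp=0)
```

The last two lines suggest why measures such as F-measure and G-mean are reported for imbalanced data: a classifier that ignores the minority class can still reach high accuracy, but it scores zero on both.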

Title: A Gaussian mixture model based combined resampling algorithm for classification of imbalanced credit data sets
Authors: Xu Han, Runbang Cui, Yanfei Lan, Yanzhe Kang, Jiang Deng, Ning Jia
Publication date: 08-05-2019
Publisher: Springer Berlin Heidelberg
Published in: International Journal of Machine Learning and Cybernetics / Issue 12/2019
Print ISSN: 1868-8071
Electronic ISSN: 1868-808X
DOI: https://doi.org/10.1007/s13042-019-00953-2
