Published in: Microsystem Technologies 9/2020

08-08-2019 | Technical Paper

A hybrid system for imbalanced data mining

Authors: Zne-Jung Lee, Chou-Yuan Lee, So-Tsung Chou, Wei-Ping Ma, Fulan Ye, Zhen Chen

Abstract

In the era of information explosion, the production and collection of data are growing massively. Data mining is the process of finding valuable information in data. In imbalanced data, the majority classes have more instances than the minority classes. As imbalanced data grows, classifiers focus mainly on the majority classes and neglect the importance of the minority classes, and these problems become harder and harder to solve. Another obstacle for imbalanced data mining is the lack of resources such as a distributed processing mechanism. Thus, it is not easy to solve these problems with traditional data mining algorithms such as decision trees, random forests and support vector machines. In this paper, a hybrid system based on the support vector machine (SVM) and Apache Spark is proposed for imbalanced data mining. In the proposed system, SVM with two approaches is implemented on Apache Spark to process imbalanced data in parallel. Two datasets from the UCI repository are used to verify the correctness of the proposed system. Simulation results demonstrate that the proposed system significantly improves classification accuracy.
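
The effect of class imbalance described in the abstract can be illustrated with a minimal sketch. It is not the authors' system and uses neither SVM nor Apache Spark. The toy labels (95 majority, 5 minority instances) and the helper names `accuracy` and `per_class_recall` are assumptions made only for illustration. The sketch shows that a classifier which always predicts the majority class can still report high overall accuracy while missing every minority instance.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred):
    """Return {label: recall} for every label that occurs in y_true."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Hypothetical toy data: 95 majority (0) and 5 minority (1) instances.
y_true = [0] * 95 + [1] * 5
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
recall = per_class_recall(y_true, y_pred)
# Balanced accuracy: the unweighted mean of the per-class recalls.
balanced_acc = sum(recall.values()) / len(recall)

print(f"accuracy={acc:.2f} recall={recall} balanced_accuracy={balanced_acc:.2f}")
```

On imbalanced data, per-class recall or balanced accuracy is therefore usually reported next to plain accuracy, since accuracy alone is dominated by the majority class.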

Metadata
Title: A hybrid system for imbalanced data mining
Authors: Zne-Jung Lee, Chou-Yuan Lee, So-Tsung Chou, Wei-Ping Ma, Fulan Ye, Zhen Chen
Publication date: 08-08-2019
Publisher: Springer Berlin Heidelberg
Published in: Microsystem Technologies, Issue 9/2020
Print ISSN: 0946-7076
Electronic ISSN: 1432-1858
DOI: https://doi.org/10.1007/s00542-019-04566-1
