Skip to main content
Top
Published in: Computing 3/2017

02-02-2016

Incorporating receiver operating characteristics into naive Bayes for unbalanced data classification

Authors: Taeheung Kim, Byung Do Chung, Jong-Seok Lee

Published in: Computing | Issue 3/2017

Log in

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

Naive Bayesian classification has been widely used in data mining area because of its simplicity and robustness to missing values and irrelevant attributes. However, naive Bayes classifiers sometimes show poor performance due to their unrealistic assumption that all attributes are equally important and conditionally independent of each other. In this research, we dispense with the former assumption by proposing a new attribute weighting method. The proposed method considers each attribute as a single classifier and measures its discriminating ability using the area under an ROC curve (AUC). Each AUC value is then used to weight the corresponding attribute. In addition, we try to reduce the complexity of classification models by selecting high AUC attributes. Using 20 real datasets from the machine learning repository at UC Irvine (UCI), we conduct a numerical experiment to show that the proposed method is an improvement over standard naive Bayes classification and existing weighting methods.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Literature
1.
go back to reference Bradley P (1997) The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recogn 30(7):1145–1159CrossRef Bradley P (1997) The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recogn 30(7):1145–1159CrossRef
2.
go back to reference Campadelli P, Casiraghi E, Valentini G (2005) Support vector machines for candidate nodules classification. Neurocomputing 68:281–288CrossRef Campadelli P, Casiraghi E, Valentini G (2005) Support vector machines for candidate nodules classification. Neurocomputing 68:281–288CrossRef
3.
go back to reference Chan PK, Fan W, Prodromidis AL, Stolfo SJ (1999) Distributed data mining in credit card fraud detection. IEEE Intell Syst 14(6):67–74CrossRef Chan PK, Fan W, Prodromidis AL, Stolfo SJ (1999) Distributed data mining in credit card fraud detection. IEEE Intell Syst 14(6):67–74CrossRef
4.
go back to reference Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357MATH Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357MATH
5.
go back to reference Drummond C, Holte RC (2000) Exploiting the cost (in)sensitivity of decision tree splitting criteria. In: Proceedings of the 17th International Conference on Machine Learning, pp 239–246 Drummond C, Holte RC (2000) Exploiting the cost (in)sensitivity of decision tree splitting criteria. In: Proceedings of the 17th International Conference on Machine Learning, pp 239–246
6.
go back to reference Fawcett T (2006) An introduction to ROC analysis. Pattern Recogn Lett 27:861–874CrossRef Fawcett T (2006) An introduction to ROC analysis. Pattern Recogn Lett 27:861–874CrossRef
7.
go back to reference Ferri C, Flach P, Hernandez-Orallo J (2002) Learning decision trees using the area under the ROC Curve. In: Proceedings of the 19th International Conference on Machine Learning, pp 139–146 Ferri C, Flach P, Hernandez-Orallo J (2002) Learning decision trees using the area under the ROC Curve. In: Proceedings of the 19th International Conference on Machine Learning, pp 139–146
8.
go back to reference Guo H, Viktor H (2004) Learning from imbalanced data sets with boosting and data generation: The DataBoost-IM approach. SIGKDD Explor Spec Issue Imbal Data Sets 6:30–39CrossRef Guo H, Viktor H (2004) Learning from imbalanced data sets with boosting and data generation: The DataBoost-IM approach. SIGKDD Explor Spec Issue Imbal Data Sets 6:30–39CrossRef
9.
go back to reference Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182MATH Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182MATH
10.
go back to reference Hall M (2007) A decision tree-based attribute weighting filter for naive Bayes. Knowl-Based Syst 20(2):120–126CrossRef Hall M (2007) A decision tree-based attribute weighting filter for naive Bayes. Knowl-Based Syst 20(2):120–126CrossRef
11.
go back to reference Hanley JA, McNeil BJ (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143:29–36CrossRef Hanley JA, McNeil BJ (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143:29–36CrossRef
12.
go back to reference Hassan MR, Hossain MM, Bailey J, Ramamohanarao K (2008) Improving k-nearest neighbour classification with distance functions based on receiver operating characteristics. Lec Notes Comput Sci 5211:489–504CrossRef Hassan MR, Hossain MM, Bailey J, Ramamohanarao K (2008) Improving k-nearest neighbour classification with distance functions based on receiver operating characteristics. Lec Notes Comput Sci 5211:489–504CrossRef
13.
go back to reference Hossain MM, Hassan MR, Bailey J (2008) ROC-tree: a novel decision tree induction algorithm based on receiver operating characteristics to classify gene expression data. In: Proceedings of SIAM International Conference on Data Mining, pp 455–465 Hossain MM, Hassan MR, Bailey J (2008) ROC-tree: a novel decision tree induction algorithm based on receiver operating characteristics to classify gene expression data. In: Proceedings of SIAM International Conference on Data Mining, pp 455–465
14.
go back to reference Japkowicz N, Stephen S (2002) The class imbalance problem: a systematic study. Intell Data Anal 6(5):429–449MATH Japkowicz N, Stephen S (2002) The class imbalance problem: a systematic study. Intell Data Anal 6(5):429–449MATH
15.
go back to reference Langley P, Sage S (1994) Induction of selective Bayesian classifiers. In: Proceedings of 10th International Conference on Uncertainty in Artificial Intelligence, pp 399–406 Langley P, Sage S (1994) Induction of selective Bayesian classifiers. In: Proceedings of 10th International Conference on Uncertainty in Artificial Intelligence, pp 399–406
16.
go back to reference Lee CH, Gutierrez F, Dou D (2011) Calculating feature weights in naive Bayes with Kullback-Leibler measure. In: Proceedings of the 11th IEEE International Conference on Data Mining, pp 1146–1151 Lee CH, Gutierrez F, Dou D (2011) Calculating feature weights in naive Bayes with Kullback-Leibler measure. In: Proceedings of the 11th IEEE International Conference on Data Mining, pp 1146–1151
17.
go back to reference Lee JS, Zhu D (2011) When costs are unequal and unknown: a subtree grafting approach for unbalanced data classification. Decision Sci 42(4):803–829CrossRef Lee JS, Zhu D (2011) When costs are unequal and unknown: a subtree grafting approach for unbalanced data classification. Decision Sci 42(4):803–829CrossRef
18.
go back to reference Liu XY, Wu J, Zhou ZH (2009) Exploratory undersampling for class-imbalance learning. IEEE Trans Syst Man Cybern Part B Cybern 39(2):539–550CrossRef Liu XY, Wu J, Zhou ZH (2009) Exploratory undersampling for class-imbalance learning. IEEE Trans Syst Man Cybern Part B Cybern 39(2):539–550CrossRef
19.
go back to reference Tan PN, Steinbach M, Kumar V (2006) Introduction to data mining. Addison Wesley, Boston Tan PN, Steinbach M, Kumar V (2006) Introduction to data mining. Addison Wesley, Boston
20.
go back to reference Tang Y, Krasser S, Alperovitch D, Judge P (2008) Spam sender detection with classification modeling on highly imbalanced mail server behavior data. In: Proceedings of International Conference on Artificial Intelligence and Pattern Recognition, pp 174–180 Tang Y, Krasser S, Alperovitch D, Judge P (2008) Spam sender detection with classification modeling on highly imbalanced mail server behavior data. In: Proceedings of International Conference on Artificial Intelligence and Pattern Recognition, pp 174–180
22.
go back to reference Weiss GM, McCarthy K, Zabar B (2007) Cost-sensitive learning vs. sampling: which is best for handling unbalanced classes with unequal error costs? In: Proceedings of 2007 International Conference on Data Mining, pp 35–41 Weiss GM, McCarthy K, Zabar B (2007) Cost-sensitive learning vs. sampling: which is best for handling unbalanced classes with unequal error costs? In: Proceedings of 2007 International Conference on Data Mining, pp 35–41
23.
go back to reference Wu J, Cai Z (2011) Attribute weighting via differential evolution algorithm for attribute weighted naive Bayes (WNB). J Comput Inform Syst 7(5):1672–1679 Wu J, Cai Z (2011) Attribute weighting via differential evolution algorithm for attribute weighted naive Bayes (WNB). J Comput Inform Syst 7(5):1672–1679
24.
go back to reference Yen SJ, Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions. Expert Syst with Appl 36(3):5718–5727MathSciNetCrossRef Yen SJ, Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions. Expert Syst with Appl 36(3):5718–5727MathSciNetCrossRef
25.
go back to reference Zhang G, Berardi VL (1998) An investigation of neural networks in thyroid function diagnosis. Health Care Manage Sci 1(1):29–37CrossRef Zhang G, Berardi VL (1998) An investigation of neural networks in thyroid function diagnosis. Health Care Manage Sci 1(1):29–37CrossRef
26.
go back to reference Zhang H, Sheng S (2004) Learning weighted naive Bayes with accurate ranking. In: Proceedings of the 4th IEEE International Conference on Data Mining, pp 567–570 Zhang H, Sheng S (2004) Learning weighted naive Bayes with accurate ranking. In: Proceedings of the 4th IEEE International Conference on Data Mining, pp 567–570
Metadata
Title
Incorporating receiver operating characteristics into naive Bayes for unbalanced data classification
Authors
Taeheung Kim
Byung Do Chung
Jong-Seok Lee
Publication date
02-02-2016
Publisher
Springer Vienna
Published in
Computing / Issue 3/2017
Print ISSN: 0010-485X
Electronic ISSN: 1436-5057
DOI
https://doi.org/10.1007/s00607-016-0483-z

Other articles of this Issue 3/2017

Computing 3/2017 Go to the issue

Premium Partner