Published in: Computing 3/2017

02.02.2016

Incorporating receiver operating characteristics into naive Bayes for unbalanced data classification

Authors: Taeheung Kim, Byung Do Chung, Jong-Seok Lee

Abstract

Naive Bayesian classification has been widely used in data mining because of its simplicity and its robustness to missing values and irrelevant attributes. However, naive Bayes classifiers sometimes perform poorly because of their unrealistic assumption that all attributes are equally important and conditionally independent of one another. In this research, we dispense with the former assumption by proposing a new attribute weighting method. The proposed method treats each attribute as a single classifier and measures its discriminating ability using the area under the ROC curve (AUC). Each AUC value is then used to weight the corresponding attribute. In addition, we try to reduce the complexity of the classification models by selecting attributes with high AUC. Using 20 real datasets from the machine learning repository at UC Irvine (UCI), we conduct a numerical experiment showing that the proposed method improves on standard naive Bayes classification and existing weighting methods.
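
The weighting idea in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact weighting formula, so everything below is an assumption made for illustration: the raw attribute value serves as the single-attribute classifier score, the AUC is folded around 0.5 with max(AUC, 1 - AUC) so that a reversed ranking counts as equally useful, the weight multiplies the attribute's log-likelihood, and the class-conditional densities are Gaussian. The function names and the toy data are hypothetical and are not taken from the paper.

```python
import math

def auc(scores, labels):
    """AUC as the Mann-Whitney statistic: P(pos score > neg score) + 0.5 * P(tie)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def attribute_weights(X, y):
    """One weight per attribute, from the AUC of that attribute used alone.
    Assumption (not from the paper): fold the AUC around 0.5."""
    weights = []
    for j in range(len(X[0])):
        a = auc([row[j] for row in X], y)
        weights.append(max(a, 1.0 - a))
    return weights

def fit_gaussian_nb(X, y):
    """Estimate prior, per-attribute mean and variance for each class."""
    model = {}
    for c in set(y):
        rows = [row for row, label in zip(X, y) if label == c]
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # Small constant keeps the variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(zip(*rows), means)]
        model[c] = (n / len(y), means, variances)
    return model

def predict(model, weights, x):
    """Return argmax over classes of log P(c) + sum_j w_j * log p(x_j | c)."""
    def score(c):
        prior, means, variances = model[c]
        s = math.log(prior)
        for xj, m, v, w in zip(x, means, variances, weights):
            s += w * (-0.5 * math.log(2 * math.pi * v) - (xj - m) ** 2 / (2 * v))
        return s
    return max(model, key=score)

# Toy unbalanced data (hypothetical): 6 negatives, 2 positives.
# Attribute 0 separates the classes; attribute 1 does not.
X = [[1.0, 5.0], [1.2, 3.0], [0.8, 4.0], [1.1, 6.0],
     [0.9, 3.5], [1.3, 5.5], [3.0, 4.0], [3.2, 5.0]]
y = [0, 0, 0, 0, 0, 0, 1, 1]

weights = attribute_weights(X, y)
model = fit_gaussian_nb(X, y)
print(weights)
print(predict(model, weights, [3.1, 4.5]), predict(model, weights, [1.0, 4.5]))
```

In this sketch the uninformative attribute receives a smaller weight, so it contributes less to the class score. Attribute selection in the same spirit could drop attributes whose weight falls below a chosen cutoff; the abstract does not specify one, so none is used here.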

Metadata
Title
Incorporating receiver operating characteristics into naive Bayes for unbalanced data classification
Authors
Taeheung Kim
Byung Do Chung
Jong-Seok Lee
Publication date
02.02.2016
Publisher
Springer Vienna
Published in
Computing / Issue 3/2017
Print ISSN: 0010-485X
Electronic ISSN: 1436-5057
DOI
https://doi.org/10.1007/s00607-016-0483-z
