Published in: Neural Computing and Applications 3/2018

03-04-2018 | Original Article

Similarity-based attribute weighting methods via clustering algorithms in the classification of imbalanced medical datasets

Author: Kemal Polat

Abstract

In pattern recognition and machine learning, data preprocessing algorithms have been used increasingly in recent years to achieve high classification performance. Preprocessing before classification has become practically unavoidable for medical datasets with a nonlinear and imbalanced data distribution. In this study, a new data preprocessing method is proposed for classifying five such datasets: Parkinson, hepatitis, Pima Indians, single-photon emission computed tomography (SPECT) heart, and thoracic surgery. These datasets were taken from the UCI Machine Learning Repository. The proposed method consists of three steps. In the first step, the cluster centers of each attribute are calculated using the k-means, fuzzy c-means, and mean shift clustering algorithms. In the second step, the absolute differences between the data in each attribute and the cluster centers are calculated, and these differences are then averaged for each attribute. In the final step, the weighting coefficients are calculated by dividing the mean difference by the cluster centers, and weighting is performed by multiplying the attribute values in the dataset by these coefficients. Three attribute weighting methods are proposed: (1) similarity-based attribute weighting in k-means clustering, (2) similarity-based attribute weighting in fuzzy c-means clustering, and (3) similarity-based attribute weighting in mean shift clustering. The aim is to bring the data in each class closer together with the proposed attribute weighting methods and so reduce the within-class variance. Reducing the variance in each class gathers the data of each class together and, at the same time, increases the discrimination between the classes.

For comparison with an approach from the literature, random subsampling was also used to handle the imbalanced dataset classification. After attribute weighting, four classification algorithms were used to classify the imbalanced medical datasets: linear discriminant analysis, the k-nearest neighbor classifier, the support vector machine, and the random forest classifier. Performance was evaluated using classification accuracy, precision, recall, area under the ROC curve, κ value, and F-measure. The classifier models were trained and tested with three schemes: a 50–50% train–test holdout, a 60–40% train–test holdout, and tenfold cross-validation. The experimental results show that the proposed attribute weighting methods achieved higher classification performance than random subsampling in classifying the imbalanced medical datasets.
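The three weighting steps described in the abstract can be sketched roughly as follows for the k-means variant. This is a minimal, dependency-free illustration and not the author's implementation. The abstract does not state the number of clusters or how several centers enter the formulas, so the sketch makes two assumptions: each value is compared with its nearest cluster center, and the mean difference is divided by the mean of the attribute's cluster centers. The function names (`kmeans_1d`, `attribute_weight`, `weight_dataset`), the default `k=2`, and the toy data are hypothetical. A hand-written one-dimensional k-means stands in for a library implementation; fuzzy c-means and mean shift are not shown.

```python
from statistics import fmean


def kmeans_1d(values, k=2, iters=100):
    """Plain Lloyd's k-means on a single attribute; returns sorted centers."""
    ordered = sorted(values)
    # Deterministic start: k values spread evenly across the sorted column.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centers[c]))
            groups[nearest].append(v)
        # An empty cluster keeps its previous center.
        updated = [fmean(g) if g else c for g, c in zip(groups, centers)]
        if updated == centers:
            break
        centers = updated
    return sorted(centers)


def attribute_weight(values, k=2):
    """Weight for one attribute, under the assumptions stated above."""
    # Step 1: cluster centers of this attribute.
    centers = kmeans_1d(values, k)
    # Step 2: mean absolute difference between each value and its
    # nearest center (ASSUMPTION: the source does not say which center).
    mean_diff = fmean(min(abs(v - c) for c in centers) for v in values)
    # Step 3: divide by the mean of the centers (ASSUMPTION).
    center_ref = fmean(centers)
    # ASSUMPTION: leave the attribute unweighted if the divisor is zero.
    return mean_diff / center_ref if center_ref else 1.0


def weight_dataset(rows, k=2):
    """Multiply every attribute value by the weight of its attribute."""
    columns = list(zip(*rows))
    weights = [attribute_weight(col, k) for col in columns]
    weighted = [[w * v for w, v in zip(weights, row)] for row in rows]
    return weighted, weights


# Toy data (hypothetical): four samples, two attributes.
rows = [[1.0, 12.0], [2.0, 8.0], [9.0, 31.0], [10.0, 29.0]]
weighted, weights = weight_dataset(rows)
```

Under this reading, an attribute whose values sit exactly on their cluster centers receives a weight of zero, so a practical implementation would need to decide how to treat that case. The class labels play no part in the weighting, which means the same weights can be computed before any of the four classifiers is applied.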

Metadata
Title
Similarity-based attribute weighting methods via clustering algorithms in the classification of imbalanced medical datasets
Author
Kemal Polat
Publication date
03-04-2018
Publisher
Springer London
Published in
Neural Computing and Applications / Issue 3/2018
Print ISSN: 0941-0643
Electronic ISSN: 1433-3058
DOI
https://doi.org/10.1007/s00521-018-3471-8
