Published in: Evolutionary Intelligence 2/2021

24.09.2019 | Special Issue

Data augmentation for cancer classification in oncogenomics: an improved KNN based approach

Authors: Poonam Chaudhari, Himanshu Agarwal, Vikrant Bhateja


Abstract

There is currently a great need for research on gene expression data to support cancer classification in oncogenomics, especially because the disease occurs sporadically and often without symptoms. Gene expression data are typically imbalanced, with a large number of features and a small number of samples. A small sample size is likely to degrade classification accuracy, since a classifier's performance depends largely on the data it is trained on. There is therefore a pressing need to generate data that serve as better input to classifiers. Primitive augmentation techniques such as uniform random generation and noise addition do not guarantee a good probability distribution. Moreover, because the application is critical, the augmented data must closely resemble the original values. We therefore propose an improved variant of the K-nearest neighbor (KNN) rule. For each target sample, we use a Counting Quotient Filter, Euclidean distance, and the mean of the best values from the k neighbors to generate synthetic samples. A comparison is drawn among the raw data from the public domain (original data), data generated using the standard K-nearest neighbor rule, and data generated using the improved K-nearest neighbor rule. The data generated by these approaches are then classified using state-of-the-art classifiers: SVM, J48, and DNN. The samples generated through the improved technique yield better recall values than the standard implementation, ensuring sensitivity of the data. Averaged over the three classifiers, classification accuracy improves by 7.72% compared with the traditional KNN approach and by 16% compared with feeding the raw data to the classifiers. The proposed algorithm thus achieves two objectives: ensuring sensitivity of the data for critical applications and enhancing classification accuracy.
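The abstract only outlines the augmentation procedure, so the sketch below is an illustrative reconstruction rather than the authors' exact algorithm: for each target sample it selects the k nearest neighbors by Euclidean distance and emits their mean as a synthetic sample. A plain Python set stands in for the Counting Quotient Filter (used here only to suppress duplicate synthetic rows), and the function name knn_mean_augment and its parameters are hypothetical.

```python
import numpy as np

def knn_mean_augment(X, k=5, n_synthetic=None, rng=None):
    """Sketch of KNN-based augmentation: for each randomly chosen target
    sample, average its k nearest neighbors (Euclidean distance) to form a
    synthetic sample. A Python set approximates the role of the paper's
    Counting Quotient Filter, here only for duplicate suppression."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    n_synthetic = n if n_synthetic is None else n_synthetic
    seen = set()                            # stand-in for the Counting Quotient Filter
    synthetic = []
    targets = rng.choice(n, size=n_synthetic, replace=True)
    for i in targets:
        # Euclidean distances from the target to every sample
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                       # exclude the target itself
        neighbors = np.argsort(d)[:k]       # indices of the k nearest samples
        candidate = X[neighbors].mean(axis=0)
        key = candidate.round(6).tobytes()  # approximate membership test
        if key not in seen:
            seen.add(key)
            synthetic.append(candidate)
    return np.vstack(synthetic) if synthetic else np.empty((0, X.shape[1]))

# Example: augment a small gene-expression-like matrix (20 samples x 100 features)
X = np.random.default_rng(0).normal(size=(20, 100))
X_aug = knn_mean_augment(X, k=5)
print(X.shape, X_aug.shape)
```

Under these assumptions, the synthetic rows interpolate within local neighborhoods of the original distribution, which is the property the abstract emphasizes over uniform random generation or added noise.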


Metadata
Title
Data augmentation for cancer classification in oncogenomics: an improved KNN based approach
Authors
Poonam Chaudhari
Himanshu Agarwal
Vikrant Bhateja
Publication date
24.09.2019
Publisher
Springer Berlin Heidelberg
Published in
Evolutionary Intelligence / Issue 2/2021
Print ISSN: 1864-5909
Electronic ISSN: 1864-5917
DOI
https://doi.org/10.1007/s12065-019-00283-w
