Published in: Neural Processing Letters 6/2023

09-06-2023

Data Augmentation Generated by Generative Adversarial Network for Small Sample Datasets Clustering

Authors: Hui Yu, Qiao Feng Wang, Jian Yu Shi



Abstract

In data mining, clustering performance depends heavily on the number of samples. However, obtaining enough data samples is difficult and expensive in some applications. To address this problem, data augmentation techniques such as oversampling methods have been adopted, but these methods focus mainly on the local information of the data and do not consider its underlying distribution. This paper proposes a new data augmentation method, the Wasserstein Generative Adversarial Network based on the Gaussian Mixture Model (GMM_WGAN), which generates data for small-sample datasets to address insufficient dataset size in clustering. The method has two steps: first, a Gaussian Mixture Model captures the underlying distribution of the real dataset; second, a Wasserstein generative adversarial network generates data samples to expand the small dataset. We use five clustering algorithms to evaluate GMM_WGAN and compare it with seven other data augmentation methods. Experiments on 10 small datasets show that the proposed approach achieves better results than the other methods on five evaluation metrics.
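
The two ideas named in the abstract, a Gaussian mixture as a model of the data distribution and the Wasserstein distance that a WGAN critic approximates, can be illustrated with a toy one-dimensional sketch. This is not the authors' GMM_WGAN implementation: no neural network is trained, and the function names, mixture parameters, and sample sizes below are assumptions chosen only for illustration. The sketch relies on the fact that, in one dimension, the empirical Wasserstein-1 distance between two equal-size samples is the mean absolute difference of their sorted values.

```python
import random


def sample_mixture(weights, means, stds, n, rng):
    """Draw n points from a 1-D Gaussian mixture (hypothetical parameters)."""
    samples = []
    for _ in range(n):
        # Pick a component according to the mixture weights, then draw from it.
        k = rng.choices(range(len(weights)), weights=weights)[0]
        samples.append(rng.gauss(means[k], stds[k]))
    return samples


def wasserstein_1d(xs, ys):
    """Empirical Wasserstein-1 distance between two equal-size 1-D samples."""
    if len(xs) != len(ys):
        raise ValueError("samples must have the same size")
    # In one dimension the optimal transport plan pairs up sorted points.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


rng = random.Random(0)

# Hypothetical two-cluster "real" data and two candidate synthetic sets.
real = sample_mixture([0.5, 0.5], [-2.0, 2.0], [0.5, 0.5], 200, rng)
good = sample_mixture([0.5, 0.5], [-2.0, 2.0], [0.5, 0.5], 200, rng)
poor = sample_mixture([1.0], [0.0], [0.5], 200, rng)

print("same mixture:   ", wasserstein_1d(real, good))
print("single Gaussian:", wasserstein_1d(real, poor))
```

The synthetic set drawn from the same two-component mixture should sit much closer to the real sample than the set drawn from a single Gaussian, which loses the cluster structure. That is the intuition behind modelling the distribution before generating samples for clustering.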


Metadata
Title
Data Augmentation Generated by Generative Adversarial Network for Small Sample Datasets Clustering
Authors
Hui Yu
Qiao Feng Wang
Jian Yu Shi
Publication date
09-06-2023
Publisher
Springer US
Published in
Neural Processing Letters / Issue 6/2023
Print ISSN: 1370-4621
Electronic ISSN: 1573-773X
DOI
https://doi.org/10.1007/s11063-023-11315-z
