Published in: Soft Computing 22/2019

08.01.2019 | Methodologies and Application

Using virtual samples to improve learning performance for small datasets with multimodal distributions

Authors: Der-Chiang Li, Liang-Sian Lin, Chien-Chih Chen, Wei-Hao Yu



Abstract

A small dataset, one containing very few samples (at most thirty, as defined in traditional normal-distribution statistics), often makes it difficult for learning algorithms to make precise predictions. Many past virtual sample generation (VSG) approaches have been shown to overcome this issue effectively by adding virtual samples to training sets; however, some of these methods create samples from estimated sample distributions and directly treat those distributions as unimodal, without considering that small data may actually follow multimodal distributions. Accordingly, before estimating sample distributions, this paper employs density-based spatial clustering of applications with noise (DBSCAN) to cluster the small data and applies the AICc (the version of the Akaike information criterion corrected for small datasets) to assess the clustering results as an essential data pre-processing step. When the AICc indicates that the clusters appropriately represent the data dispersion of the small dataset, the sample distribution of each cluster is estimated with the maximal p value (MPV) method to represent a multimodal distribution; otherwise, all of the data are inferred to follow a unimodal distribution. We call the proposed method multimodal MPV (MMPV). Based on the estimated distributions, virtual samples are created with a mechanism that evaluates suitable sample sizes. In the experiments, one real and two public datasets are examined, and the bagging (bootstrap aggregating) procedure is employed to build the models, which are support vector regressions with three kernel functions: linear, polynomial, and radial basis. Based on most of the paired t test results, the forecasting accuracies of MMPV are significantly better than those of MPV, of a VSG method developed from fuzzy C-means, and of REAL (using the original training sets).
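The pipeline the abstract describes (cluster with DBSCAN, accept the clustering only if AICc improves, sample virtually per cluster, then train a bagged SVR) can be sketched roughly as follows. This is a minimal illustration, not the authors' method: the toy bimodal data, the `eps`/`min_samples` settings, the Gaussian fits standing in for the MPV distribution estimate, and the interpolated targets for virtual samples are all assumptions made for the sketch.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.ensemble import BaggingRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Hypothetical small bimodal 1-D dataset x with a noisy target y.
x = np.concatenate([rng.normal(0.0, 0.5, 15), rng.normal(5.0, 0.5, 15)])
y = np.sin(x) + rng.normal(0.0, 0.1, x.size)

# Step 1: cluster the inputs with DBSCAN to detect possible modes
# (label -1 marks noise points, which are excluded).
labels = DBSCAN(eps=1.0, min_samples=3).fit_predict(x.reshape(-1, 1))
clusters = [x[labels == k] for k in set(labels) if k != -1]

def aicc(groups):
    """AICc of a per-group Gaussian fit (2 parameters per group)."""
    ll, k, n = 0.0, 0, 0
    for g in groups:
        mu, sd = g.mean(), g.std(ddof=1)
        ll += np.sum(-np.log(sd * np.sqrt(2 * np.pi))
                     - (g - mu) ** 2 / (2 * sd ** 2))
        k += 2
        n += g.size
    return 2 * k - 2 * ll + 2 * k * (k + 1) / (n - k - 1)

# Step 2: keep the multimodal view only if it lowers AICc versus
# a single unimodal fit over all the data.
multimodal = aicc(clusters) < aicc([x])
groups = clusters if multimodal else [x]

# Step 3: draw virtual samples from each group's fitted normal.
# (The paper estimates distributions with MPV; a plain Gaussian fit
# is substituted here to keep the sketch short.)
virtual = np.concatenate(
    [rng.normal(g.mean(), g.std(ddof=1), 30) for g in groups])

# Step 4: bagged SVRs (RBF kernel) on real + virtual samples; virtual
# targets are interpolated from the real data, purely for illustration.
vy = np.interp(virtual, np.sort(x), y[np.argsort(x)])
X = np.concatenate([x, virtual]).reshape(-1, 1)
Y = np.concatenate([y, vy])
model = BaggingRegressor(SVR(kernel="rbf"),
                         n_estimators=10, random_state=0).fit(X, Y)
print(model.predict([[0.0]]))
```

With the two well-separated toy modes, DBSCAN typically finds two clusters and AICc prefers the multimodal fit, so virtual samples are drawn around each mode rather than from one wide Gaussian spanning both.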


Metadata
Title
Using virtual samples to improve learning performance for small datasets with multimodal distributions
Authors
Der-Chiang Li
Liang-Sian Lin
Chien-Chih Chen
Wei-Hao Yu
Publication date
08.01.2019
Publisher
Springer Berlin Heidelberg
Published in
Soft Computing / Issue 22/2019
Print ISSN: 1432-7643
Electronic ISSN: 1433-7479
DOI
https://doi.org/10.1007/s00500-018-03744-z
