2019 | OriginalPaper | Book chapter

5. Machine Learning Methods for Imbalanced Data

Authors: Osamu Komori, Shinto Eguchi

Published in: Statistical Methods for Imbalanced Data in Ecological and Biological Studies

Publisher: Springer Japan

Abstract

We discuss high-dimensional data analysis in the framework of pattern recognition and machine learning, including single-component analysis and clustering analysis. Several boosting methods for tackling imbalances in sample sizes are investigated.

Metadata
Title
Machine Learning Methods for Imbalanced Data
Authors
Osamu Komori
Shinto Eguchi
Copyright year
2019
Publisher
Springer Japan
DOI
https://doi.org/10.1007/978-4-431-55570-4_5