Skip to main content
Erschienen in: Progress in Artificial Intelligence 4/2017

15.05.2017 | Regular Paper

SMOTE-GPU: Big Data preprocessing on commodity hardware for imbalanced classification

Erschienen in: Progress in Artificial Intelligence | Ausgabe 4/2017

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Nowadays, it is usual to work with large amounts of data since our capacity of collecting and storing information has increased significantly. The extraction of knowledge from these scenarios is commonly known as “Big Data,” and it is performed on large clusters with MapReduce platforms. Imbalanced classification poses a problem both in traditional and Big Data learning scenarios. Data sampling is one of the ways that allows to improve the performance on imbalanced problems. A commodity hardware-based method for Big Data problems can offload these computations from the expensive and highly demanded hardware that MapReduce platforms require. The characteristics of some sampling methods make them suitable to be adapted to commodity hardware, taking advantage of the parallel computation capabilities of graphics processing units. SMOTE is one of the most popular oversampling methods which is based on the nearest neighbor rule. The proposed SMOTE-GPU efficiently handles large datasets (several millions of instances) on a wide variety of commodity hardware, including a laptop computer.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Literatur
1.
Zurück zum Zitat Alcalá-Fdez, J., Sánchez, L., García, S., del Jesus, M., Ventura, S., Garrell, J., Otero, J., Romero, C., Bacardit, J., Rivas, V., Fernández, J., Herrera, F.: KEEL: a software tool to assess evolutionary algorithms for data mining problems. Soft Comput. 13(3), 307–318 (2009)CrossRef Alcalá-Fdez, J., Sánchez, L., García, S., del Jesus, M., Ventura, S., Garrell, J., Otero, J., Romero, C., Bacardit, J., Rivas, V., Fernández, J., Herrera, F.: KEEL: a software tool to assess evolutionary algorithms for data mining problems. Soft Comput. 13(3), 307–318 (2009)CrossRef
3.
Zurück zum Zitat Baldi, P., Sadowski, P., Whiteson, D.: Searching for exotic particles in high-energy physics with deep learning. Nat. Commun. 5 (2014) Baldi, P., Sadowski, P., Whiteson, D.: Searching for exotic particles in high-energy physics with deep learning. Nat. Commun. 5 (2014)
4.
Zurück zum Zitat Bradley, A.P.: The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit. 30(7), 1145–1159 (1997)CrossRef Bradley, A.P.: The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit. 30(7), 1145–1159 (1997)CrossRef
5.
Zurück zum Zitat Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: Smote: synthetic minority over-sampling technique. J. Artif. Int. Res. 16(1), 321–357 (2002)MATH Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: Smote: synthetic minority over-sampling technique. J. Artif. Int. Res. 16(1), 321–357 (2002)MATH
7.
Zurück zum Zitat Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)CrossRef Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)CrossRef
9.
Zurück zum Zitat Fernández, A., del Río, S., Chawla, N.V., Herrera, F.: An insight into imbalanced big data classification: outcomes and challenges. Complex Intell. Syst. (in press). doi:10.1007/s40747-017-0037-9 Fernández, A., del Río, S., Chawla, N.V., Herrera, F.: An insight into imbalanced big data classification: outcomes and challenges. Complex Intell. Syst. (in press). doi:10.​1007/​s40747-017-0037-9
11.
Zurück zum Zitat Gutiérrez, P.D., Lastra, M., Bacardit, J., Benítez, J.M., Herrera, F.: GPU–SME–kNN: scalable and memory efficient \(k\)NN and lazy learning using GPUs. Inf. Sci. 373, 165–182 (2016) Gutiérrez, P.D., Lastra, M., Bacardit, J., Benítez, J.M., Herrera, F.: GPU–SME–kNN: scalable and memory efficient \(k\)NN and lazy learning using GPUs. Inf. Sci. 373, 165–182 (2016)
12.
Zurück zum Zitat Gutiérrez, P.D., Lastra, M., Herrera, F., Benitez, J.M.: A high performance fingerprint matching system for large databases based on GPU. IEEE Trans. Inf. Forensics Secur. 9(1), 62–71 (2014)CrossRef Gutiérrez, P.D., Lastra, M., Herrera, F., Benitez, J.M.: A high performance fingerprint matching system for large databases based on GPU. IEEE Trans. Inf. Forensics Secur. 9(1), 62–71 (2014)CrossRef
13.
Zurück zum Zitat He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21(9), 1263–1284 (2009)CrossRef He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21(9), 1263–1284 (2009)CrossRef
14.
15.
Zurück zum Zitat Krawczyk, B.: Learning from imbalanced data: open challenges and future directions. Progr. Artif. Intell. 5(4), 221–232 (2016)CrossRef Krawczyk, B.: Learning from imbalanced data: open challenges and future directions. Progr. Artif. Intell. 5(4), 221–232 (2016)CrossRef
16.
Zurück zum Zitat López, V., Fernández, A., García, S., Palade, V., Herrera, F.: An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics. Inf. Sci. 250, 113–141 (2013)CrossRef López, V., Fernández, A., García, S., Palade, V., Herrera, F.: An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics. Inf. Sci. 250, 113–141 (2013)CrossRef
17.
Zurück zum Zitat Madden, S.: From databases to big data. IEEE Internet Comput. 16(3), 4–6 (2012)CrossRef Madden, S.: From databases to big data. IEEE Internet Comput. 16(3), 4–6 (2012)CrossRef
18.
Zurück zum Zitat Meng, X., Bradley, J., Yavuz, B., Sparks, E., Venkataraman, S., Liu, D., Freeman, J., Tsai, D., Amde, M., Owen, S., et al.: MLLIB: machine learning in apache spark. J. Mach. Learn. Res. 17(34), 1–7 (2016)MATHMathSciNet Meng, X., Bradley, J., Yavuz, B., Sparks, E., Venkataraman, S., Liu, D., Freeman, J., Tsai, D., Amde, M., Owen, S., et al.: MLLIB: machine learning in apache spark. J. Mach. Learn. Res. 17(34), 1–7 (2016)MATHMathSciNet
19.
Zurück zum Zitat Owen, S., Anil, R., Dunning, T., Friedman, E.: Mahout in Action, Manning Publications Co., Greenwich, CT, USA, ISBN:1935182684, 9781935182689 (2011) Owen, S., Anil, R., Dunning, T., Friedman, E.: Mahout in Action, Manning Publications Co., Greenwich, CT, USA, ISBN:1935182684, 9781935182689 (2011)
20.
Zurück zum Zitat Prati, R.C., Batista, G.E.A.P.A., Silva, D.F.: Class imbalance revisited: a new experimental setup to assess the performance of treatment methods. Knowl. Inf. Syst. 45(1), 247–270 (2015)CrossRef Prati, R.C., Batista, G.E.A.P.A., Silva, D.F.: Class imbalance revisited: a new experimental setup to assess the performance of treatment methods. Knowl. Inf. Syst. 45(1), 247–270 (2015)CrossRef
21.
Zurück zum Zitat Rajaraman, A., Ullman, J.: Mining of Massive Datasets. Cambridge University Press, Cambridge (2011)CrossRef Rajaraman, A., Ullman, J.: Mining of Massive Datasets. Cambridge University Press, Cambridge (2011)CrossRef
22.
Zurück zum Zitat Salomon-Ferrer, R., Götz, A., Poole, D., Le Grand, S., Walker, R.: Routine microsecond molecular dynamics simulations with amber on GPUS. 2. Explicit solvent particle mesh ewald. J. Chem. Theory Comput. 9(9), 3878–3888 (2013)CrossRef Salomon-Ferrer, R., Götz, A., Poole, D., Le Grand, S., Walker, R.: Routine microsecond molecular dynamics simulations with amber on GPUS. 2. Explicit solvent particle mesh ewald. J. Chem. Theory Comput. 9(9), 3878–3888 (2013)CrossRef
24.
Zurück zum Zitat Triguero, I., del Río, S., López, V., Bacardit, J., Benítez, J.M., Herrera, F.: ROSEFW-RF: the winner algorithm for the ECBDL’14 big data competition—an extremely imbalanced big data bioinformatics problem. Knowl. Based Syst. 87, 69–79 (2015)CrossRef Triguero, I., del Río, S., López, V., Bacardit, J., Benítez, J.M., Herrera, F.: ROSEFW-RF: the winner algorithm for the ECBDL’14 big data competition—an extremely imbalanced big data bioinformatics problem. Knowl. Based Syst. 87, 69–79 (2015)CrossRef
25.
Zurück zum Zitat White, T.: Hadoop: The Definitive Guide, 4th edn. O’Reilly Media Inc, Sebastopol (2015) White, T.: Hadoop: The Definitive Guide, 4th edn. O’Reilly Media Inc, Sebastopol (2015)
26.
Zurück zum Zitat Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M.J., Shenker, S., Stoica, I.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, pp. 1–14. USENIX Association (2012) Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M.J., Shenker, S., Stoica, I.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, pp. 1–14. USENIX Association (2012)
27.
Zurück zum Zitat Zikopoulos, P.C., Eaton, C., deRoos, D., Deutsch, T., Lapis, G.: Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data, 1st edn. McGraw-Hill, New York (2011) Zikopoulos, P.C., Eaton, C., deRoos, D., Deutsch, T., Lapis, G.: Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data, 1st edn. McGraw-Hill, New York (2011)
Metadaten
Titel
SMOTE-GPU: Big Data preprocessing on commodity hardware for imbalanced classification
Publikationsdatum
15.05.2017
Erschienen in
Progress in Artificial Intelligence / Ausgabe 4/2017
Print ISSN: 2192-6352
Elektronische ISSN: 2192-6360
DOI
https://doi.org/10.1007/s13748-017-0128-2

Weitere Artikel der Ausgabe 4/2017

Progress in Artificial Intelligence 4/2017 Zur Ausgabe