Skip to main content
Erschienen in: Knowledge and Information Systems 1/2021

21.11.2020 | Regular Paper

Dealing with heterogeneity in the context of distributed feature selection for classification

verfasst von: José Luis Morillo-Salas, Verónica Bolón-Canedo, Amparo Alonso-Betanzos

Erschienen in: Knowledge and Information Systems | Ausgabe 1/2021

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Advances in the information technologies have greatly contributed to the advent of larger datasets. These datasets often come from distributed sites, but even so, their large size usually means they cannot be handled in a centralized manner. A possible solution to this problem is to distribute the data over several processors and combine the different results. We propose a methodology to distribute feature selection processes based on selecting relevant and discarding irrelevant features. This preprocessing step is essential for current high-dimensional sets, since it allows the input dimension to be reduced. We pay particular attention to the problem of data imbalance, which occurs because the original dataset is unbalanced or because the dataset becomes unbalanced after data partitioning. Most works approach unbalanced scenarios by oversampling, while our proposal tests both over- and undersampling strategies. Experimental results demonstrate that our distributed approach to classification obtains comparable accuracy results to a centralized approach, while reducing computational time and efficiently dealing with data imbalance.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Anhänge
Nur mit Berechtigung zugänglich
Fußnoten
4
https://​github.​com/​jlmorillo/​Heterogeneity_​distributed_​features. In the tables for standard and microarray datasets in Appendix, the following information is provided: mean and standard deviation (top row), maximum value (second row) and (in the lowest row) the combination that obtained the maximum value for:
 
Literatur
1.
Zurück zum Zitat Guyon I (2006) Feature extraction: foundations and applications, vol 207. Springer, BerlinCrossRef Guyon I (2006) Feature extraction: foundations and applications, vol 207. Springer, BerlinCrossRef
2.
Zurück zum Zitat Bolón-Canedo V, Sánchez-Maroño N, Alonso-Betanzos A, Brown G (2015) Distributed feature selection: an application to microarray data classification. Appl Soft Comput 30:136–150CrossRef Bolón-Canedo V, Sánchez-Maroño N, Alonso-Betanzos A, Brown G (2015) Distributed feature selection: an application to microarray data classification. Appl Soft Comput 30:136–150CrossRef
3.
Zurück zum Zitat Bolón-Canedo V, Sechidis K, Sánchez-Maroño N, Alonso-Betanzos A, Brown G (2019) Insights into distributed feature ranking. Inf Sci 496:378–398CrossRef Bolón-Canedo V, Sechidis K, Sánchez-Maroño N, Alonso-Betanzos A, Brown G (2019) Insights into distributed feature ranking. Inf Sci 496:378–398CrossRef
4.
Zurück zum Zitat Brankovic A, Hosseini M, Piroddi L (2019) A distributed feature selection algorithm based on distance correlation with an application to microarrays. IEEE/ACM Trans Comput Biol Bioinf 16(6):1802–1815 Brankovic A, Hosseini M, Piroddi L (2019) A distributed feature selection algorithm based on distance correlation with an application to microarrays. IEEE/ACM Trans Comput Biol Bioinf 16(6):1802–1815
5.
Zurück zum Zitat Morán-Fernández L, Bolón-Canedo V, Alonso-Betanzos A (2017) Centralized vs. distributed feature selection methods based on data complexity measures. Knowl Based Syst 117:27–45CrossRef Morán-Fernández L, Bolón-Canedo V, Alonso-Betanzos A (2017) Centralized vs. distributed feature selection methods based on data complexity measures. Knowl Based Syst 117:27–45CrossRef
6.
Zurück zum Zitat He H, Garcia EA (2009) Learning from imbalanced data. IEEE Trans Knowl Data Eng 21(9):1263–1284CrossRef He H, Garcia EA (2009) Learning from imbalanced data. IEEE Trans Knowl Data Eng 21(9):1263–1284CrossRef
7.
Zurück zum Zitat Haixiang G, Yijing L, Shang J, Mingyun G, Yuanyue H, Bing G (2017) Learning from class-imbalanced data: review of methods and applications. Expert Syst Appl 73:220–239CrossRef Haixiang G, Yijing L, Shang J, Mingyun G, Yuanyue H, Bing G (2017) Learning from class-imbalanced data: review of methods and applications. Expert Syst Appl 73:220–239CrossRef
8.
Zurück zum Zitat Murphy P, Pazzani M, Merz C, Brunk C (1994) Reducing misclassification costs. In: International conference of machine learning. Morgan Kauffman, New Brunswick, pp 217–225 Murphy P, Pazzani M, Merz C, Brunk C (1994) Reducing misclassification costs. In: International conference of machine learning. Morgan Kauffman, New Brunswick, pp 217–225
9.
Zurück zum Zitat Tahir MA, Kittler J, Mikolajczyk K, Yan F (2009) A multiple expert approach to the class imbalance problem using inverse random under sampling. In: Multiple Classifier Systems, pp 82–91 Tahir MA, Kittler J, Mikolajczyk K, Yan F (2009) A multiple expert approach to the class imbalance problem using inverse random under sampling. In: Multiple Classifier Systems, pp 82–91
10.
Zurück zum Zitat Solberg AH, Solberg R (1996) A large-scale evaluation of features for automatic detection of oil spills in ERS SAR images. In: International geoscience and remote sensing symposium. Lincoln, NE, pp 1484–1486 Solberg AH, Solberg R (1996) A large-scale evaluation of features for automatic detection of oil spills in ERS SAR images. In: International geoscience and remote sensing symposium. Lincoln, NE, pp 1484–1486
11.
Zurück zum Zitat Chawla NV, Herrera F, Garcia S, Fernandez A (2018) Smote for learning from imbalanced data: progress and challenges, marking the 15-year anniversary. J Artif Intell Res 61:863–905MathSciNetCrossRef Chawla NV, Herrera F, Garcia S, Fernandez A (2018) Smote for learning from imbalanced data: progress and challenges, marking the 15-year anniversary. J Artif Intell Res 61:863–905MathSciNetCrossRef
12.
13.
Zurück zum Zitat He H, Bai Y, Garcia EA, Li S (2008) Adasyn: adaptive synthetic sampling approach for imbalanced learning. In: IEEE joint conference in neural networks, IJCNN 2008 He H, Bai Y, Garcia EA, Li S (2008) Adasyn: adaptive synthetic sampling approach for imbalanced learning. In: IEEE joint conference in neural networks, IJCNN 2008
14.
Zurück zum Zitat Ling C, Li CX (1998) Data mining for direct marketing: problems and solutions. In: Proceedings of the fourth international conference on knowledge discovery and data mining, KDD’98, vol 98, pp 73–79 Ling C, Li CX (1998) Data mining for direct marketing: problems and solutions. In: Proceedings of the fourth international conference on knowledge discovery and data mining, KDD’98, vol 98, pp 73–79
15.
Zurück zum Zitat Junsomboon N, Phienthrakul T (2017) Combining over-sampling and under-sampling techniques for imbalance dataset. In: ICMLC 2017: proceedings of the 9th international conference on machine learning and computing (ICMLC), pp 243–247 Junsomboon N, Phienthrakul T (2017) Combining over-sampling and under-sampling techniques for imbalance dataset. In: ICMLC 2017: proceedings of the 9th international conference on machine learning and computing (ICMLC), pp 243–247
16.
Zurück zum Zitat Ramentol E, Caballero Y, Bello R, Herrera F (2012) SMOTE-RSB*: a hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory Knowledge Inform Syst 33(2): 245–265 Ramentol E, Caballero Y, Bello R, Herrera F (2012) SMOTE-RSB*: a hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory Knowledge Inform Syst 33(2): 245–265
17.
Zurück zum Zitat Sanguanmak Y, Hanskunatai A (2016) DBSM: the combination of DBSCAN and SMOTE for imbalanced data classification. In: 13th international joint conference on computer science and software engineering (JCSSE), pp 1–5 Sanguanmak Y, Hanskunatai A (2016) DBSM: the combination of DBSCAN and SMOTE for imbalanced data classification. In: 13th international joint conference on computer science and software engineering (JCSSE), pp 1–5
18.
Zurück zum Zitat Wang Q, Xin J, Wu J, Zheng N (2017) SVM classification of microaneurysms with imbalanced dataset based on borderline-SMOTE and data cleaning techniques. In: Verikas A, Radeva P, Nikolaev DP, Zhang W, Zhou J (eds) Ninth international conference on machine vision (ICMV 2016), vol 10341. International Society for Optics and Photonics, SPIE, pp 355–361 Wang Q, Xin J, Wu J, Zheng N (2017) SVM classification of microaneurysms with imbalanced dataset based on borderline-SMOTE and data cleaning techniques. In: Verikas A, Radeva P, Nikolaev DP, Zhang W, Zhou J (eds) Ninth international conference on machine vision (ICMV 2016), vol 10341. International Society for Optics and Photonics, SPIE, pp 355–361
19.
Zurück zum Zitat Zhang C, Gao W, Song J, Jiang J (2016) An imbalanced data classification algorithm of improved autoencoder neural network. In: 2016 Eighth international conference on advanced computational intelligence (ICACI). IEEE, pp 95–99 Zhang C, Gao W, Song J, Jiang J (2016) An imbalanced data classification algorithm of improved autoencoder neural network. In: 2016 Eighth international conference on advanced computational intelligence (ICACI). IEEE, pp 95–99
20.
Zurück zum Zitat Krawczyk B, Woźniak M, Schaefer G (2014) Cost-sensitive decision tree ensembles for effective imbalanced classification. Appl Soft Comput 14:554–562CrossRef Krawczyk B, Woźniak M, Schaefer G (2014) Cost-sensitive decision tree ensembles for effective imbalanced classification. Appl Soft Comput 14:554–562CrossRef
21.
Zurück zum Zitat Yang J, Zhou J, Zhu Z, Ma X, Ji Z (2016) Iterative ensemble feature selection for multiclass classification of imbalanced microarray data. J Biol Res (Thessalon) 23(Suppl 1):13CrossRef Yang J, Zhou J, Zhu Z, Ma X, Ji Z (2016) Iterative ensemble feature selection for multiclass classification of imbalanced microarray data. J Biol Res (Thessalon) 23(Suppl 1):13CrossRef
22.
Zurück zum Zitat A fraud detection model based on feature selection and undersampling applied to web payment systems. In: IEEE/WIC/ACM International conference on web intelligence and intelligent agent technology (WI-IAT) A fraud detection model based on feature selection and undersampling applied to web payment systems. In: IEEE/WIC/ACM International conference on web intelligence and intelligent agent technology (WI-IAT)
23.
Zurück zum Zitat Maldonado S, Weber R, Famili F (2014) Feature selection for high-dimensional class-imbalanced datasets using support vector machines. Inf Sci 286:228–246CrossRef Maldonado S, Weber R, Famili F (2014) Feature selection for high-dimensional class-imbalanced datasets using support vector machines. Inf Sci 286:228–246CrossRef
24.
Zurück zum Zitat Hall M (1999) Correlation-based feature selection for machine learning. PhD thesis, The University of Waikato Hall M (1999) Correlation-based feature selection for machine learning. PhD thesis, The University of Waikato
25.
Zurück zum Zitat Mitchell TM (1982) Generalization as search. Artif Intell 18:203–226. Reprinted in Shavlik JW, Dietterich TG (eds) (1990) Readings in machine learning. Morgan Kaufmann, San Francisco Mitchell TM (1982) Generalization as search. Artif Intell 18:203–226. Reprinted in Shavlik JW, Dietterich TG (eds) (1990) Readings in machine learning. Morgan Kaufmann, San Francisco
26.
Zurück zum Zitat Winston PH (1975) Learning structural description from examples. In: Winston PH (ed) The psychology of computer vision. McGraw-Hill, New York Winston PH (1975) Learning structural description from examples. In: Winston PH (ed) The psychology of computer vision. McGraw-Hill, New York
27.
Zurück zum Zitat Quinlan JR (1983) Learning efficient classification procedures and their application to chess end games. In: Michalski RS, Carbonell JG, Mitchell TM (eds) Machine learning: an artificial intelligence approach. Morgan Kaufmann, San Francisco Quinlan JR (1983) Learning efficient classification procedures and their application to chess end games. In: Michalski RS, Carbonell JG, Mitchell TM (eds) Machine learning: an artificial intelligence approach. Morgan Kaufmann, San Francisco
28.
Zurück zum Zitat Quinlan JR (1993) C4.5: programs for machine learning. Morgan Kaufmann, San Francisco Quinlan JR (1993) C4.5: programs for machine learning. Morgan Kaufmann, San Francisco
29.
Zurück zum Zitat Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Wadsworth, BelmontMATH Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Wadsworth, BelmontMATH
30.
Zurück zum Zitat Cortes C, Vapnik VN (1995) Support-vector networks. Mach Learn 20(3):273–297MATH Cortes C, Vapnik VN (1995) Support-vector networks. Mach Learn 20(3):273–297MATH
31.
Zurück zum Zitat Guyon I, Gunn S, Nikravesh M, Zadeh L (2006) Feature extraction. Foundations and applications. Springer, BerlinCrossRef Guyon I, Gunn S, Nikravesh M, Zadeh L (2006) Feature extraction. Foundations and applications. Springer, BerlinCrossRef
32.
Zurück zum Zitat Hand DJ, Mannila H, Smyth P (2001) Principles of data mining. MIT press, Cambridge Hand DJ, Mannila H, Smyth P (2001) Principles of data mining. MIT press, Cambridge
33.
Zurück zum Zitat Bolón-Canedo V, Sánchez-Maroño N, Cerviño-Rabuñal J (2013) Scaling up feature selection: a distributed filter approach. In: Conference of the Spanish Association for artificial intelligence. Springer, Berlin, pp 121–130 Bolón-Canedo V, Sánchez-Maroño N, Cerviño-Rabuñal J (2013) Scaling up feature selection: a distributed filter approach. In: Conference of the Spanish Association for artificial intelligence. Springer, Berlin, pp 121–130
34.
Zurück zum Zitat Bolón-Canedo V, Sánchez-Marono N, Cervino-Rabunal J (2014) Toward parallel feature selection from vertically partitioned data. In: Proceedings of ESANN 2014, pp 395–400 Bolón-Canedo V, Sánchez-Marono N, Cervino-Rabunal J (2014) Toward parallel feature selection from vertically partitioned data. In: Proceedings of ESANN 2014, pp 395–400
35.
Zurück zum Zitat Morán-Fernández L, Bolón-Canedo V, Alonso-Betanzos A (2016) Data complexity measures for analyzing the effect of smote over microarrays. In: ESANN Morán-Fernández L, Bolón-Canedo V, Alonso-Betanzos A (2016) Data complexity measures for analyzing the effect of smote over microarrays. In: ESANN
36.
Zurück zum Zitat de Haro Garcia A (2011) Scaling data mining algorithms. Application to instance and feature selection. PhD thesis, Universidad de Granada de Haro Garcia A (2011) Scaling data mining algorithms. Application to instance and feature selection. PhD thesis, Universidad de Granada
37.
Zurück zum Zitat Hall MA, Smith LA (1998) Practical feature subset selection for machine learning. In: Proceedings of the 21st Australasian computer science conference ACSC 98. Springer, Berlin, pp 181–191 Hall MA, Smith LA (1998) Practical feature subset selection for machine learning. In: Proceedings of the 21st Australasian computer science conference ACSC 98. Springer, Berlin, pp 181–191
39.
Zurück zum Zitat Kononenko I (1994) Estimating attributes: analysis and extensions of relief. In: Machine learning: ECML-94, pp 171–182 Kononenko I (1994) Estimating attributes: analysis and extensions of relief. In: Machine learning: ECML-94, pp 171–182
40.
Zurück zum Zitat Kira K, Rendell LA (1992) A practical approach to feature selection. In: Proceedings of the ninth international workshop on Machine learning. Morgan Kaufmann Publishers Inc., Los Altos, pp 249–256 Kira K, Rendell LA (1992) A practical approach to feature selection. In: Proceedings of the ninth international workshop on Machine learning. Morgan Kaufmann Publishers Inc., Los Altos, pp 249–256
41.
Zurück zum Zitat Hall MA (2000) Correlation-based feature selection for discrete and numeric class machine learning. In: Proceedings of the 17th international conference on machine learning, pp 856–863 Hall MA (2000) Correlation-based feature selection for discrete and numeric class machine learning. In: Proceedings of the 17th international conference on machine learning, pp 856–863
42.
Zurück zum Zitat Dash M, Liu H, Moto H (2003) Consistency-based search in feature selection. Artif Intell 151(1–2):155–176MathSciNetCrossRef Dash M, Liu H, Moto H (2003) Consistency-based search in feature selection. Artif Intell 151(1–2):155–176MathSciNetCrossRef
43.
Zurück zum Zitat Bramer M (2007) Principles of data mining. Springer, BerlinMATH Bramer M (2007) Principles of data mining. Springer, BerlinMATH
44.
Zurück zum Zitat Vapnik V (1999) The nature of statistical learning theory. Springer, BerlinMATH Vapnik V (1999) The nature of statistical learning theory. Springer, BerlinMATH
45.
Zurück zum Zitat Witten IH, Frank E (2005) Data mining practical machine learning tools and techniques. Morgan Kaufmann Publishers Inc., Los AltosMATH Witten IH, Frank E (2005) Data mining practical machine learning tools and techniques. Morgan Kaufmann Publishers Inc., Los AltosMATH
46.
Zurück zum Zitat Altman DG (1991) Practical statistics for medical research. Chapman & Hall, London Altman DG (1991) Practical statistics for medical research. Chapman & Hall, London
47.
Zurück zum Zitat Hollander M, Wolfe DA (1973) Nonparametric statistical methods. John Wiley, New YorkMATH Hollander M, Wolfe DA (1973) Nonparametric statistical methods. John Wiley, New YorkMATH
48.
Zurück zum Zitat Demšar J (2006) Statistical comparisons of classifiers over multiple datasets. J Mach Learn Res 7:1–30MathSciNetMATH Demšar J (2006) Statistical comparisons of classifiers over multiple datasets. J Mach Learn Res 7:1–30MathSciNetMATH
Metadaten
Titel
Dealing with heterogeneity in the context of distributed feature selection for classification
verfasst von
José Luis Morillo-Salas
Verónica Bolón-Canedo
Amparo Alonso-Betanzos
Publikationsdatum
21.11.2020
Verlag
Springer London
Erschienen in
Knowledge and Information Systems / Ausgabe 1/2021
Print ISSN: 0219-1377
Elektronische ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-020-01526-4

Weitere Artikel der Ausgabe 1/2021

Knowledge and Information Systems 1/2021 Zur Ausgabe