2020 | Original Paper | Book Chapter

4. Dimensionality Reduction for Big Data

Authors: Julián Luengo, Diego García-Gil, Sergio Ramírez-Gallego, Salvador García, Francisco Herrera

Published in: Big Data Preprocessing

Publisher: Springer International Publishing

Abstract

In the new era of Big Data, an exponential increase in volume is usually accompanied by an explosion in the number of features. Dimensionality reduction arises as a possible solution for enabling large-scale learning with millions of dimensions. Nevertheless, like any other family of algorithms, reduction methods require a redesign before they can work at such magnitudes. In particular, they must be prepared to tackle the explosive combinatorial effects of "the curse of Big Dimensionality" while embracing the benefits of the "blessing side of dimensionality" (poorly correlated features). In this chapter we analyze the problems and benefits derived from "the curse of Big Dimensionality" and how this problem has spread across many fields, such as the life sciences and the Internet. We then survey the contributions that address the large-scale dimensionality reduction problem. Next, as a case study, we examine in depth the design and behavior of one of the most popular feature selection frameworks in this field. Finally, we review the contributions related to dimensionality reduction in Big Data streams.
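To make the setting concrete, the sketch below shows how a simple filter-style feature selector can be distributed with Apache Spark, the platform used by the framework studied later in the chapter. It is a minimal illustration under stated assumptions, not the authors' framework: discrete features are scored by their mutual information with the class label in one distributed counting pass, and the k best-ranked features are kept. All names (MIFeatureRanking, data, k) are illustrative.

    // A minimal sketch of distributed filter-based feature selection on
    // Apache Spark (Scala). Each discrete feature is scored by its mutual
    // information with the class label, and the k best-ranked are kept.
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.SparkSession

    object MIFeatureRanking {
      // Mutual information I(X; Y) from joint counts of (value, label) pairs.
      def mutualInfo(counts: Map[(Int, Int), Long]): Double = {
        val n  = counts.values.sum.toDouble
        val px = counts.groupBy(_._1._1).mapValues(_.values.sum / n)
        val py = counts.groupBy(_._1._2).mapValues(_.values.sum / n)
        counts.map { case ((x, y), c) =>
          val pxy = c / n
          pxy * math.log(pxy / (px(x) * py(y)))
        }.sum
      }

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("mi-fs").getOrCreate()
        val sc = spark.sparkContext

        // Toy data: (discrete feature vector, class label).
        val data: RDD[(Array[Int], Int)] = sc.parallelize(Seq(
          (Array(1, 0, 1), 1), (Array(0, 0, 1), 0),
          (Array(1, 1, 0), 1), (Array(0, 1, 0), 0)))
        val k = 2

        // One pass over the data: count every (feature, value, label) triple.
        val counts = data.flatMap { case (xs, y) =>
          xs.zipWithIndex.map { case (v, j) => ((j, v, y), 1L) }
        }.reduceByKey(_ + _)

        // Regroup the joint counts per feature and score each with MI.
        val scores = counts
          .map { case ((j, v, y), c) => (j, ((v, y), c)) }
          .groupByKey()
          .mapValues(cs => mutualInfo(cs.toMap))

        // Keep the indices of the k highest-scoring features.
        val selected = scores.sortBy(-_._2).keys.take(k)
        println(s"Selected features: ${selected.mkString(", ")}")
        spark.stop()
      }
    }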

Footnotes
2
Although the feature generation machine is not a distributed method as such, it has been included here because of its outstanding relevance to the comparison.
 
3
The broadcast operation in Spark sends a single copy of a variable to each node, rather than shipping a copy with every task.
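As a minimal illustration of this mechanism (a sketch with toy names, not code from the chapter), the following Scala snippet broadcasts a small weight table once and reuses it inside a distributed map:

    // Toy example: the weight table is serialized and shipped to each
    // executor once, instead of once per task.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("bcast-demo").getOrCreate()
    val sc = spark.sparkContext

    val weights   = Map(0 -> 0.5, 1 -> 1.5, 2 -> 2.0)
    val bcWeights = sc.broadcast(weights) // one copy per node

    val rows = sc.parallelize(Seq(Array(1.0, 2.0, 3.0), Array(0.5, 0.5, 0.5)))
    val dots = rows.map { xs =>
      xs.zipWithIndex.map { case (v, j) => v * bcWeights.value(j) }.sum
    }
    dots.collect().foreach(println) // prints 9.5 and 2.0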
 
5
Note that the whole memory available in the cluster was only usable from the 10-core configuration onward.
 
Metadata
Title
Dimensionality Reduction for Big Data
Authors
Julián Luengo
Diego García-Gil
Sergio Ramírez-Gallego
Salvador García
Francisco Herrera
Copyright year
2020
DOI
https://doi.org/10.1007/978-3-030-39105-8_4
