2020 | Original Paper | Book Chapter

4. Dimensionality Reduction for Big Data

Authors: Julián Luengo, Diego García-Gil, Sergio Ramírez-Gallego, Salvador García, Francisco Herrera

Published in: Big Data Preprocessing

Publisher: Springer International Publishing

Abstract

In the new era of Big Data, an exponential increase in volume is usually accompanied by an explosion in the number of features. Dimensionality reduction arises as a possible solution for enabling large-scale learning with millions of dimensions. Nevertheless, like any other family of algorithms, reduction methods require a redesign before they can work at such magnitudes. In particular, they must be prepared to tackle the explosive combinatorial effects of "the curse of Big Dimensionality" while embracing the benefits of the "blessing side of dimensionality" (poorly correlated features). In this chapter we analyze the problems and benefits derived from "the curse of Big Dimensionality" and how this problem has spread across many fields, such as the life sciences and the Internet. We then survey the contributions that address the large-scale dimensionality reduction problem. Next, as a case study, we examine in depth the design and behavior of one of the most popular feature selection frameworks in this field. Finally, we review the contributions related to dimensionality reduction in Big Data streams.
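To make the setting concrete, the sketch below shows how a simple filter-style feature selector can be distributed with Apache Spark, the platform used by the framework studied later in the chapter. It is a minimal illustration under stated assumptions, not the authors' framework: discrete features are scored by their mutual information with the class label in one distributed counting pass, and the k best-ranked features are kept. All names (MIFeatureRanking, data, k) are illustrative.

    // A minimal sketch of distributed filter-based feature selection on
    // Apache Spark (Scala). Each discrete feature is scored by its mutual
    // information with the class label, and the k best-ranked are kept.
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.SparkSession

    object MIFeatureRanking {
      // Mutual information I(X; Y) from joint counts of (value, label) pairs.
      def mutualInfo(counts: Map[(Int, Int), Long]): Double = {
        val n  = counts.values.sum.toDouble
        val px = counts.groupBy(_._1._1).mapValues(_.values.sum / n)
        val py = counts.groupBy(_._1._2).mapValues(_.values.sum / n)
        counts.map { case ((x, y), c) =>
          val pxy = c / n
          pxy * math.log(pxy / (px(x) * py(y)))
        }.sum
      }

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("mi-fs").getOrCreate()
        val sc = spark.sparkContext

        // Toy data: (discrete feature vector, class label).
        val data: RDD[(Array[Int], Int)] = sc.parallelize(Seq(
          (Array(1, 0, 1), 1), (Array(0, 0, 1), 0),
          (Array(1, 1, 0), 1), (Array(0, 1, 0), 0)))
        val k = 2

        // One pass over the data: count every (feature, value, label) triple.
        val counts = data.flatMap { case (xs, y) =>
          xs.zipWithIndex.map { case (v, j) => ((j, v, y), 1L) }
        }.reduceByKey(_ + _)

        // Regroup the joint counts per feature and score each with MI.
        val scores = counts
          .map { case ((j, v, y), c) => (j, ((v, y), c)) }
          .groupByKey()
          .mapValues(cs => mutualInfo(cs.toMap))

        // Keep the indices of the k highest-scoring features.
        val selected = scores.sortBy(-_._2).keys.take(k)
        println(s"Selected features: ${selected.mkString(", ")}")
        spark.stop()
      }
    }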

Footnotes
2
Although the feature generation machine is not a distributed method as such, it has been included here because of its outstanding relevance to the comparison.
 
3
The broadcast operation in Spark sends a single copy of a variable to each node, rather than shipping a copy with every task.
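As a minimal illustration of this mechanism (a sketch with toy names, not code from the chapter), the following Scala snippet broadcasts a small weight table once and reuses it inside a distributed map:

    // Toy example: the weight table is serialized and shipped to each
    // executor once, instead of once per task.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("bcast-demo").getOrCreate()
    val sc = spark.sparkContext

    val weights   = Map(0 -> 0.5, 1 -> 1.5, 2 -> 2.0)
    val bcWeights = sc.broadcast(weights) // one copy per node

    val rows = sc.parallelize(Seq(Array(1.0, 2.0, 3.0), Array(0.5, 0.5, 0.5)))
    val dots = rows.map { xs =>
      xs.zipWithIndex.map { case (v, j) => v * bcWeights.value(j) }.sum
    }
    dots.collect().foreach(println) // prints 9.5 and 2.0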
 
5
Note that the whole memory available in the cluster was only usable from the 10-core configuration onward.
 
Metadata
Title
Dimensionality Reduction for Big Data
Authors
Julián Luengo
Diego García-Gil
Sergio Ramírez-Gallego
Salvador García
Francisco Herrera
Copyright year
2020
DOI
https://doi.org/10.1007/978-3-030-39105-8_4
