
2020 | Original Paper | Book Chapter

5. Data Reduction for Big Data

Authors: Julián Luengo, Diego García-Gil, Sergio Ramírez-Gallego, Salvador García, Francisco Herrera

Published in: Big Data Preprocessing

Publisher: Springer International Publishing


Abstract

Data reduction in data mining selects or generates the most representative instances in the input data in order to shrink the original, complex instance space and better define the decision boundaries between classes. In principle, reduction techniques should enable the application of learning algorithms to large-scale problems. Nevertheless, standard algorithms struggle with the growing size and complexity of today's problems. The objective of this chapter is to provide several ideas, algorithms, and techniques to address the data reduction problem in Big Data. We begin by analyzing the first ideas on scalable data reduction in single-machine environments. Then we present a distributed data reduction method that solves many of the scalability problems of the sequential approaches. Next we provide a use case of data reduction algorithms in Big Data. Lastly, we study a recent development on data reduction for high-speed streaming systems.
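To make the notion of instance selection concrete, the sketch below implements a single-machine version of Hart's condensed nearest neighbor (CNN) rule, one of the classic reduction techniques this family of methods builds on. It is only a minimal illustration: the function name, parameters, and toy data are our own choices rather than the chapter's, and a Big Data deployment would typically partition the data and run such a reducer per partition (e.g., under MapReduce or Spark) instead of in a single loop.

```python
# Minimal single-machine sketch of instance selection with Hart's condensed
# nearest neighbor (CNN) rule. Illustrative only; names and data are hypothetical.
import numpy as np

def condensed_nn(X, y, rng_seed=0):
    """Return indices of a reduced (condensed) subset of (X, y).

    An instance is added to the store whenever the current store misclassifies
    it under the 1-NN rule; passes are repeated until no additions occur.
    """
    rng = np.random.default_rng(rng_seed)
    order = rng.permutation(len(X))
    store = [order[0]]                      # start with one random instance
    changed = True
    while changed:
        changed = False
        for i in order:
            if i in store:
                continue
            # 1-NN prediction using only the instances kept so far
            dists = np.linalg.norm(X[store] - X[i], axis=1)
            nearest = store[int(np.argmin(dists))]
            if y[nearest] != y[i]:          # misclassified -> keep this instance
                store.append(i)
                changed = True
    return np.array(store)

if __name__ == "__main__":
    # Hypothetical two-class toy data: two Gaussian blobs
    rng = np.random.default_rng(42)
    X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(4, 1, (200, 2))])
    y = np.array([0] * 200 + [1] * 200)
    kept = condensed_nn(X, y)
    print(f"kept {len(kept)} of {len(X)} instances")
```

On toy data like this the condensed set usually retains only a small fraction of the original instances while preserving the 1-NN decision boundary; that size/accuracy trade-off is what the distributed and streaming methods discussed in this chapter scale up to Big Data volumes.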


Metadata
Title: Data Reduction for Big Data
Authors: Julián Luengo, Diego García-Gil, Sergio Ramírez-Gallego, Salvador García, Francisco Herrera
Copyright Year: 2020
DOI: https://doi.org/10.1007/978-3-030-39105-8_5
