2020 | Original Paper | Book Chapter

7. Big Data Discretization

Written by: Julián Luengo, Diego García-Gil, Sergio Ramírez-Gallego, Salvador García, Francisco Herrera

Published in: Big Data Preprocessing

Publisher: Springer International Publishing

Abstract

The data discretization task transforms continuous numerical data into discrete, bounded values that are more understandable for humans and more manageable for a wide range of machine learning methods. With the advent of Big Data, a new wave of large-scale datasets dominated by continuous features has arrived in industry and academia. However, standard discretizers do not cope well with huge sets of continuous points, so novel distributed discretization solutions are needed. In this chapter, we review the most relevant contributions to this field in the literature. We begin by enumerating the early proposals for parallel discretization. Then, we present some distributed solutions capable of scaling to large datasets. We finish with a study of the discretization methods capable of dealing with Big Data streams.
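To make the notion of discretization concrete, the following is a minimal sketch of equal-width binning, one of the simplest ways to map continuous values to a bounded set of interval indices. It is an illustration only and is not one of the methods reviewed in the chapter; the function names `equal_width_cut_points` and `discretize` and the sample data are assumptions introduced here.

```python
from bisect import bisect_right


def equal_width_cut_points(values, n_bins):
    """Return the n_bins - 1 interior cut points of an equal-width split.

    Hypothetical helper for illustration; not taken from the chapter.
    """
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]


def discretize(values, cut_points):
    """Map each continuous value to the index of the interval it falls in."""
    # bisect_right counts how many cut points are <= v, which is the bin index.
    return [bisect_right(cut_points, v) for v in values]


# Illustrative data (an assumption, not from the source).
values = [0.0, 1.5, 2.0, 4.5, 7.0, 9.0, 10.0]
cuts = equal_width_cut_points(values, 4)
bins = discretize(values, cuts)
print(cuts)  # [2.5, 5.0, 7.5]
print(bins)  # [0, 0, 0, 1, 2, 3, 3]
```

Computing the cut points requires the global minimum and maximum of each feature, and supervised discretizers typically need sorted values or class statistics as well, which is part of why the task becomes harder once the data no longer fits on a single machine.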

If the points are stored in an array, a loop is used to evaluate them; otherwise, a distributed map function is used.

Title: Big Data Discretization
Written by: Julián Luengo, Diego García-Gil, Sergio Ramírez-Gallego, Salvador García, Francisco Herrera
Publisher: Springer International Publishing
Book: Big Data Preprocessing
Print ISBN: 978-3-030-39104-1
Electronic ISBN: 978-3-030-39105-8
Copyright year: 2020
DOI: https://doi.org/10.1007/978-3-030-39105-8_7
