
Progress in Artificial Intelligence

10 February 2017 | Regular Paper

MR-DIS: democratic instance selection for big data by MapReduce

Authors: Álvar Arnaiz-González, Alejandro González-Rogel, José-Francisco Díez-Pastor, Carlos López-Nozal

Published in: Progress in Artificial Intelligence | Issue 3/2017


Abstract

Instance selection is a popular preprocessing task in knowledge discovery and data mining. Its purpose is to reduce the size of data sets while maintaining their predictive capabilities. A common problem is that these methods often suffer from high computational complexity, which makes them impractical for processing huge data sets. In this paper, a parallel implementation of the instance selection algorithm Democratic Instance Selection (DIS) is presented. The main advantages of the DIS algorithm are its computational complexity, which is linear in the number of instances, and its internal structure, which lends itself naturally to parallelization. The purpose of this paper is threefold: firstly, the design of the DIS algorithm following the MapReduce model; secondly, its implementation in the popular big data framework Spark; and finally, its empirical comparison on large-scale data sets. The results show that the processing time is reduced linearly as the number of Spark executors increases, which makes it suitable for big data applications. In addition, the algorithm is publicly accessible to the scientific community.
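The abstract does not spell out the steps of the algorithm. As a rough illustration only, the sketch below shows how a vote-based selection scheme of the kind described in [12] might be phrased as map and reduce steps. It uses plain Python rather than Spark, and everything in it is an assumption made for illustration: the function names, the toy per-partition selector (which votes to remove an instance whose nearest neighbour in the partition has the same label), the number of rounds, and the fixed vote threshold. It is not the authors' implementation.

```python
import random
from collections import Counter


def toy_selector(partition):
    """Hypothetical base selector: emit a removal vote for every instance whose
    nearest neighbour inside the partition carries the same label."""
    votes = []
    for idx, x, label in partition:
        others = [(abs(x - ox), olabel)
                  for oidx, ox, olabel in partition if oidx != idx]
        if others and min(others)[1] == label:
            votes.append((idx, 1))
    return votes


def map_round(data, n_partitions, rng):
    """Map step: split the data into disjoint random partitions and collect
    the (index, 1) removal votes produced on each partition."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    partitions = [shuffled[i::n_partitions] for i in range(n_partitions)]
    votes = []
    for part in partitions:
        votes.extend(toy_selector(part))
    return votes


def reduce_votes(all_votes):
    """Reduce step: sum the removal votes per instance index."""
    totals = Counter()
    for idx, vote in all_votes:
        totals[idx] += vote
    return totals


def select_instances(data, rounds=5, n_partitions=2, threshold=3, seed=0):
    """Run several voting rounds and keep the instances whose vote count
    stays below the (assumed, fixed) threshold."""
    rng = random.Random(seed)
    all_votes = []
    for _ in range(rounds):
        all_votes.extend(map_round(data, n_partitions, rng))
    totals = reduce_votes(all_votes)
    return [inst for inst in data if totals[inst[0]] < threshold]


# Toy data: (index, single numeric feature, class label).
data = [(i, float(i), i % 2) for i in range(10)]
selected = select_instances(data)
```

Because each partition is processed independently in the map step, the per-partition work is what a framework such as Spark could distribute across executors. The fixed threshold is a simplification for this sketch; how the threshold is actually chosen is not stated in the abstract.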


In the present paper, the subset selected by the algorithm is referred to interchangeably as the filtered set or the selected set.

We recommend the work of S. García et al. [10] for readers interested in this field.

In Spark, the process located between the map and the reduce phases is usually referred to as shuffle.

Author: Alejandro González-Rogel, https://bitbucket.org/agr00095/tfg-alg.-seleccion-instancias-spark.

In the Spark framework, each worker node has one or more executors, each one of which completes a task. A processor was assigned to each executor in the experimental work.

In [12] the percentage of instances used for error estimation in massive data sets was \(0.1\%\), but the use of a parallel implementation of 1-NN permits an increase in this percentage, improving the precision of the estimation.

Google Cloud Platform: https://cloud.google.com/dataproc/.

1. Amdahl, G.M.: Validity of the single processor approach to achieving large scale computing capabilities. In: Proceedings of the April 18–20, 1967, Spring Joint Computer Conference, AFIPS ’67 (Spring), pp. 483–485. ACM, New York (1967). doi:10.1145/1465482.1465560

2. Angiulli, F., Folino, G.: Distributed nearest neighbor-based condensation of very large data sets. IEEE Trans. Knowl. Data Eng. 19(12), 1593–1606 (2007). doi:10.1109/TKDE.2007.190665

3. Arnaiz-González, Á., Díez-Pastor, J.F., Rodríguez, J.J., García-Osorio, C.I.: Instance selection of linear complexity for big data. Knowl. Based Syst. 107, 83–95 (2016). doi:10.1016/j.knosys.2016.05.056

4. Asimov, D.: The grand tour: a tool for viewing multidimensional data. SIAM J. Sci. Stat. Comput. 6(1), 128–143 (1985)

5. Brighton, H., Mellish, C.: Advances in instance selection for instance-based learning algorithms. Data Min. Knowl. Discov. 6(2), 153–172 (2002). doi:10.1023/A:1014043630878

6. Cano, J.R., Herrera, F., Lozano, M.: Stratification for scaling up evolutionary prototype selection. Pattern Recognit. Lett. 26(7), 953–963 (2005). doi:10.1016/j.patrec.2004.09.043

7. Chen, M., Mao, S., Liu, Y.: Big data: a survey. Mob. Netw. Appl. 19(2), 171–209 (2014). doi:10.1007/s11036-013-0489-0

8. de Haro-García, A., García-Pedrajas, N.: A divide-and-conquer recursive approach for scaling up instance selection algorithms. Data Min. Knowl. Discov. 18(3), 392–418 (2009). doi:10.1007/s10618-008-0121-2

9. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008). doi:10.1145/1327452.1327492

10. Garcia, S., Derrac, J., Cano, J., Herrera, F.: Prototype selection for nearest neighbor classification: taxonomy and empirical study. IEEE Trans. Pattern Anal. Mach. Intell. 34(3), 417–435 (2012). doi:10.1109/TPAMI.2011.142

11. García, S., Luengo, J., Herrera, F.: Data Preprocessing in Data Mining. Springer, Berlin (2014)

12. García-Osorio, C., de Haro-García, A., García-Pedrajas, N.: Democratic instance selection: a linear complexity instance selection algorithm based on classifier ensemble concepts. Artif. Intell. 174(5–6), 410–441 (2010). doi:10.1016/j.artint.2010.01.001

13. Grama, A.Y., Gupta, A., Kumar, V.: Isoefficiency: measuring the scalability of parallel algorithms and architectures. IEEE Parallel Distrib. Technol. 1(3), 12–21 (1993). doi:10.1109/88.242438

14. Hart, P.: The condensed nearest neighbor rule (corresp.). IEEE Trans. Inf. Theory 14(3), 515–516 (1968)

15. Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC ’98, pp. 604–613. ACM, New York (1998). doi:10.1145/276698.276876

16. Laney, D.: 3-D data management: controlling data volume, velocity and variety. Technical Report, META Group Research Note (2001)

17. Leyva, E., González, A., Pérez, R.: Three new instance selection methods based on local sets: a comparative study with several approaches from a bi-objective perspective. Pattern Recognit. 48(4), 1523–1537 (2015). doi:10.1016/j.patcog.2014.10.001

18. Lichman, M.: UCI machine learning repository (2013). http://archive.ics.uci.edu/ml

19. Maillo, J., Ramírez, S., Triguero, I., Herrera, F.: kNN-IS: an iterative Spark-based design of the k-nearest neighbors classifier for big data. Knowl. Based Syst. (2016). doi:10.1016/j.knosys.2016.06.012

20. Minelli, M., Chambers, M., Dhiraj, A.: Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses. Wiley, London (2012). doi:10.1002/9781118562260.fmatter

21. Ramírez-Gallego, S., García, S., Mouriño Talín, H., Martínez-Rego, D., Bolón-Canedo, V., Alonso-Betanzos, A., Benítez, J.M., Herrera, F.: Data discretization: taxonomy and big data challenge. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 6(1), 5–21 (2016). doi:10.1002/widm.1173

22. Triguero, I., Peralta, D., Bacardit, J., García, S., Herrera, F.: MRPR: a MapReduce solution for prototype reduction in big data classification. Neurocomputing 150 Part A, 331–345 (2015). doi:10.1016/j.neucom.2014.04.078

23. Tsai, C.F., Lin, W.C., Ke, S.W.: Big data mining with parallel computing: a comparison of distributed and MapReduce methodologies. J. Syst. Softw. 122, 83–92 (2016). doi:10.1016/j.jss.2016.09.007

24. Wilson, D.R., Martinez, T.R.: Instance pruning techniques. In: Machine Learning: Proceedings of the Fourteenth International Conference (ICML97), pp. 404–411. Morgan Kaufmann (1997)

25. Wu, X., Zhu, X., Wu, G.Q., Ding, W.: Data mining with big data. IEEE Trans. Knowl. Data Eng. 26(1), 97–107 (2014). doi:10.1109/TKDE.2013.109

26. Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. HotCloud 10, 10–10 (2010)

Title: MR-DIS: democratic instance selection for big data by MapReduce
Authors: Álvar Arnaiz-González, Alejandro González-Rogel, José-Francisco Díez-Pastor, Carlos López-Nozal
Publication date: 10 February 2017
Publisher: Springer Berlin Heidelberg
Published in: Progress in Artificial Intelligence / Issue 3/2017
Print ISSN: 2192-6352
Electronic ISSN: 2192-6360
DOI: https://doi.org/10.1007/s13748-017-0117-5
