A new MapReduce associative classifier based on a new storage format for large-scale imbalanced data

Authors: Mehrdad Almasi, Mohammad Saniee Abadeh

Published in: Cluster Computing | Issue 4/2018

Published: 6 July 2018

Abstract

Knowledge discovery from big, high-dimensional datasets has become a popular research topic. Classification is a key task in bioinformatics, business intelligence, decision science, astronomy, physics and other fields. Building associative classifiers has attracted notable research interest in recent years because of their superior accuracy. In associative classifiers, applying under-sampling or over-sampling to imbalanced big datasets reduces accuracy or increases running time, respectively. Hence, there is a significant need for efficient associative classifiers for imbalanced big data problems. These classifiers should be able to handle challenges such as memory usage, running time and efficient exploration of the search space. To this end, the efficient calculation of measures is a primary objective for associative classifiers. In this paper, we propose a new efficient associative classifier for big imbalanced datasets. The proposed method is based on Rare-PEARs (a multi-objective evolutionary algorithm that efficiently discovers rare and reliable association rules) and is able to evaluate rules in a distributed manner by using a new data storage format. This format simplifies the calculation of measures and is fully compatible with the MapReduce programming model. We have applied the proposed method (RPII) to a well-known big dataset (ECBDL’14) and have compared our results with seven other learning methods. The experimental results show that RPII outperforms the other methods in the sensitivity and final score measures (approximately 0.74 and 0.54, respectively). The results demonstrate that the proposed method is a good candidate for large-scale classification problems; furthermore, it achieves reasonable execution time when the target platform is a typical computer cluster.
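
As a rough illustration of what evaluating a rule in a distributed manner can look like, the sketch below counts, per data partition, how many records match a rule's antecedent and how many of those also carry the rule's class, then sums the partial counts to obtain support and confidence. It is a generic map/reduce-style pattern in plain Python, not the RPII storage format and not the authors' implementation; the record layout, the interval-based rule representation and all function names are assumptions made for this example.

```python
from functools import reduce

# Hypothetical rule: the antecedent maps attribute -> (low, high) interval,
# the consequent is a class label. This layout is an assumption, not the paper's.
RULE = {"antecedent": {"x": (0.0, 0.5)}, "label": 1}


def matches(features, antecedent):
    """True if every attribute in the antecedent falls inside its interval."""
    return all(lo <= features[attr] <= hi for attr, (lo, hi) in antecedent.items())


def map_partition(partition, rule):
    """Map step: partial counts (records, antecedent hits, rule hits) for one partition."""
    n = hits = both = 0
    for features, label in partition:
        n += 1
        if matches(features, rule["antecedent"]):
            hits += 1
            if label == rule["label"]:
                both += 1
    return (n, hits, both)


def reduce_counts(left, right):
    """Reduce step: element-wise sum of two partial count tuples."""
    return tuple(l + r for l, r in zip(left, right))


def evaluate_rule(partitions, rule):
    """Combine per-partition counts into support and confidence of the rule."""
    partials = (map_partition(p, rule) for p in partitions)
    n, hits, both = reduce(reduce_counts, partials, (0, 0, 0))
    support = both / n if n else 0.0
    confidence = both / hits if hits else 0.0
    return support, confidence


# Toy data split into two partitions; each record is (features, class label).
partitions = [
    [({"x": 0.1}, 1), ({"x": 0.9}, 0)],
    [({"x": 0.3}, 0), ({"x": 0.4}, 1)],
]
support, confidence = evaluate_rule(partitions, RULE)
```

Because each partition contributes only three integers, the reduce step is associative and cheap, which is the property that makes this kind of measure calculation a natural fit for MapReduce.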

Title: A new MapReduce associative classifier based on a new storage format for large-scale imbalanced data
Authors: Mehrdad Almasi, Mohammad Saniee Abadeh
Publication date: 6 July 2018
Publisher: Springer US
Published in: Cluster Computing, Issue 4/2018
Print ISSN: 1386-7857
Electronic ISSN: 1573-7543
DOI: https://doi.org/10.1007/s10586-018-2812-9
