nach oben

Cluster Computing

Erschienen in:

01.12.2015

A distributed frequent itemset mining algorithm using Spark for Big Data analytics

verfasst von: Feng Zhang, Min Liu, Feng Gui, Weiming Shen, Abdallah Shami, Yunlong Ma

Erschienen in: Cluster Computing | Ausgabe 4/2015

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config

KI-gestützte Suche

Aus

Abstract

Frequent itemset mining is an essential step in the process of association rule mining. Conventional approaches for mining frequent itemsets in big data era encounter significant challenges when computing power and memory space are limited. This paper proposes an efficient distributed frequent itemset mining algorithm (DFIMA) which can significantly reduce the amount of candidate itemsets by applying a matrix-based pruning approach. The proposed algorithm has been implemented using Spark to further improve the efficiency of iterative computation. Numeric experiment results using standard benchmark datasets by comparing the proposed algorithm with the existing algorithm, parallel FP-growth, show that DFIMA has better efficiency and scalability. In addition, a case study has been carried out to validate the feasibility of DFIMA.

Vorheriger Artikel Decomposition tree: a spatio-temporal indexing method for movement big data

Nächster Artikel Runtime self-monitoring approach of business process compliance in cloud environments

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Sandhu, R., Sood, S.K.: Scheduling of big data applications on distributed cloud based on QoS parameters. Clust. Comput. 18, 1–12 (2014). doi:10.1007/s10586-014-0416-6

Han, L., Ong, H.Y.: Parallel data intensive applications using MapReduce: a data mining case study in biomedical sciences. Clust. Comput. 18(1), 403–418 (2015). doi:10.1007/s10586-014-0405-9 CrossRef

Chen, Y., Li, F., Fan, J.: Mining association rules in big data with NGEP. Clust. Comput. 18, 1–9 (2015). doi:10.1007/s10586-014-0419-3 CrossRef

Agrawal, R., Srikant, R.: Fast algorithms for mining association rules. In: Proceedings of the 20th International Conference on Very Large Data Bases, VLDB, vol. 1215, pp. 487–499 (1994)

Agrawal, R., Shafer, J.C.: Parallel mining of association rules. IEEE Trans. Knowl. Data Eng. 8(6), 962–969 (1996). doi:10.1109/69.553164 CrossRef

Grahne, G., Zhu, J.: Fast algorithms for frequent itemset mining using fp-trees. IEEE Trans. Knowl. Data Eng. 17(10), 1347–1362 (2005). doi:10.1109/TKDE.2005.166 CrossRef

Mohamed, M.H., Darwieesh, M.M.: Efficient mining frequent itemsets algorithms. Int. J. Mach. Learn. Cybern. 5(6), 823–833 (2014). doi:10.1007/s13042-013-0172-6 CrossRef

Totad, S.G., Geeta, R.B., Reddy, P.P.: Batch incremental processing for FP-tree construction using FP-Growth algorithm. Knowl. Inf. Syst. 33(2), 475–490 (2012). doi:10.1007/s10115-012-0514-9 CrossRef

Zhen-yu, L., Wei-xiang, X., Xumin, L.: Efficiently using matrix in mining maximum frequent itemset. In: WKDD’10. Third International Conference on Knowledge Discovery and Data Mining, 2010, pp. 50–54. IEEE (2010)

10.

Ye, Y., Chiang, C. C.: A parallel apriori algorithm for frequent itemsets mining. In: Fourth International Conference on, Software Engineering Research, Management and Applications, 2006, pp. 87–94. IEEE (2006)

11.

Upadhyaya, S.: Parallel approaches to machine learning—a comprehensive survey. J. Parallel Distrib. Comput. 73(3), 284–292 (2013). doi:10.1016/j.jpdc.2012.11.001 CrossRef

12.

Lin, M.Y., Lee, P.Y., Hsueh, S.C.: Apriori-based frequent itemset mining algorithms on MapReduce. In: Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication, ACM 76 (2012). doi:10.1145/2184751.2184842

13.

Moens, S., Aksehirli, E., Goethals, B.: Frequent itemset mining for big data. In: IEEE International Conference on Big Data, pp. 111–118, IEEE (2013)

14.

Pacheco, P.S.: Parallel Programming with MPI. Morgan Kaufmann Publishers Inc, San Francisco (1997)

15.

Li, S., Hoefler, T., Hu, C., et al.: Improved MPI collectives for MPI processes in shared address spaces. Clust. Comput. 17(4), 1139–1155 (2014). doi:10.1007/s10586-014-0361-4 CrossRef

16.

Otey, M.E., Wang, C., Parthasarathy, S., et al.: Mining frequent itemsets in distributed and dynamic databases. In: Third IEEE International Conference on Data Mining, ICDM 2003, pp. 617–620. IEEE (2003)

17.

Kaosar, M.G., Xu, Z., Yi, X.: Distributed Association rule mining with minimum communication overhead. In: Proceedings of the Eighth Australasian Data Mining Conference, vol. 101, pp. 17–23. Australian Computer Society Inc (2009)

18.

Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008). doi:10.1145/1327452.1327492 CrossRef

19.

Zaharia, M., Chowdhury, M., Franklin, M.J., et al.: Spark: cluster computing with working sets. In: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, pp. 10–10 (2010)

20.

Jiang, H., Chen, Y., Qiao, Z., et al.: Scaling up MapReduce-based big data processing on Multi-GPU systems. Clust. Comput. 18(1), 369–383 (2015). doi:10.1007/s10586-014-0400-1 CrossRef

21.

Inokuchi, A., Washio, T., Motoda, H.: An apriori-based algorithm for mining frequent substructures from graph data. Lect. Notes Comput. Sci. 1910, 13–23 (2000)CrossRef

22.

Pramudiono, I., Kitsuregawa, M.: Parallel FP-growth on PC cluster. In: Advances in Knowledge Discovery and Data Mining, pp. 467–473. Springer, Berlin (2003)

23.

Gu, H., Hang, H., Lv, Q., et al.: Fusing text and frienships for location inference in online social networks. In: 2012 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT), vol. 1, pp. 158–165. IEEE (2012)

24.

Gu, H., Xie, X., Lv, Q., et al.: Etree: effective and efficient event modeling for real-time online social media networks. In: 2011 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), vol. 1, pp. 300–307. IEEE (2011)

25.

Priyadarsini, S., Viswanathan, R.: Web usage mining for better understanding of user pattern to improve productivity of E-business. Int. J. Appl. Eng. Res. 9(11), 1753–1763 (2014)

26.

Boukerche, A., Samarah, S.: A novel algorithm for mining association rules in wireless ad hoc sensor networks. IEEE Trans. Parallel Distrib. Syst. 19(7), 865–877 (2008)CrossRef

27.

Zhou, L., Wang, X.: Research of the FP-growth algorithm based on cloud environments. J. Softw. 9(3), 676–683 (2014). doi:10.4304/jsw.9.3.676-683

28.

Li, H., Wang, Y., Zhang, D., et al.: Pfp: parallel fp-growth for query recommendation. In: Proceedings of the 2008 ACM Conference on Recommender Systems, ACM 107–114 (2008). doi:10.1145/1454008.1454027

29.

Yu, K.M., Zhou, J., Hsiao, W.C.: Load balancing approach parallel algorithm for frequent pattern mining. In: Parallel Computing Technologies, pp. 623–631. Springer, Berlin (2007)

30.

Pei, J., Han, J., Mao, R.: CLOSET: an efficient algorithm for mining frequent closed itemsets. In: ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, vol. 4(2), pp. 21–30 (2000)

31.

Chen, M., Gao, X., Li, H.: An efficient parallel FP-Growth algorithm. In: CyberC’09. International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery, 2009, pp. 283–286. IEEE (2009)

32.

Farzanyar, Z., Cercone, N.: Efficient mining of frequent itemsets in social network data based on MapReduce framework. In: Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ACM 1183–1188 (2013). doi:10.1145/2492517.2500301

33.

Fumarola, F., Malerba, D.: A parallel algorithm for approximate frequent itemset mining using MapReduce. In: 2014 International Conference on High Performance Computing & Simulation (HPCS), pp. 335–342. IEEE (2014)

34.

Moens, S., Aksehirli, E., Goethals, B.: Frequent itemset mining for big data. In: 2013 IEEE International Conference on Big Data, pp. 111–118. IEEE (2013)

35.

Yu, K., Zhou, J., Zhou, J., et al.: A load-balanced distributed parallel mining algorithm. Expert Syst. Appl. 37(3), 2459–2464 (2010). doi:10.1016/j.eswa.2009.07.074 MathSciNetCrossRef

36.

Ozkural, E., Ucar, B., Aykanat, C.: Parallel frequent item set mining with selective item replication. IEEE Trans. Parallel Distrib. Syst. 23(10), 1632–1640 (2011). doi:10.1109/TPDS.2011.32 CrossRef

37.

Aouad, L.M., Le-Khac, N.A., Kechadi, T.M.: Performance study of distributed Apriori-like frequent itemsets mining. Knowl. Inf. Syst. 23(1), 55–72 (2010). doi:10.1007/s10115-009-0205-3 CrossRef

38.

Chen, Z., Cai, S., Song, Q., et al.: An improved Apriori algorithm based on pruning optimization and transaction reduction. In: 2011 2nd International Conference on Artificial Intelligence, Management Science and Electronic Commerce (AIMSEC), pp. 908–1911. IEEE (2011)

39.

Zaharia, M., Chowdhury, M., Das, T., et al.: Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, USENIX Association 2-2 (2012)

40.

Haveliwala, T.H.: Topic-sensitive pagerank: a context-sensitive ranking algorithm for web search. IEEE Trans. Knowl. Data Eng. 15(4), 784–796 (2003). doi:10.1109/TKDE.2003.1208999 CrossRef

41.

Xin, R.S., Rosen, J., Zaharia, M., et al.: Shark: SQL and rich analytics at scale. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, ACM 13–24 (2013). doi:10.1145/2463676.2465288

42.

Goethals, B., Zaki, M.J.: FIMI’03: Workshop on frequent itemset mining implementations. In: Third IEEE International Conference on Data Mining Workshop on Frequent Itemset Mining Implementations, pp.1–13. IEEE (2003)

Titel: A distributed frequent itemset mining algorithm using Spark for Big Data analytics
verfasst von: Feng Zhang
Min Liu
Feng Gui
Weiming Shen
Abdallah Shami
Yunlong Ma
Publikationsdatum: 01.12.2015
Verlag: Springer US
Erschienen in: Cluster Computing / Ausgabe 4/2015
Print ISSN: 1386-7857
Elektronische ISSN: 1573-7543
DOI: https://doi.org/10.1007/s10586-015-0477-1

Springer Professional

Abstract

Bitte loggen Sie sich ein, um Zugang zu Ihrer Lizenz zu erhalten.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Springer Professional "Technik"

Springer Professional "Wirtschaft"

Weitere Artikel der Ausgabe 4/2015

Unleashing GPU acceleration for symmetric band linear algebra kernels and model reduction

Stochastic model: reCAPTCHA controller based co-variance matrix analysis on frequency distribution using trust evaluation and re-eval by Aumann agreement theorem against DDoS attack in MANET

Holistic multiobjective planning of datacenters powered by renewable energy

Decomposition tree: a spatio-temporal indexing method for movement big data

Auto-localized multimedia platform based on a modular Cyber Physical System aligned in a two-dimensional grid

Runtime self-monitoring approach of business process compliance in cloud environments

Premium Partner