Skip to main content
Erschienen in: Cluster Computing 4/2015

01.12.2015

A distributed frequent itemset mining algorithm using Spark for Big Data analytics

verfasst von: Feng Zhang, Min Liu, Feng Gui, Weiming Shen, Abdallah Shami, Yunlong Ma

Erschienen in: Cluster Computing | Ausgabe 4/2015

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Frequent itemset mining is an essential step in the process of association rule mining. Conventional approaches for mining frequent itemsets in big data era encounter significant challenges when computing power and memory space are limited. This paper proposes an efficient distributed frequent itemset mining algorithm (DFIMA) which can significantly reduce the amount of candidate itemsets by applying a matrix-based pruning approach. The proposed algorithm has been implemented using Spark to further improve the efficiency of iterative computation. Numeric experiment results using standard benchmark datasets by comparing the proposed algorithm with the existing algorithm, parallel FP-growth, show that DFIMA has better efficiency and scalability. In addition, a case study has been carried out to validate the feasibility of DFIMA.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literatur
1.
4.
Zurück zum Zitat Agrawal, R., Srikant, R.: Fast algorithms for mining association rules. In: Proceedings of the 20th International Conference on Very Large Data Bases, VLDB, vol. 1215, pp. 487–499 (1994) Agrawal, R., Srikant, R.: Fast algorithms for mining association rules. In: Proceedings of the 20th International Conference on Very Large Data Bases, VLDB, vol. 1215, pp. 487–499 (1994)
9.
Zurück zum Zitat Zhen-yu, L., Wei-xiang, X., Xumin, L.: Efficiently using matrix in mining maximum frequent itemset. In: WKDD’10. Third International Conference on Knowledge Discovery and Data Mining, 2010, pp. 50–54. IEEE (2010) Zhen-yu, L., Wei-xiang, X., Xumin, L.: Efficiently using matrix in mining maximum frequent itemset. In: WKDD’10. Third International Conference on Knowledge Discovery and Data Mining, 2010, pp. 50–54. IEEE (2010)
10.
Zurück zum Zitat Ye, Y., Chiang, C. C.: A parallel apriori algorithm for frequent itemsets mining. In: Fourth International Conference on, Software Engineering Research, Management and Applications, 2006, pp. 87–94. IEEE (2006) Ye, Y., Chiang, C. C.: A parallel apriori algorithm for frequent itemsets mining. In: Fourth International Conference on, Software Engineering Research, Management and Applications, 2006, pp. 87–94. IEEE (2006)
12.
Zurück zum Zitat Lin, M.Y., Lee, P.Y., Hsueh, S.C.: Apriori-based frequent itemset mining algorithms on MapReduce. In: Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication, ACM 76 (2012). doi:10.1145/2184751.2184842 Lin, M.Y., Lee, P.Y., Hsueh, S.C.: Apriori-based frequent itemset mining algorithms on MapReduce. In: Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication, ACM 76 (2012). doi:10.​1145/​2184751.​2184842
13.
Zurück zum Zitat Moens, S., Aksehirli, E., Goethals, B.: Frequent itemset mining for big data. In: IEEE International Conference on Big Data, pp. 111–118, IEEE (2013) Moens, S., Aksehirli, E., Goethals, B.: Frequent itemset mining for big data. In: IEEE International Conference on Big Data, pp. 111–118, IEEE (2013)
14.
Zurück zum Zitat Pacheco, P.S.: Parallel Programming with MPI. Morgan Kaufmann Publishers Inc, San Francisco (1997) Pacheco, P.S.: Parallel Programming with MPI. Morgan Kaufmann Publishers Inc, San Francisco (1997)
16.
Zurück zum Zitat Otey, M.E., Wang, C., Parthasarathy, S., et al.: Mining frequent itemsets in distributed and dynamic databases. In: Third IEEE International Conference on Data Mining, ICDM 2003, pp. 617–620. IEEE (2003) Otey, M.E., Wang, C., Parthasarathy, S., et al.: Mining frequent itemsets in distributed and dynamic databases. In: Third IEEE International Conference on Data Mining, ICDM 2003, pp. 617–620. IEEE (2003)
17.
Zurück zum Zitat Kaosar, M.G., Xu, Z., Yi, X.: Distributed Association rule mining with minimum communication overhead. In: Proceedings of the Eighth Australasian Data Mining Conference, vol. 101, pp. 17–23. Australian Computer Society Inc (2009) Kaosar, M.G., Xu, Z., Yi, X.: Distributed Association rule mining with minimum communication overhead. In: Proceedings of the Eighth Australasian Data Mining Conference, vol. 101, pp. 17–23. Australian Computer Society Inc (2009)
19.
Zurück zum Zitat Zaharia, M., Chowdhury, M., Franklin, M.J., et al.: Spark: cluster computing with working sets. In: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, pp. 10–10 (2010) Zaharia, M., Chowdhury, M., Franklin, M.J., et al.: Spark: cluster computing with working sets. In: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, pp. 10–10 (2010)
21.
Zurück zum Zitat Inokuchi, A., Washio, T., Motoda, H.: An apriori-based algorithm for mining frequent substructures from graph data. Lect. Notes Comput. Sci. 1910, 13–23 (2000)CrossRef Inokuchi, A., Washio, T., Motoda, H.: An apriori-based algorithm for mining frequent substructures from graph data. Lect. Notes Comput. Sci. 1910, 13–23 (2000)CrossRef
22.
Zurück zum Zitat Pramudiono, I., Kitsuregawa, M.: Parallel FP-growth on PC cluster. In: Advances in Knowledge Discovery and Data Mining, pp. 467–473. Springer, Berlin (2003) Pramudiono, I., Kitsuregawa, M.: Parallel FP-growth on PC cluster. In: Advances in Knowledge Discovery and Data Mining, pp. 467–473. Springer, Berlin (2003)
23.
Zurück zum Zitat Gu, H., Hang, H., Lv, Q., et al.: Fusing text and frienships for location inference in online social networks. In: 2012 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT), vol. 1, pp. 158–165. IEEE (2012) Gu, H., Hang, H., Lv, Q., et al.: Fusing text and frienships for location inference in online social networks. In: 2012 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT), vol. 1, pp. 158–165. IEEE (2012)
24.
Zurück zum Zitat Gu, H., Xie, X., Lv, Q., et al.: Etree: effective and efficient event modeling for real-time online social media networks. In: 2011 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), vol. 1, pp. 300–307. IEEE (2011) Gu, H., Xie, X., Lv, Q., et al.: Etree: effective and efficient event modeling for real-time online social media networks. In: 2011 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), vol. 1, pp. 300–307. IEEE (2011)
25.
Zurück zum Zitat Priyadarsini, S., Viswanathan, R.: Web usage mining for better understanding of user pattern to improve productivity of E-business. Int. J. Appl. Eng. Res. 9(11), 1753–1763 (2014) Priyadarsini, S., Viswanathan, R.: Web usage mining for better understanding of user pattern to improve productivity of E-business. Int. J. Appl. Eng. Res. 9(11), 1753–1763 (2014)
26.
Zurück zum Zitat Boukerche, A., Samarah, S.: A novel algorithm for mining association rules in wireless ad hoc sensor networks. IEEE Trans. Parallel Distrib. Syst. 19(7), 865–877 (2008)CrossRef Boukerche, A., Samarah, S.: A novel algorithm for mining association rules in wireless ad hoc sensor networks. IEEE Trans. Parallel Distrib. Syst. 19(7), 865–877 (2008)CrossRef
28.
Zurück zum Zitat Li, H., Wang, Y., Zhang, D., et al.: Pfp: parallel fp-growth for query recommendation. In: Proceedings of the 2008 ACM Conference on Recommender Systems, ACM 107–114 (2008). doi:10.1145/1454008.1454027 Li, H., Wang, Y., Zhang, D., et al.: Pfp: parallel fp-growth for query recommendation. In: Proceedings of the 2008 ACM Conference on Recommender Systems, ACM 107–114 (2008). doi:10.​1145/​1454008.​1454027
29.
Zurück zum Zitat Yu, K.M., Zhou, J., Hsiao, W.C.: Load balancing approach parallel algorithm for frequent pattern mining. In: Parallel Computing Technologies, pp. 623–631. Springer, Berlin (2007) Yu, K.M., Zhou, J., Hsiao, W.C.: Load balancing approach parallel algorithm for frequent pattern mining. In: Parallel Computing Technologies, pp. 623–631. Springer, Berlin (2007)
30.
Zurück zum Zitat Pei, J., Han, J., Mao, R.: CLOSET: an efficient algorithm for mining frequent closed itemsets. In: ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, vol. 4(2), pp. 21–30 (2000) Pei, J., Han, J., Mao, R.: CLOSET: an efficient algorithm for mining frequent closed itemsets. In: ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, vol. 4(2), pp. 21–30 (2000)
31.
Zurück zum Zitat Chen, M., Gao, X., Li, H.: An efficient parallel FP-Growth algorithm. In: CyberC’09. International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery, 2009, pp. 283–286. IEEE (2009) Chen, M., Gao, X., Li, H.: An efficient parallel FP-Growth algorithm. In: CyberC’09. International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery, 2009, pp. 283–286. IEEE (2009)
32.
Zurück zum Zitat Farzanyar, Z., Cercone, N.: Efficient mining of frequent itemsets in social network data based on MapReduce framework. In: Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ACM 1183–1188 (2013). doi:10.1145/2492517.2500301 Farzanyar, Z., Cercone, N.: Efficient mining of frequent itemsets in social network data based on MapReduce framework. In: Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ACM 1183–1188 (2013). doi:10.​1145/​2492517.​2500301
33.
Zurück zum Zitat Fumarola, F., Malerba, D.: A parallel algorithm for approximate frequent itemset mining using MapReduce. In: 2014 International Conference on High Performance Computing & Simulation (HPCS), pp. 335–342. IEEE (2014) Fumarola, F., Malerba, D.: A parallel algorithm for approximate frequent itemset mining using MapReduce. In: 2014 International Conference on High Performance Computing & Simulation (HPCS), pp. 335–342. IEEE (2014)
34.
Zurück zum Zitat Moens, S., Aksehirli, E., Goethals, B.: Frequent itemset mining for big data. In: 2013 IEEE International Conference on Big Data, pp. 111–118. IEEE (2013) Moens, S., Aksehirli, E., Goethals, B.: Frequent itemset mining for big data. In: 2013 IEEE International Conference on Big Data, pp. 111–118. IEEE (2013)
36.
38.
Zurück zum Zitat Chen, Z., Cai, S., Song, Q., et al.: An improved Apriori algorithm based on pruning optimization and transaction reduction. In: 2011 2nd International Conference on Artificial Intelligence, Management Science and Electronic Commerce (AIMSEC), pp. 908–1911. IEEE (2011) Chen, Z., Cai, S., Song, Q., et al.: An improved Apriori algorithm based on pruning optimization and transaction reduction. In: 2011 2nd International Conference on Artificial Intelligence, Management Science and Electronic Commerce (AIMSEC), pp. 908–1911. IEEE (2011)
39.
Zurück zum Zitat Zaharia, M., Chowdhury, M., Das, T., et al.: Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, USENIX Association 2-2 (2012) Zaharia, M., Chowdhury, M., Das, T., et al.: Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, USENIX Association 2-2 (2012)
41.
Zurück zum Zitat Xin, R.S., Rosen, J., Zaharia, M., et al.: Shark: SQL and rich analytics at scale. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, ACM 13–24 (2013). doi:10.1145/2463676.2465288 Xin, R.S., Rosen, J., Zaharia, M., et al.: Shark: SQL and rich analytics at scale. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, ACM 13–24 (2013). doi:10.​1145/​2463676.​2465288
42.
Zurück zum Zitat Goethals, B., Zaki, M.J.: FIMI’03: Workshop on frequent itemset mining implementations. In: Third IEEE International Conference on Data Mining Workshop on Frequent Itemset Mining Implementations, pp.1–13. IEEE (2003) Goethals, B., Zaki, M.J.: FIMI’03: Workshop on frequent itemset mining implementations. In: Third IEEE International Conference on Data Mining Workshop on Frequent Itemset Mining Implementations, pp.1–13. IEEE (2003)
Metadaten
Titel
A distributed frequent itemset mining algorithm using Spark for Big Data analytics
verfasst von
Feng Zhang
Min Liu
Feng Gui
Weiming Shen
Abdallah Shami
Yunlong Ma
Publikationsdatum
01.12.2015
Verlag
Springer US
Erschienen in
Cluster Computing / Ausgabe 4/2015
Print ISSN: 1386-7857
Elektronische ISSN: 1573-7543
DOI
https://doi.org/10.1007/s10586-015-0477-1

Weitere Artikel der Ausgabe 4/2015

Cluster Computing 4/2015 Zur Ausgabe

Premium Partner