Skip to main content
Erschienen in: The Journal of Supercomputing 8/2017

30.01.2017

HFIM: a Spark-based hybrid frequent itemset mining algorithm for big data processing

verfasst von: Krishan Kumar Sethi, Dharavath Ramesh

Erschienen in: The Journal of Supercomputing | Ausgabe 8/2017

Einloggen

Aktivieren Sie unsere intelligente Suche um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Frequent itemset mining is one of the data mining techniques applied to discover frequent patterns, used in prediction, association rule mining, classification, etc. Apriori algorithm is an iterative algorithm, which is used to find frequent itemsets from transactional dataset. It scans complete dataset in each iteration to generate the large frequent itemsets of different cardinality, which seems better for small data but not feasible for big data. The MapReduce framework provides the distributed environment to run the Apriori on big transactional data. However, MapReduce is not suitable for iterative process and declines the performance. We introduce a novel algorithm named Hybrid Frequent Itemset Mining (HFIM), which utilizes the vertical layout of dataset to solve the problem of scanning the dataset in each iteration. Vertical dataset carries information to find support of each itemsets. Moreover, we also include some enhancements to reduce number of candidate itemsets. The proposed algorithm is implemented over Spark framework, which incorporates the concept of resilient distributed datasets and performs in-memory processing to optimize the execution time of operation. We compare the performance of HFIM with another Spark-based implementation of Apriori algorithm for various datasets. Experimental results show that the HFIM performs better in terms of execution time and space consumption.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Literatur
1.
Zurück zum Zitat Hashem IAT, Yaqoob I, Anuar NB, Mokhtar S, Gani A, Khan SU (2015) The rise of “big data” on cloud computing: review and open research issues. Infor Syst 47:98–115CrossRef Hashem IAT, Yaqoob I, Anuar NB, Mokhtar S, Gani A, Khan SU (2015) The rise of “big data” on cloud computing: review and open research issues. Infor Syst 47:98–115CrossRef
2.
Zurück zum Zitat Philip Chen CL, Zhang CY (2014) Data-intensive applications, challenges, techniques and technologies: a survey on big data. Inf. Sci. 275:314–347CrossRef Philip Chen CL, Zhang CY (2014) Data-intensive applications, challenges, techniques and technologies: a survey on big data. Inf. Sci. 275:314–347CrossRef
3.
Zurück zum Zitat Han J, Kamber M, Pei J (2011) Data mining: concepts and techniques. Elsevier, New YorkMATH Han J, Kamber M, Pei J (2011) Data mining: concepts and techniques. Elsevier, New YorkMATH
4.
Zurück zum Zitat Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. In: Proceeding VLDB ’94 of 20th International Conference Very Large Data Bases, vol 1215, pp 487–499 Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. In: Proceeding VLDB ’94 of 20th International Conference Very Large Data Bases, vol 1215, pp 487–499
5.
Zurück zum Zitat Bayardo RJ Jr (1998) Efficiently mining long patterns from databases. ACM Sigmod Record 27(2):85–93CrossRef Bayardo RJ Jr (1998) Efficiently mining long patterns from databases. ACM Sigmod Record 27(2):85–93CrossRef
6.
Zurück zum Zitat Pacheco PS (1997) Parallel programming with MPI. Morgan Kaufmann, San FranciscoMATH Pacheco PS (1997) Parallel programming with MPI. Morgan Kaufmann, San FranciscoMATH
8.
Zurück zum Zitat Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: Distributed data-parallel programs from sequential building blocks. In: ACM SIGOPS Operative System Review pp 59–72 Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: Distributed data-parallel programs from sequential building blocks. In: ACM SIGOPS Operative System Review pp 59–72
9.
Zurück zum Zitat Karau H, Konwinski A, Wendell P, Zaharia M (2015) Learning spark: lightning-fast big data analysis. O’Reilly Media, Inc Karau H, Konwinski A, Wendell P, Zaharia M (2015) Learning spark: lightning-fast big data analysis. O’Reilly Media, Inc
11.
Zurück zum Zitat Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113CrossRef Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113CrossRef
12.
Zurück zum Zitat Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Stoica I (2012) Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. USENIX Association Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Stoica I (2012) Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. USENIX Association
15.
Zurück zum Zitat Luper D, Cameron D, Miller J, Arabnia HR (2007) Spatial and Temporal Target Association through Semantic Analysis and GPS Data Mining. In: IKE (vol 7, pp 25–28) Luper D, Cameron D, Miller J, Arabnia HR (2007) Spatial and Temporal Target Association through Semantic Analysis and GPS Data Mining. In: IKE (vol 7, pp 25–28)
16.
Zurück zum Zitat Jafri R, Ali SA, Arabnia HR, Fatima S (2014) Computer vision-based object recognition for the visually impaired in an indoors environment: a survey. Vis Comput 30(11):1197–1222CrossRef Jafri R, Ali SA, Arabnia HR, Fatima S (2014) Computer vision-based object recognition for the visually impaired in an indoors environment: a survey. Vis Comput 30(11):1197–1222CrossRef
17.
Zurück zum Zitat Arabnia HR, Fang WC, Lee C, Zhang Y (2010) Context-aware middleware and intelligent agents for smart environments. IEEE Intell Syst 25(2):10–11CrossRef Arabnia HR, Fang WC, Lee C, Zhang Y (2010) Context-aware middleware and intelligent agents for smart environments. IEEE Intell Syst 25(2):10–11CrossRef
18.
Zurück zum Zitat Ter Mors A, Valk J, Witteveen C, Arabnia HR, Mun Y (2004) Coordinating autonomous planners Ter Mors A, Valk J, Witteveen C, Arabnia HR, Mun Y (2004) Coordinating autonomous planners
19.
Zurück zum Zitat Jafri R, Arabnia HR (2008) Fusion of face and gait for automatic human recognition. In: IEEE Fifth International Conference on Information Technology: New Generations, ITNG 2008 (pp 167–173) Jafri R, Arabnia HR (2008) Fusion of face and gait for automatic human recognition. In: IEEE Fifth International Conference on Information Technology: New Generations, ITNG 2008 (pp 167–173)
20.
Zurück zum Zitat Rahbarinia B, Pedram MM, Arabnia HR, Alavi Z (2010) A multi-objective scheme to hide sequential patterns. In: IEEE the 2nd International Conference on Computer and Automation Engineering (ICCAE), 2010 (vol 1, pp 153–158) Rahbarinia B, Pedram MM, Arabnia HR, Alavi Z (2010) A multi-objective scheme to hide sequential patterns. In: IEEE the 2nd International Conference on Computer and Automation Engineering (ICCAE), 2010  (vol 1, pp 153–158)
21.
Zurück zum Zitat Jafri R, Ali SA, Arabnia HR (2013) Computer vision-based object recognition for the visually impaired using visual tags. In: Proceedings of the International Conference on Image Processing, Computer Vision, and Pattern Recognition (IPCV 2013: Las Vegas, USA), pp 400–406 Jafri R, Ali SA, Arabnia HR (2013) Computer vision-based object recognition for the visually impaired using visual tags. In: Proceedings of the International Conference on Image Processing, Computer Vision, and Pattern Recognition (IPCV 2013: Las Vegas, USA), pp 400–406
22.
Zurück zum Zitat Ye Y, Chiang CC (2006) A parallel Apriori algorithm for frequent itemsets mining. In Proceedings of Fourth International Conference Software Engineering Research Management and applications SERA 2006:87–94 Ye Y, Chiang CC (2006) A parallel Apriori algorithm for frequent itemsets mining. In Proceedings of Fourth International Conference Software Engineering Research Management and applications SERA 2006:87–94
23.
Zurück zum Zitat Bodon F (2010) A fast apriori implementation. In: Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI’03), vol 90 Bodon F (2010) A fast apriori implementation. In: Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI’03), vol 90
24.
Zurück zum Zitat Bodon F (2004) Surprising Results of Trie-based FIM Algorithms. FIMI Bodon F (2004) Surprising Results of Trie-based FIM Algorithms. FIMI
25.
Zurück zum Zitat Lin MY, Lee PY, Hsueh SC (2012) Apriori-based frequent itemset mining algorithms on MapReduce. In: Proceedings of 6th International Conference Ubiquitous Information Management Communication–ICUIMC ’12. 1 Lin MY, Lee PY, Hsueh SC (2012) Apriori-based frequent itemset mining algorithms on MapReduce. In: Proceedings of 6th International Conference Ubiquitous Information Management Communication–ICUIMC ’12. 1
26.
Zurück zum Zitat Li N, Zeng L, He Q, Shi Z (2012) Parallel Implementation of Apriori Algorithm Based on MapReduce. In: ACIS International Conference Software Engineering, Artificial Intelligence Networking and Parallel/Distributed Computing, pp 236–241 Li N, Zeng L, He Q, Shi Z (2012) Parallel Implementation of Apriori Algorithm Based on MapReduce. In: ACIS International Conference Software Engineering, Artificial Intelligence Networking and Parallel/Distributed Computing, pp 236–241
27.
Zurück zum Zitat Yu Run-Ming et al (2014) An efficient Frequent Patterns Mining Algorithm based on MapReduce Framework, Software Intelligence Technologies and Applications & International Conference on Frontiers of Internet of Things Yu Run-Ming et al (2014) An efficient Frequent Patterns Mining Algorithm based on MapReduce Framework, Software Intelligence Technologies and Applications & International Conference on Frontiers of Internet of Things
28.
29.
Zurück zum Zitat Lin X (2014) MR-Apriori: Association Rules Algorithm Based on MapReduce. In: 5th IEEE International Conference on Software Engineering and Service Science (ICSESS) Lin X (2014) MR-Apriori: Association Rules Algorithm Based on MapReduce. In: 5th IEEE International Conference on Software Engineering and Service Science (ICSESS)
30.
Zurück zum Zitat Yang XY, Liu Z, Fu Y (2010) MapReduce as a programming model for association rules algorithm on Hadoop. In: IEEE 3rd International Conference on Information Sciences and Interaction Sciences (ICIS) Yang XY, Liu Z, Fu Y (2010) MapReduce as a programming model for association rules algorithm on Hadoop. In: IEEE 3rd International Conference on Information Sciences and Interaction Sciences (ICIS)
31.
Zurück zum Zitat Qiu H, Gu R, Yuan C, Huang Y (2014) YAFIM: A parallel frequent itemset mining algorithm with Spark. In: Proceedings of International Parallel Distribution Process of Symposium IPDPS, pp 1664–1671 Qiu H, Gu R, Yuan C, Huang Y (2014) YAFIM: A parallel frequent itemset mining algorithm with Spark. In: Proceedings of International Parallel Distribution Process of Symposium IPDPS, pp 1664–1671
32.
Zurück zum Zitat Yang S, Xu G, Wang Z, Zhou F (2015) The Parallel Improved Apriori Algorithm Research Based on Spark. In: Proceedings of 2015 9th International Conference Frontier of Computer Science and Technology FCST 2015, pp 354–359 Yang S, Xu G, Wang Z, Zhou F (2015) The Parallel Improved Apriori Algorithm Research Based on Spark. In: Proceedings of 2015 9th International Conference Frontier of Computer Science and Technology FCST 2015, pp 354–359
33.
Zurück zum Zitat Rathee S, Kaul M, Kashyap A (2015) R-Apriori: an efficient Apriori based algorithm on Spark. In: Proceedings of the 8th Workshop on Ph.D. Workshop in Information and Knowledge Management. ACM, pp 27–34 Rathee S, Kaul M, Kashyap A (2015) R-Apriori: an efficient Apriori based algorithm on Spark. In: Proceedings of the 8th Workshop on Ph.D. Workshop in Information and Knowledge Management. ACM, pp 27–34
34.
Zurück zum Zitat Gui F, Ma Y, Zhang F, Liu M, Li F, Shen W, Bai H (2015) A distributed frequent itemset mining algorithm based on Spark. In: IEEE 19th International Conference Computer Supported Cooperative Work Design, vol 18, pp 271–275 Gui F, Ma Y, Zhang F, Liu M, Li F, Shen W, Bai H (2015) A distributed frequent itemset mining algorithm based on Spark. In: IEEE 19th International Conference Computer Supported Cooperative Work Design, vol 18, pp 271–275
35.
Zurück zum Zitat Zaki MJ, et al (1997) Parallel algorithms for discovery of association rules. In: Data Mining and Knowledge Discovery 1.4, pp 343–373 Zaki MJ, et al (1997) Parallel algorithms for discovery of association rules. In: Data Mining and Knowledge Discovery 1.4, pp 343–373
39.
Zurück zum Zitat Dharavath R et al (2014) An Apriori-Based Vertical Fragmentation Technique for Heterogeneous Distributed Database Transactions. Intelligent Computing, Networking, and Informatics. Springer India, pp 687–695 Dharavath R et al (2014) An Apriori-Based Vertical Fragmentation Technique for Heterogeneous Distributed Database Transactions. Intelligent Computing, Networking, and Informatics. Springer India, pp 687–695
Metadaten
Titel
HFIM: a Spark-based hybrid frequent itemset mining algorithm for big data processing
verfasst von
Krishan Kumar Sethi
Dharavath Ramesh
Publikationsdatum
30.01.2017
Verlag
Springer US
Erschienen in
The Journal of Supercomputing / Ausgabe 8/2017
Print ISSN: 0920-8542
Elektronische ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-017-1963-4

Weitere Artikel der Ausgabe 8/2017

The Journal of Supercomputing 8/2017 Zur Ausgabe