Skip to main content
Top

2018 | OriginalPaper | Chapter

MapFIM+: Memory Aware Parallelized Frequent Itemset Mining In Very Large Datasets

Authors : Khanh-Chuong Duong, Mostafa Bamha, Arnaud Giacometti, Dominique Li, Arnaud Soulet, Christel Vrain

Published in: Transactions on Large-Scale Data- and Knowledge-Centered Systems XXXIX

Publisher: Springer Berlin Heidelberg

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

Mining frequent itemsets in large datasets has received much attention in recent years relying on MapReduce programming model. For instance, many efficient Frequent Itemset Mining (a.k.a. FIM) algorithms have been parallelized to MapReduce principle such as Parallel Apriori, Parallel FP-Growth and Dist-Eclat. However, most approaches focus on job partitioning and/or load balancing without considering the extensibility depending on required memory assumptions. Thus, a challenge in designing parallel FIM algorithms consists therefore in finding ways to guarantee that data structures used during the mining process always fit in the local memory of processing nodes during all computation steps. In this paper, we propose MapFIM+, a two-phase approach to frequent itemset mining in very large datasets benefiting both from a MapReduce-based distributed Apriori method and local in-memory FIM methods. In our approach, MapReduce is first used to generate frequent itemsets until getting local memory-fitted prefix-projected databases, and an optimized local in-memory mining process is then launched to generate all remaining frequent itemsets from each prefix-projected database on individual processing nodes. Indeed, MapFIM+ improves our previous algorithm MapFIM by using an exact evaluation of prefix-projected database sizes during the MapReduce phase. This improvement makes MapFIM+ more efficient, especially for databases leading to huge candidate sets, by significantly reducing communication and disk I/O costs. Performance evaluation shows that MapFIM+ is more efficient and more extensible than existing MapReduce based frequent itemset mining approaches.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Footnotes
1
In our work, in order to generate each candidate once, we use a prefix-based join operation. More precisely, given two sets of frequent k-itemsets \(\mathcal {L}_{k}\) and \(\mathcal {L}'_{k}\), the join of \(\mathcal {L}_{k}\) and \(\mathcal {L}'_{k}\) is defined by: \(\mathcal {L}_{k} \bowtie \mathcal {L}'_{k} = \{ (i_1, \dots , i_k,i_{k+1}) ~|~ (i_1, \dots , i_{k-1}, i_{k}) \in \mathcal {L}_{k} \wedge (i_1, \dots , i_{k-1},i_{k+1}) \in \mathcal {L}'_{k} \wedge i_1< \dots< i_{k} < i_{k+1} \}\).
 
2
In our configuration, there is no real difference of performance between Hadoop 1.2.1 and Hadoop 2.7.3.
 
Literature
2.
go back to reference Agrawal, R., Srikant, R., et al.: Fast algorithms for mining association rules. In: Proceedings of VLDB 1994, vol. 1215, pp. 487–499 (1994) Agrawal, R., Srikant, R., et al.: Fast algorithms for mining association rules. In: Proceedings of VLDB 1994, vol. 1215, pp. 487–499 (1994)
3.
go back to reference Al Hajj Hassan, M., Bamha, M.: Towards scalability and data skew handling in groupby-joins using MapReduce model. In: Proceedings of ICCS 2015, pp. 70–79 (2015)CrossRef Al Hajj Hassan, M., Bamha, M.: Towards scalability and data skew handling in groupby-joins using MapReduce model. In: Proceedings of ICCS 2015, pp. 70–79 (2015)CrossRef
4.
go back to reference Al Hajj Hassan, M., Bamha, M., Loulergue, F.: Handling data-skew effects in join operations using MapReduce. In: Proceedings of ICCS 2014, pp. 145–158. IEEE (2014)CrossRef Al Hajj Hassan, M., Bamha, M., Loulergue, F.: Handling data-skew effects in join operations using MapReduce. In: Proceedings of ICCS 2014, pp. 145–158. IEEE (2014)CrossRef
5.
go back to reference Beedkar, K., Berberich, K., Gemulla, R., Miliaraki, I.: Closing the gap: sequence mining at scale. ACM Trans. Database Syst. 40(2), 8:1–8:44 (2015)MathSciNetCrossRef Beedkar, K., Berberich, K., Gemulla, R., Miliaraki, I.: Closing the gap: sequence mining at scale. ACM Trans. Database Syst. 40(2), 8:1–8:44 (2015)MathSciNetCrossRef
6.
go back to reference Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)CrossRef Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)CrossRef
7.
go back to reference Duong, K.-C., Bamha, M., Giacometti, A., Li, D., Soulet, A., Vrain, C.: MapFIM: memory aware parallelized frequent itemset mining in very large datasets. In: Benslimane, D., Damiani, E., Grosky, W.I., Hameurlain, A., Sheth, A., Wagner, R.R. (eds.) DEXA 2017. LNCS, vol. 10438, pp. 478–495. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-64468-4_36CrossRef Duong, K.-C., Bamha, M., Giacometti, A., Li, D., Soulet, A., Vrain, C.: MapFIM: memory aware parallelized frequent itemset mining in very large datasets. In: Benslimane, D., Damiani, E., Grosky, W.I., Hameurlain, A., Sheth, A., Wagner, R.R. (eds.) DEXA 2017. LNCS, vol. 10438, pp. 478–495. Springer, Cham (2017). https://​doi.​org/​10.​1007/​978-3-319-64468-4_​36CrossRef
8.
go back to reference Li, H., Wang, Y., Zhang, D., Zhang, M., Chang, E.Y: PFP: parallel FP-growth for query recommendation. In: Proceedings of RecSys 2008, pp. 107–114. ACM (2008) Li, H., Wang, Y., Zhang, D., Zhang, M., Chang, E.Y: PFP: parallel FP-growth for query recommendation. In: Proceedings of RecSys 2008, pp. 107–114. ACM (2008)
9.
go back to reference Li, N., Zeng, L., He, Q., Shi, Z.: Parallel implementation of apriori algorithm based on MapReduce. In: Proceedings of SNDP 2012, pp. 236–241. IEEE (2012) Li, N., Zeng, L., He, Q., Shi, Z.: Parallel implementation of apriori algorithm based on MapReduce. In: Proceedings of SNDP 2012, pp. 236–241. IEEE (2012)
10.
go back to reference Lin, M.-Y., Lee, P.-Y., Hsueh, S.-C.: Apriori-based frequent itemset mining algorithms on MapReduce. In Proceedings of ICUIMC 2012, pp. 76:1–76:8 (2012) Lin, M.-Y., Lee, P.-Y., Hsueh, S.-C.: Apriori-based frequent itemset mining algorithms on MapReduce. In Proceedings of ICUIMC 2012, pp. 76:1–76:8 (2012)
11.
go back to reference Lucchese, C., Orlando, S., Perego, R., Silvestri, F.: WebDocs: a real-life huge transactional dataset. In: FIMI, vol. 126 (2004) Lucchese, C., Orlando, S., Perego, R., Silvestri, F.: WebDocs: a real-life huge transactional dataset. In: FIMI, vol. 126 (2004)
12.
go back to reference Apache Mahout: Scalable machine learning and data mining (2012) Apache Mahout: Scalable machine learning and data mining (2012)
13.
go back to reference Makanju, A., Farzanyar, Z., An, A., Cercone, N., Hu Z.Z., Hu, Y.: Deep parallelization of parallel FP-growth using parent-child MapReduce. In: Proceedings of BigData 2016, pp. 1422–1431. IEEE (2016) Makanju, A., Farzanyar, Z., An, A., Cercone, N., Hu Z.Z., Hu, Y.: Deep parallelization of parallel FP-growth using parent-child MapReduce. In: Proceedings of BigData 2016, pp. 1422–1431. IEEE (2016)
14.
go back to reference Miliaraki, I., Berberich, K., Gemulla, R., Zoupanos, S.: Mind the gap: large-scale frequent sequence mining. In: Proceedings of SIGMOD 2013, pp. 797–808. ACM (2013) Miliaraki, I., Berberich, K., Gemulla, R., Zoupanos, S.: Mind the gap: large-scale frequent sequence mining. In: Proceedings of SIGMOD 2013, pp. 797–808. ACM (2013)
15.
go back to reference Moens, S., Aksehirli, E., Goethals, B.: Frequent itemset mining for big data. In Proceedings of BigData 2013, pp. 111–118. IEEE (2013) Moens, S., Aksehirli, E., Goethals, B.: Frequent itemset mining for big data. In Proceedings of BigData 2013, pp. 111–118. IEEE (2013)
18.
go back to reference Savasere, A., Omiecinski, E., Navathe, S.B.: An efficient algorithm for mining association rules in large databases. In: Proceedings of VLDB 1995, pp. 432–444. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (1995) Savasere, A., Omiecinski, E., Navathe, S.B.: An efficient algorithm for mining association rules in large databases. In: Proceedings of VLDB 1995, pp. 432–444. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (1995)
19.
go back to reference Uno, T., Asai, T., Uchida, Y., Arimura, H.: LCM: an efficient algorithm for enumerating frequent closed item sets. In: FIMI, vol. 90. Citeseer (2003) Uno, T., Asai, T., Uchida, Y., Arimura, H.: LCM: an efficient algorithm for enumerating frequent closed item sets. In: FIMI, vol. 90. Citeseer (2003)
20.
go back to reference Zaki, M.J., Parthasarathy, S., Ogihara, M., Li, W., et al.: New algorithms for fast discovery of association rules. In: KDD, vol. 97, pp. 283–286 (1997) Zaki, M.J., Parthasarathy, S., Ogihara, M., Li, W., et al.: New algorithms for fast discovery of association rules. In: KDD, vol. 97, pp. 283–286 (1997)
21.
go back to reference Zhou, L., Zhong, Z., Chang, J., Li, J., Huang, J.Z., Feng, S.: Balanced parallel FP-growth with MapReduce. In: Proceedings of YC-ICT 2010, pp. 243–246 (2010) Zhou, L., Zhong, Z., Chang, J., Li, J., Huang, J.Z., Feng, S.: Balanced parallel FP-growth with MapReduce. In: Proceedings of YC-ICT 2010, pp. 243–246 (2010)
Metadata
Title
MapFIM+: Memory Aware Parallelized Frequent Itemset Mining In Very Large Datasets
Authors
Khanh-Chuong Duong
Mostafa Bamha
Arnaud Giacometti
Dominique Li
Arnaud Soulet
Christel Vrain
Copyright Year
2018
Publisher
Springer Berlin Heidelberg
DOI
https://doi.org/10.1007/978-3-662-58415-6_7

Premium Partner