Skip to main content
Top
Published in: Knowledge and Information Systems 1/2017

22-03-2016 | Regular paper

A highly scalable parallel algorithm for maximally informative k-itemset mining

Authors: Saber Salah, Reza Akbarinia, Florent Masseglia

Published in: Knowledge and Information Systems | Issue 1/2017

Log in

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

The discovery of informative itemsets is a fundamental building block in data analytics and information retrieval. While the problem has been widely studied, only few solutions scale. This is particularly the case when (1) the data set is massive, calling for large-scale distribution, and/or (2) the length k of the informative itemset to be discovered is high. In this paper, we address the problem of parallel mining of maximally informative k-itemsets (miki) based on joint entropy. We propose PHIKS (Parallel Highly Informative \(\underline{K}\)-ItemSet), a highly scalable, parallel miki mining algorithm. PHIKS renders the mining process of large-scale databases (up to terabytes of data) succinct and effective. Its mining process is made up of only two efficient parallel jobs. With PHIKS, we provide a set of significant optimizations for calculating the joint entropies of miki having different sizes, which drastically reduces the execution time, the communication cost and the energy consumption, in a distributed computational platform. PHIKS has been extensively evaluated using massive real-world data sets. Our experimental results confirm the effectiveness of our proposal by the significant scale-up obtained with high itemsets length and over very large databases.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Literature
1.
go back to reference Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: Proceedings of international conference on very large data bases (VLDB), pp 487–499 Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: Proceedings of international conference on very large data bases (VLDB), pp 487–499
3.
go back to reference Anand R (2012) Mining of massive datasets. Cambridge University Press, New York Anand R (2012) Mining of massive datasets. Cambridge University Press, New York
4.
go back to reference Berberich K, Bedathur S (2013) Computing n-gram statistics in mapreduce. In: Proceedings of the 16th international conference on extending database technology (EDBT), pp 101–112 Berberich K, Bedathur S (2013) Computing n-gram statistics in mapreduce. In: Proceedings of the 16th international conference on extending database technology (EDBT), pp 101–112
5.
go back to reference Berry M (2008) Survey of text mining II clustering, classification, and retrieval. Springer, New YorkCrossRef Berry M (2008) Survey of text mining II clustering, classification, and retrieval. Springer, New YorkCrossRef
6.
go back to reference Bizer C, Boncz PA, Brodie ML, Erling O (2011) The meaningful use of big data: four perspectives—four challenges. SIGMOD Rec 40(4):56–60CrossRef Bizer C, Boncz PA, Brodie ML, Erling O (2011) The meaningful use of big data: four perspectives—four challenges. SIGMOD Rec 40(4):56–60CrossRef
8.
go back to reference Chandrashekar G, Sahin F (2014) A survey on feature selection methods. Comput Elect Eng 40(1):16–28CrossRef Chandrashekar G, Sahin F (2014) A survey on feature selection methods. Comput Elect Eng 40(1):16–28CrossRef
9.
10.
go back to reference Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113CrossRef Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113CrossRef
12.
go back to reference Hastie T (2009) The elements of statistical learning: data mining, inference, and prediction. Springer, New York. ISBN: 978-0387848570 Hastie T (2009) The elements of statistical learning: data mining, inference, and prediction. Springer, New York. ISBN: 978-0387848570
15.
go back to reference Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182MATH Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182MATH
17.
19.
go back to reference Heikinheimo H, Hinkkanen E, Mannila H, Mielikäinen T, Seppänen JK (2007) Finding low-entropy sets and trees from binary data. In: Proceedings of ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 350–359 Heikinheimo H, Hinkkanen E, Mannila H, Mielikäinen T, Seppänen JK (2007) Finding low-entropy sets and trees from binary data. In: Proceedings of ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 350–359
21.
go back to reference Knobbe AJ, Ho EKY (2006) Maximally informative k-itemsets and their efficient discovery. In: Proceedings of ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 237–244 Knobbe AJ, Ho EKY (2006) Maximally informative k-itemsets and their efficient discovery. In: Proceedings of ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 237–244
22.
go back to reference Kotsiantis SB (2007) Supervised machine learning: a review of classification techniques. In: Proceedings of international conference on emerging artificial intelligence applications in computer engineering, pp 3–24 Kotsiantis SB (2007) Supervised machine learning: a review of classification techniques. In: Proceedings of international conference on emerging artificial intelligence applications in computer engineering, pp 3–24
23.
go back to reference Li H, Wang Y, Zhang D, Zhang M, Chang EY (2008) Pfp: parallel fp-growth for query recommendation. In Proceedings of the ACM conference on recommender systems (RecSys), pp 107–114 Li H, Wang Y, Zhang D, Zhang M, Chang EY (2008) Pfp: parallel fp-growth for query recommendation. In Proceedings of the ACM conference on recommender systems (RecSys), pp 107–114
24.
go back to reference Miliaraki I, Berberich K, Gemulla R, Zoupanos S (2013) Mind the gap: Large-scale frequent sequence mining. In: Proceedings of the 2013 ACM SIGMOD international conference on management of data (SIGMOD), pp 797–808 Miliaraki I, Berberich K, Gemulla R, Zoupanos S (2013) Mind the gap: Large-scale frequent sequence mining. In: Proceedings of the 2013 ACM SIGMOD international conference on management of data (SIGMOD), pp 797–808
25.
go back to reference Moens S, Aksehirli E, Goethals B ( 2013) Frequent itemset mining for big data. In: IEEE international conference on big data, pp 111–118 Moens S, Aksehirli E, Goethals B ( 2013) Frequent itemset mining for big data. In: IEEE international conference on big data, pp 111–118
26.
go back to reference Riondato M, DeBrabant JA, Fonseca R, Upfal E (2012) Parma: a parallel randomized algorithm for approximate association rules mining in mapreduce. In: 21st ACM international conference on information and knowledge management (CIKM), pp 85–94 Riondato M, DeBrabant JA, Fonseca R, Upfal E (2012) Parma: a parallel randomized algorithm for approximate association rules mining in mapreduce. In: 21st ACM international conference on information and knowledge management (CIKM), pp 85–94
27.
go back to reference Savasere A, Omiecinski E, Navathe SB ( 1995) An efficient algorithm for mining association rules in large databases. In: Proceedings of international conference on very large data bases (VLDB), pp 432–444 Savasere A, Omiecinski E, Navathe SB ( 1995) An efficient algorithm for mining association rules in large databases. In: Proceedings of international conference on very large data bases (VLDB), pp 432–444
28.
go back to reference Tanbeer S, Ahmed C, Jeong B-S ( 2009) Parallel and distributed frequent pattern mining in large databases. In: 11th IEEE international conference on high performance computing and communications (HPCC), pp 407–414 Tanbeer S, Ahmed C, Jeong B-S ( 2009) Parallel and distributed frequent pattern mining in large databases. In: 11th IEEE international conference on high performance computing and communications (HPCC), pp 407–414
29.
go back to reference Tatti N (2010) Probably the best itemsets. In: Proceedings of the 16th ACM SIGKDD international conference on knowledge discovery and data mining, Washington, DC, USA, July 25-28, 2010, pp 293–302. doi:10.1145/1835804.1835843 Tatti N (2010) Probably the best itemsets. In: Proceedings of the 16th ACM SIGKDD international conference on knowledge discovery and data mining, Washington, DC, USA, July 25-28, 2010, pp 293–302. doi:10.​1145/​1835804.​1835843
30.
go back to reference Teng W-G, Chen M-S, Yu PS ( 2003) A regression-based temporal pattern mining scheme for data streams. In: Proceedings of international conference on very large data bases (VLDB), pp 93–104 Teng W-G, Chen M-S, Yu PS ( 2003) A regression-based temporal pattern mining scheme for data streams. In: Proceedings of international conference on very large data bases (VLDB), pp 93–104
32.
go back to reference White T (2012) Hadoop: the definitive guide. O’Reilly, California White T (2012) Hadoop: the definitive guide. O’Reilly, California
33.
go back to reference Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I (2010) Spark: cluster computing with working sets. In: Proceedings of the 2nd USENIX conference on hot topics in cloud computing, p 10 Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I (2010) Spark: cluster computing with working sets. In: Proceedings of the 2nd USENIX conference on hot topics in cloud computing, p 10
Metadata
Title
A highly scalable parallel algorithm for maximally informative k-itemset mining
Authors
Saber Salah
Reza Akbarinia
Florent Masseglia
Publication date
22-03-2016
Publisher
Springer London
Published in
Knowledge and Information Systems / Issue 1/2017
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-016-0931-2

Other articles of this Issue 1/2017

Knowledge and Information Systems 1/2017 Go to the issue

Premium Partner