Knowledge and Information Systems

22-03-2016 | Regular paper

A highly scalable parallel algorithm for maximally informative k-itemset mining

Authors: Saber Salah, Reza Akbarinia, Florent Masseglia

Published in: Knowledge and Information Systems | Issue 1/2017

Abstract

The discovery of informative itemsets is a fundamental building block in data analytics and information retrieval. While the problem has been widely studied, only a few solutions scale. This is particularly the case when (1) the data set is massive, calling for large-scale distribution, and/or (2) the length k of the informative itemset to be discovered is high. In this paper, we address the problem of parallel mining of maximally informative k-itemsets (miki) based on joint entropy. We propose PHIKS (Parallel Highly Informative \(\underline{K}\)-ItemSet), a highly scalable, parallel miki mining algorithm. PHIKS renders the mining process of large-scale databases (up to terabytes of data) succinct and effective. Its mining process is made up of only two efficient parallel jobs. With PHIKS, we provide a set of significant optimizations for calculating the joint entropies of miki having different sizes, which drastically reduce the execution time, the communication cost and the energy consumption in a distributed computational platform. PHIKS has been extensively evaluated using massive real-world data sets. Our experimental results confirm the effectiveness of our proposal by the significant scale-up obtained with high itemset length and over very large databases.
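To make the notion of a maximally informative k-itemset concrete, the following minimal sketch computes the joint entropy of an itemset, treating each item as a binary presence/absence feature, and finds a miki by exhaustive search. It is not PHIKS and makes no attempt at parallelism or pruning; the function names `joint_entropy` and `brute_force_miki` and the toy `transactions` data are assumptions introduced purely for illustration.

```python
from collections import Counter
from itertools import combinations
from math import log2


def joint_entropy(itemset, transactions):
    """Joint entropy (in bits) of the presence/absence features in `itemset`."""
    n = len(transactions)
    # Project every transaction onto the itemset: one presence bit per item.
    projections = Counter(
        tuple(item in t for item in itemset) for t in transactions
    )
    return -sum((c / n) * log2(c / n) for c in projections.values())


def brute_force_miki(transactions, k):
    """Return one k-itemset with maximal joint entropy (exhaustive search)."""
    items = sorted(set().union(*transactions))
    return max(
        combinations(items, k),
        key=lambda s: joint_entropy(s, transactions),
    )


# Hypothetical toy database: item "c" occurs everywhere, so it carries no information.
transactions = [{"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"c"}]

best = brute_force_miki(transactions, k=2)
print(best, joint_entropy(best, transactions))
```

On this toy data the pair `a`, `b` splits the four transactions into four distinct presence patterns and so reaches 2 bits, whereas any pair containing `c` reaches only 1 bit. The exhaustive search enumerates every k-combination of items, which is exactly the cost that becomes prohibitive for large k and massive data sets.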

Title: A highly scalable parallel algorithm for maximally informative k-itemset mining
Authors: Saber Salah, Reza Akbarinia, Florent Masseglia
Publication date: 22-03-2016
Publisher: Springer London
Published in: Knowledge and Information Systems / Issue 1/2017
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI: https://doi.org/10.1007/s10115-016-0931-2
