Knowledge and Information Systems

3 April 2017 | Regular Paper

A formal series-based unification of the frequent itemset mining approaches

Authors: Slimane Oulad-Naoui, Hadda Cherroun, Djelloul Ziadi

Published in: Knowledge and Information Systems | Issue 2/2017

Abstract

Over the last two decades, a great deal of work has been devoted to the algorithmic aspects of the frequent itemset (FI) mining problem, leading to a phenomenal number of algorithms and associated implementations, each of which claims supremacy. Meanwhile, it is widely agreed that developing a unifying theory is one of the most important issues in data mining research. Hence, our primary motivation for this work is to introduce a high-level formalism for this basic problem, which induces a unified vision of the algorithmic approaches presented so far. The key distinctive feature of the introduced model is that it combines, in a single framework, both the qualitative and the quantitative aspects of this basic problem. In this paper, we propose a new model for the FI-mining task based on formal series. Specifically, we encode the itemsets as words over a sorted alphabet and express this problem by a formal series over the counting semiring \((\mathbb N,+,\times ,0,1)\), whose range represents the itemsets and whose coefficients are their supports. The aim is threefold: First, to define a clear, unified and extensible theoretical framework through which we can state the main FI-approaches. Second, to prove a convenient connection between the determinization of the acyclic weighted automaton that represents a transaction dataset and the computation of the associated collection of FIs. Finally, to devise a first algorithmic transcription, named Wafi, of our model by means of weighted automata, which we evaluate against representative leading algorithms. The obtained results show the suitability of our formalism.
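The encoding described in the abstract can be illustrated with a minimal sketch. It is a brute-force illustration only, not the Wafi algorithm and not the authors' implementation: each transaction is turned into a word over a sorted alphabet, every sub-word (itemset) contributes 1, and contributions are added in the counting semiring, so the resulting coefficient of a word is its support. The function names `itemset_series` and `frequent_itemsets`, the toy dataset, and the choice of `>=` for the support threshold are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def itemset_series(transactions):
    """Map each itemset (a word over the sorted alphabet) to its support.

    Hypothetical helper: enumerates every non-empty sub-itemset of every
    transaction, which is exponential in the transaction length.
    """
    series = Counter()
    for transaction in transactions:
        # Encode the transaction as a word: distinct items in sorted order.
        word = sorted(set(transaction))
        for size in range(1, len(word) + 1):
            for itemset in combinations(word, size):
                # Addition in the counting semiring (N, +, x, 0, 1).
                series[itemset] += 1
    return series


def frequent_itemsets(transactions, min_support):
    """Keep the terms whose coefficient reaches min_support (assumed >=)."""
    series = itemset_series(transactions)
    return {w: c for w, c in series.items() if c >= min_support}


if __name__ == "__main__":
    # Toy dataset, invented for illustration.
    toy = [["a", "b", "c"], ["a", "c"], ["b", "c"], ["c", "b", "a"]]
    for itemset, support in sorted(frequent_itemsets(toy, 3).items()):
        print("".join(itemset), support)
```

On this toy dataset with a threshold of 3, the retained terms are a, b, c, ac and bc; the itemsets ab and abc each have support 2 and are dropped. Sorting the items first is what makes each itemset correspond to exactly one word.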

In the counting semiring and by application of the \(\otimes \) operation in general.

In our examples throughout the paper, we assume for simplicity that items are sorted in lexicographic order.

In our model, an accessible frequent state is a state that is reachable from the initial state, with or without \(\epsilon \)-moves, and for which the coefficient of the associated path from the initial state is greater than the support threshold.

The direction of the derivation does not matter: either direction usually yields the same final coefficient. However, the number of steps needed may differ, depending on the defined ordering and the given dataset.

To be precise: \(|E| = |Q|-1\).

Title: A formal series-based unification of the frequent itemset mining approaches
Authors: Slimane Oulad-Naoui, Hadda Cherroun, Djelloul Ziadi
Publication date: 3 April 2017
Publisher: Springer London
Published in: Knowledge and Information Systems / Issue 2/2017
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI: https://doi.org/10.1007/s10115-017-1048-y
