Skip to main content
Erschienen in: Knowledge and Information Systems 2/2017

03.04.2017 | Regular Paper

A formal series-based unification of the frequent itemset mining approaches

verfasst von: Slimane Oulad-Naoui, Hadda Cherroun, Djelloul Ziadi

Erschienen in: Knowledge and Information Systems | Ausgabe 2/2017

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Over the last two decades, a great deal of work has been devoted to the algorithmic aspects of the frequent itemset (FI) mining problem, leading to a phenomenal number of algorithms and associated implementations, each of which claims supremacy. Meanwhile, it is generally well agreed that developing a unifying theory is one of the most important issues in data mining research. Hence, our primary motivation for this work is to introduce a high-level formalism for this basic problem, which induces a unified vision of the algorithmic approaches presented so far. The key distinctive feature of the introduced model is that it combines, in one fashion, both the qualitative and the quantitative aspects of this basic problem. In this paper, we propose a new model for the FI-mining task based on formal series. In fact, we encode the itemsets as words over a sorted alphabet and express this problem by a formal series over the counting semiring \((\mathbb N,+,\times ,0,1)\), whose range represents the itemsets, and the coefficients are their supports. The aim is threefold: First, to define a clear, unified and extensible theoretical framework through which we can state the main FI-approaches. Second, to prove a convenient connection between the determinization of the acyclic weighted automaton that represents a transaction dataset and the computation of the associated collection of FI. Finally, to devise a first algorithmic transcription, baptized Wafi, of our model by means of weighted automata, which we evaluate against representative leading algorithms. The obtained results show the suitability of our formalism.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Anhänge
Nur mit Berechtigung zugänglich
Fußnoten
1
In the counting semiring and by application of the \(\otimes \) operation in general.
 
2
In our examples throughout the paper, we consider for easiness that items are sorted according to their lexicographic order.
 
3
In our model, an accessible frequent state is a state reachable, using or not \(\epsilon \)-moves, from the initial state, for which the corresponding coefficient of the associated path from the initial state is also greater than the support threshold.
 
4
The sense of the derivation does not matter and usually yields the same final coefficient. However, the number of steps needed may be different; it depends on the defined ordering and the given dataset.
 
5
To be precise: \(|E| = |Q|-1\).
 
Literatur
1.
Zurück zum Zitat Agrawal R, Imielinski T, Swami A (1993) Mining association rules between sets of items in large databases. In: Proceedings of the 1993 ACM SIGMOD international conference on management of data, Washington DC, USA, pp 207–216 Agrawal R, Imielinski T, Swami A (1993) Mining association rules between sets of items in large databases. In: Proceedings of the 1993 ACM SIGMOD international conference on management of data, Washington DC, USA, pp 207–216
2.
Zurück zum Zitat Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: VLDB’94, proceedings of 20th international conference on very large data bases, 12–15 Sept 1994, Santiago de Chile, Chile, pp 487–499. http://www.vldb.org/conf/1994/P487.PDF Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: VLDB’94, proceedings of 20th international conference on very large data bases, 12–15 Sept 1994, Santiago de Chile, Chile, pp 487–499. http://​www.​vldb.​org/​conf/​1994/​P487.​PDF
4.
Zurück zum Zitat Goethals B, Zaki MJ (eds) (2003) FIMI ’03, In: Proceedings of the workshop on FIM Implementations, Melbourne, Florida, USA. CEUR workshop proceedings, vol. 90 Goethals B, Zaki MJ (eds) (2003) FIMI ’03, In: Proceedings of the workshop on FIM Implementations, Melbourne, Florida, USA. CEUR workshop proceedings, vol. 90
8.
Zurück zum Zitat Zaki MJ (2000) Scalable algorithms for association mining. IEEE Trans Knowl Data Eng 12(3):372–390 Zaki MJ (2000) Scalable algorithms for association mining. IEEE Trans Knowl Data Eng 12(3):372–390
9.
Zurück zum Zitat Zaki MJ, Gouda K (2003) Fast vertical mining using diffsets. In: Proceedings of the Ninth ACM SIGKDD international conference on knowledge discovery and data mining, Washington, DC, USA, 24–27 Aug 2003, pp 326–335. doi:10.1145/956750.956788 Zaki MJ, Gouda K (2003) Fast vertical mining using diffsets. In: Proceedings of the Ninth ACM SIGKDD international conference on knowledge discovery and data mining, Washington, DC, USA, 24–27 Aug 2003, pp 326–335. doi:10.​1145/​956750.​956788
10.
Zurück zum Zitat Han J, Pei J, Yin Y (2000) Mining frequent patterns without candidate generation. In: Proceedings of the 2000 ACM SIGMOD international conference on management of data, 16–18 May 2000, Dallas, Texas, USA, pp 1–12. doi:10.1145/342009.335372 Han J, Pei J, Yin Y (2000) Mining frequent patterns without candidate generation. In: Proceedings of the 2000 ACM SIGMOD international conference on management of data, 16–18 May 2000, Dallas, Texas, USA, pp 1–12. doi:10.​1145/​342009.​335372
11.
Zurück zum Zitat Bayardo R (1998) Efficiently mining long patterns from databases. In: SIGMOD 1998, proceedings ACM SIGMOD international conference on management of data, 2–4 June 1998, Seattle, Washington, USA, pp 85–93. doi:10.1145/276304.276313 Bayardo R (1998) Efficiently mining long patterns from databases. In: SIGMOD 1998, proceedings ACM SIGMOD international conference on management of data, 2–4 June 1998, Seattle, Washington, USA, pp 85–93. doi:10.​1145/​276304.​276313
18.
Zurück zum Zitat Zaki MJ, Ogihara M (1998) Theoretical foundations of association rules. In: 3rd ACM SIGMOD workshop on research issues in data mining and knowledge discovery, June 1998 Zaki MJ, Ogihara M (1998) Theoretical foundations of association rules. In: 3rd ACM SIGMOD workshop on research issues in data mining and knowledge discovery, June 1998
20.
Zurück zum Zitat Hoare T (1996) Unification of theories: a challenge for computing science. In: Haveraaen M, Owe O, Dahl O-J (eds) Recent trends in data type specification, 11th workshop on specification of abstract data types joint with the 8th COMPASS workshop, Oslo, Norway, 19–23 Sept 1995, selected papers, Springer, Berlin, Heidelberg, pp 49–57 Hoare T (1996) Unification of theories: a challenge for computing science. In: Haveraaen M, Owe O, Dahl O-J (eds) Recent trends in data type specification, 11th workshop on specification of abstract data types joint with the 8th COMPASS workshop, Oslo, Norway, 19–23 Sept 1995, selected papers, Springer, Berlin, Heidelberg, pp 49–57
21.
Zurück zum Zitat Oulad-Naoui S, Cherroun H, Ziadi D (2015) A unifying polynomial model for efficient discovery of frequent itemsets. In: Proceedings of 4th international conference on data management technologies and applications, pp 49–59. doi:10.5220/0005516200490059 Oulad-Naoui S, Cherroun H, Ziadi D (2015) A unifying polynomial model for efficient discovery of frequent itemsets. In: Proceedings of 4th international conference on data management technologies and applications, pp 49–59. doi:10.​5220/​0005516200490059​
24.
Zurück zum Zitat Hopcroft JE, Motwani R, Ullman JD (2001) Introduction to automata theory, languages, and computation–Addison-Wesley series in computer science, 2nd edn. Addison-Wesley-Longman, LodonMATH Hopcroft JE, Motwani R, Ullman JD (2001) Introduction to automata theory, languages, and computation–Addison-Wesley series in computer science, 2nd edn. Addison-Wesley-Longman, LodonMATH
25.
Zurück zum Zitat Pin J-E (1988) Tropical semirings. In: Gunawardena J (ed) Idempotency. Cambridge University Press, Cambridge, pp 50–69 Pin J-E (1988) Tropical semirings. In: Gunawardena J (ed) Idempotency. Cambridge University Press, Cambridge, pp 50–69
26.
Zurück zum Zitat Cheung W, Zaïane OR (2003) Incremental mining of frequent patterns without candidate generation or support constraint. In: 7th International database engineering and applications symposium (IDEAS 2003), July 16–18 2003, Hong Kong, China, pp 111–116. doi:10.1109/IDEAS.2003.1214917 Cheung W, Zaïane OR (2003) Incremental mining of frequent patterns without candidate generation or support constraint. In: 7th International database engineering and applications symposium (IDEAS 2003), July 16–18 2003, Hong Kong, China, pp 111–116. doi:10.​1109/​IDEAS.​2003.​1214917
27.
Zurück zum Zitat Goethals B (2004) Memory issues in frequent itemset mining. In: Proceedings of the 2004 ACM symposium on applied computing (SAC), Nicosia, Cyprus, 14-17 March 2004, pp 530–534 Goethals B (2004) Memory issues in frequent itemset mining. In: Proceedings of the 2004 ACM symposium on applied computing (SAC), Nicosia, Cyprus, 14-17 March 2004, pp 530–534
33.
Zurück zum Zitat Mohri M (2009) Weighted automata algorithms. In: Droste M, Kuich W, Vogler H (eds) Handbook of weighted automata, monographs in theoretical computer science. An EATCS series. Springer, Berlin, pp 213–254. doi:10.1007/978-3-642-01492-5_6 Mohri M (2009) Weighted automata algorithms. In: Droste M, Kuich W, Vogler H (eds) Handbook of weighted automata, monographs in theoretical computer science. An EATCS series. Springer, Berlin, pp 213–254. doi:10.​1007/​978-3-642-01492-5_​6
34.
Zurück zum Zitat Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng AFM, Liu B, Yu PS, Steinbach Zhou Z-H, M, Hand DJ, Steinberg D, (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37. doi:10.1007/s10115-007-0114-2 Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng AFM, Liu B, Yu PS, Steinbach Zhou Z-H, M, Hand DJ, Steinberg D, (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37. doi:10.​1007/​s10115-007-0114-2
36.
Zurück zum Zitat Lv Deng Z-H, S-L, (2015) Prepost\({}^{\text{+}}\): an efficient n-lists-based algorithm for mining frequent itemsets via children-parent equivalence pruning. Expert Syst Appl 42(13):5424–5432. doi:10.1016/j.eswa.2015.03.004 Lv Deng Z-H, S-L, (2015) Prepost\({}^{\text{+}}\): an efficient n-lists-based algorithm for mining frequent itemsets via children-parent equivalence pruning. Expert Syst Appl 42(13):5424–5432. doi:10.​1016/​j.​eswa.​2015.​03.​004
37.
Zurück zum Zitat Cohen E, Halperin E, Kaplan H, Zwick U (2002) Reachability and distance queries via 2-hop labels. In: Proceedings of the thirteenth annual ACM-SIAM symposium on discrete algorithms, SODA ’02. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA pp 937–946 Cohen E, Halperin E, Kaplan H, Zwick U (2002) Reachability and distance queries via 2-hop labels. In: Proceedings of the thirteenth annual ACM-SIAM symposium on discrete algorithms, SODA ’02. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA pp 937–946
39.
40.
Zurück zum Zitat Cormen TH, Leiserson CE, Rivest RL, Stein C (2009) Introduction to algorithms, 3rd edn. The MIT Press, BostonMATH Cormen TH, Leiserson CE, Rivest RL, Stein C (2009) Introduction to algorithms, 3rd edn. The MIT Press, BostonMATH
43.
Zurück zum Zitat Fournier-Viger P, Lin JC-W, Gomariz A, Gueniche T, Soltani A, Deng Z, Lam HT (2016) The SPMF open-source data mining library version 2. Proceedings of 19th European Conference on Principles of Data Mining and Knowledge Discovery PKDD 2016, pp 36–40 Fournier-Viger P, Lin JC-W, Gomariz A, Gueniche T, Soltani A, Deng Z, Lam HT (2016) The SPMF open-source data mining library version 2. Proceedings of 19th European Conference on Principles of Data Mining and Knowledge Discovery PKDD 2016, pp 36–40
44.
Zurück zum Zitat Rácz B, Bodon F, Schmidt-Thieme L (2005) On benchmarking frequent itemset mining algorithms: From measurement to analysis. In: Proceedings of the 1st international workshop on open source data mining: frequent pattern mining implementations, OSDM ’05, ACM, New York, NY, USA, pp 36–45. doi:10.1145/1133905.1133911 Rácz B, Bodon F, Schmidt-Thieme L (2005) On benchmarking frequent itemset mining algorithms: From measurement to analysis. In: Proceedings of the 1st international workshop on open source data mining: frequent pattern mining implementations, OSDM ’05, ACM, New York, NY, USA, pp 36–45. doi:10.​1145/​1133905.​1133911
Metadaten
Titel
A formal series-based unification of the frequent itemset mining approaches
verfasst von
Slimane Oulad-Naoui
Hadda Cherroun
Djelloul Ziadi
Publikationsdatum
03.04.2017
Verlag
Springer London
Erschienen in
Knowledge and Information Systems / Ausgabe 2/2017
Print ISSN: 0219-1377
Elektronische ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-017-1048-y

Weitere Artikel der Ausgabe 2/2017

Knowledge and Information Systems 2/2017 Zur Ausgabe