Skip to main content
Erschienen in: Knowledge and Information Systems 2/2017

22.03.2017 | Regular Paper

Time-weighted counting for recently frequent pattern mining in data streams

verfasst von: Yongsub Lim, U. Kang

Erschienen in: Knowledge and Information Systems | Ausgabe 2/2017

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

How can we discover interesting patterns from time-evolving high-speed data streams? How to analyze the data streams quickly and accurately, with little space overhead? How to guarantee the found patterns to be self-consistent? High-speed data stream has been receiving increasing attention due to its wide applications such as sensors, network traffic, social networks, etc. The most fundamental task on the data stream is frequent pattern mining; especially, focusing on recentness is important in real applications. In this paper, we develop two algorithms for discovering recently frequent patterns in data streams. First, we propose TwMinSwap to find top-k recently frequent items in data streams, which is a deterministic version of our motivating algorithm TwSample providing theoretical guarantees based on item sampling. TwMinSwap improves TwSample in terms of speed, accuracy, and memory usage. Both require only O(k) memory spaces and do not require any prior knowledge on the stream such as its length and the number of distinct items in the stream. Second, we propose TwMinSwap-Is to find top-k recently frequent itemsets in data streams. We especially focus on keeping self-consistency of the discovered itemsets, which is the most important property for reliable results, while using O(k) memory space with the assumption of a constant itemset size. Through extensive experiments, we demonstrate that TwMinSwap outperforms all competitors in terms of accuracy and memory usage, with fast running time. We also show that TwMinSwap-Is is more accurate than the competitor and discovers recently frequent itemsets with reasonably large sizes (at most 5–7) depending on datasets. Thanks to TwMinSwap and TwMinSwap-Is, we report interesting discoveries in real world data streams, including the difference of trends between the winner and the loser of U.S. presidential candidates, and temporal human contact patterns.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Fußnoten
1
If an itemset \(\mu \) is frequent, its every subset \(\nu \subseteq \mu \) is also frequent.
 
2
Here, \(binomial(\omega ,\theta )\) denotes a binomial random variable with the number \(\omega \) of independent trials and the success probability \(\theta \).
 
3
This is a different concept from closed frequent itemsets [2].
 
4
In the original paper proposing Skip LC-SS, k is set to a large \(50{,}000\le k\le 70{,}000\).
 
Literatur
1.
Zurück zum Zitat Arasu A, Manku GS (2004) Approximate counts and quantiles over sliding windows. In: PODS Arasu A, Manku GS (2004) Approximate counts and quantiles over sliding windows. In: PODS
2.
Zurück zum Zitat Borgelt C, Yang X, Nogales-Cadenas R, Carmona-Saez P, Pascual-Montano A (2011) Finding closed frequent item sets by intersecting transactions. In: EDBT Borgelt C, Yang X, Nogales-Cadenas R, Carmona-Saez P, Pascual-Montano A (2011) Finding closed frequent item sets by intersecting transactions. In: EDBT
3.
Zurück zum Zitat Boyer RS, Moore JS (1991) Mjrty: a fast majority vote algorithm. In: Automated reasoning: essays in honor of Woody Bledsoe Boyer RS, Moore JS (1991) Mjrty: a fast majority vote algorithm. In: Automated reasoning: essays in honor of Woody Bledsoe
4.
Zurück zum Zitat Brijs T, Swinnen G, Vanhoof K, Wets G (1999) Using association rules for product assortment decisions: a case study. In: Knowledge discovery and data mining Brijs T, Swinnen G, Vanhoof K, Wets G (1999) Using association rules for product assortment decisions: a case study. In: Knowledge discovery and data mining
5.
Zurück zum Zitat Calders T, Dexters N, Gillis JJM, Goethals B (2014) Mining frequent itemsets in a stream. Inf Syst 39:233–255CrossRef Calders T, Dexters N, Gillis JJM, Goethals B (2014) Mining frequent itemsets in a stream. Inf Syst 39:233–255CrossRef
6.
Zurück zum Zitat Chang JH, Lee WS (2003) Finding recent frequent itemsets adaptively over online data streams. In: KDD Chang JH, Lee WS (2003) Finding recent frequent itemsets adaptively over online data streams. In: KDD
7.
Zurück zum Zitat Chang J H, Lee W S (2004) Decaying obsolete information in finding recent frequent itemsets over data streams. IEICE Trans 87–D(6):1588–1592 Chang J H, Lee W S (2004) Decaying obsolete information in finding recent frequent itemsets over data streams. IEICE Trans 87–D(6):1588–1592
8.
Zurück zum Zitat Charikar M, Chen K, Farach-Colton M (2002) Finding frequent items in data streams. In: ICALP Charikar M, Chen K, Farach-Colton M (2002) Finding frequent items in data streams. In: ICALP
10.
Zurück zum Zitat Cormode G, Hadjieleftheriou M (2010) Methods for finding frequent items in data streams. VLDB J 19(1):3–20CrossRef Cormode G, Hadjieleftheriou M (2010) Methods for finding frequent items in data streams. VLDB J 19(1):3–20CrossRef
11.
Zurück zum Zitat Cormode G, Muthukrishnan S (2004a) An improved data stream summary: the count-min sketch and its applications. In: LATIN Cormode G, Muthukrishnan S (2004a) An improved data stream summary: the count-min sketch and its applications. In: LATIN
12.
Zurück zum Zitat Cormode G, Muthukrishnan S (2004b) What’s new: finding significant differences in network data streams. In: INFOCOM Cormode G, Muthukrishnan S (2004b) What’s new: finding significant differences in network data streams. In: INFOCOM
13.
Zurück zum Zitat Dallachiesa M, Palpanas T (2013) Identifying streaming frequent items in ad hoc time windows. Data Knowl Eng 87:66–90CrossRef Dallachiesa M, Palpanas T (2013) Identifying streaming frequent items in ad hoc time windows. Data Knowl Eng 87:66–90CrossRef
14.
Zurück zum Zitat Demaine ED, Lpez-Ortiz A, Munro JI (2002) Frequency estimation of internet packet streams with limited space. In: ESA Demaine ED, Lpez-Ortiz A, Munro JI (2002) Frequency estimation of internet packet streams with limited space. In: ESA
15.
Zurück zum Zitat Eagle N, Pentland A (2006) Reality mining: sensing complex social systems. Pers Ubiquitous Comput 10(4):255–268CrossRef Eagle N, Pentland A (2006) Reality mining: sensing complex social systems. Pers Ubiquitous Comput 10(4):255–268CrossRef
16.
Zurück zum Zitat Fischer MJ, Salzberg SL (1982) Finding a majority among \(n\) votes: solution to problem 81–5 (Journal of Algorithms, June 1981). J Algorithms 3(4):362–380CrossRef Fischer MJ, Salzberg SL (1982) Finding a majority among \(n\) votes: solution to problem 81–5 (Journal of Algorithms, June 1981). J Algorithms 3(4):362–380CrossRef
17.
Zurück zum Zitat Gibbons PB, Matias Y (1998) New sampling-based summary statistics for improving approximate query answers. In: SIGMOD Gibbons PB, Matias Y (1998) New sampling-based summary statistics for improving approximate query answers. In: SIGMOD
18.
Zurück zum Zitat Golab L, DeHaan D, Demaine ED, López-Ortiz A, Munro JI (2003) Identifying frequent items in sliding windows over on-line packet streams. In: Internet measurement conference Golab L, DeHaan D, Demaine ED, López-Ortiz A, Munro JI (2003) Identifying frequent items in sliding windows over on-line packet streams. In: Internet measurement conference
19.
Zurück zum Zitat Golab L, DeHaan D, López-Ortiz A, Demaine ED (2004) Finding frequent items in sliding windows with multinomially-distributed item frequencies. In: SSDBM Golab L, DeHaan D, López-Ortiz A, Demaine ED (2004) Finding frequent items in sliding windows with multinomially-distributed item frequencies. In: SSDBM
20.
Zurück zum Zitat Jin C, Qian W, Sha C, Yu JX, Zhou A (2003) Dynamically maintaining frequent items over a data stream. In: CIKM Jin C, Qian W, Sha C, Yu JX, Zhou A (2003) Dynamically maintaining frequent items over a data stream. In: CIKM
21.
Zurück zum Zitat Jin R, Agrawal G (2005) An algorithm for in-core frequent itemset mining on streaming data. In: ICDM Jin R, Agrawal G (2005) An algorithm for in-core frequent itemset mining on streaming data. In: ICDM
22.
Zurück zum Zitat Karp RM, Shenker S, Papadimitriou CH (2003) A simple algorithm for finding frequent elements in streams and bags. ACM Trans Database Syst 28:51–55CrossRef Karp RM, Shenker S, Papadimitriou CH (2003) A simple algorithm for finding frequent elements in streams and bags. ACM Trans Database Syst 28:51–55CrossRef
23.
Zurück zum Zitat Lee D, Lee W (2005) Finding maximal frequent itemsets over online data streams adaptively. In: ICDM Lee D, Lee W (2005) Finding maximal frequent itemsets over online data streams adaptively. In: ICDM
24.
Zurück zum Zitat Leskovec J, Backstrom L, Kleinberg J (2009) Meme-tracking and the dynamics of the news cycle. In: KDD Leskovec J, Backstrom L, Kleinberg J (2009) Meme-tracking and the dynamics of the news cycle. In: KDD
25.
Zurück zum Zitat Lewis DD, Yang Y, Rose TG, Li F (2004) Rcv1: a new benchmark collection for text categorization research. J Mach Learn Res 5:361–397 Lewis DD, Yang Y, Rose TG, Li F (2004) Rcv1: a new benchmark collection for text categorization research. J Mach Learn Res 5:361–397
26.
Zurück zum Zitat Li H, Lee S, Shan M (2005) Online mining (recently) maximal frequent itemsets over data streams. In: 15th international workshop on research issues in data engineering Li H, Lee S, Shan M (2005) Online mining (recently) maximal frequent itemsets over data streams. In: 15th international workshop on research issues in data engineering
27.
Zurück zum Zitat Li H, Zhang N, Chen Z (2012) A simple but effective maximal frequent itemset mining algorithm over streams. JSW 7(1):25–32 Li H, Zhang N, Chen Z (2012) A simple but effective maximal frequent itemset mining algorithm over streams. JSW 7(1):25–32
28.
Zurück zum Zitat Lim Y, Choi J, Kang U (2014) Fast, accurate, and space-efficient tracking of time-weighted frequent items from data streams. In: CIKM Lim Y, Choi J, Kang U (2014) Fast, accurate, and space-efficient tracking of time-weighted frequent items from data streams. In: CIKM
29.
Zurück zum Zitat Lim Y, Jung M, Kang U (2017) Memory-efficient and accurate sampling for counting local triangles in graph streams: from simple to multigraphs. ACM Trans Knowl Discov Data 11(4) Lim Y, Jung M, Kang U (2017) Memory-efficient and accurate sampling for counting local triangles in graph streams: from simple to multigraphs. ACM Trans Knowl Discov Data 11(4)
30.
Zurück zum Zitat Lim Y, Kang U (2015) Mascot: memory-efficient and accurate sampling for counting local triangles in graph streams. In: KDD Lim Y, Kang U (2015) Mascot: memory-efficient and accurate sampling for counting local triangles in graph streams. In: KDD
31.
Zurück zum Zitat Liu H, Lu Y, Han J, He J (2006) Error-adaptive and time-aware maintenance of frequency counts over data streams. In: WAIM Liu H, Lu Y, Han J, He J (2006) Error-adaptive and time-aware maintenance of frequency counts over data streams. In: WAIM
32.
Zurück zum Zitat Manerikar N, Palpanas T (2009) Frequent items in streaming data: an experimental evaluation of the state-of-the-art. Data Knowl Eng 68(4):415–430CrossRef Manerikar N, Palpanas T (2009) Frequent items in streaming data: an experimental evaluation of the state-of-the-art. Data Knowl Eng 68(4):415–430CrossRef
33.
Zurück zum Zitat Manku GS, Motwani R (2002) Approximate frequency counts over data streams. In: VLDB Manku GS, Motwani R (2002) Approximate frequency counts over data streams. In: VLDB
34.
Zurück zum Zitat McAuley JJ, Leskovec J (2013) From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews. In: WWW McAuley JJ, Leskovec J (2013) From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews. In: WWW
35.
Zurück zum Zitat Metwally A, Agrawal D, Abbadi AE (2005) Efficient computation of frequent and top-k elements in data streams. In: ICDT Metwally A, Agrawal D, Abbadi AE (2005) Efficient computation of frequent and top-k elements in data streams. In: ICDT
38.
Zurück zum Zitat Yamamoto Y, Iwanuma K, Fukuda S (2014) Resource-oriented approximation for frequent itemset mining from bursty data streams. In: SIGMOD Yamamoto Y, Iwanuma K, Fukuda S (2014) Resource-oriented approximation for frequent itemset mining from bursty data streams. In: SIGMOD
39.
Zurück zum Zitat Zhang S, Chen L, Tu L (2009a) Frequent items mining on data stream based on time fading factor. In: AICI Zhang S, Chen L, Tu L (2009a) Frequent items mining on data stream based on time fading factor. In: AICI
40.
Zurück zum Zitat Zhang S, Chen L, Tu L (2009b) Frequent items mining on data stream using hash-table and heap. In: ICIS Zhang S, Chen L, Tu L (2009b) Frequent items mining on data stream using hash-table and heap. In: ICIS
Metadaten
Titel
Time-weighted counting for recently frequent pattern mining in data streams
verfasst von
Yongsub Lim
U. Kang
Publikationsdatum
22.03.2017
Verlag
Springer London
Erschienen in
Knowledge and Information Systems / Ausgabe 2/2017
Print ISSN: 0219-1377
Elektronische ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-017-1045-1

Weitere Artikel der Ausgabe 2/2017

Knowledge and Information Systems 2/2017 Zur Ausgabe