Skip to main content
Erschienen in: Knowledge and Information Systems 2/2017

20.03.2017 | Regular Paper

Distributed and scalable sequential pattern mining through stream processing

verfasst von: Chun-Chieh Chen, Hong-Han Shuai, Ming-Syan Chen

Erschienen in: Knowledge and Information Systems | Ausgabe 2/2017

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Scalability is a primary issue in existing sequential pattern mining algorithms for dealing with a large amount of data. Previous work, namely sequential pattern mining on the cloud (SPAMC), has already addressed the scalability problem. It supports the MapReduce cloud computing architecture for mining frequent sequential patterns on large datasets. However, this existing algorithm does not address the iterative mining problem, which is the problem that reloading data incur additional costs. Furthermore, it did not study the load balancing problem. To remedy these problems, we devised a powerful sequential pattern mining algorithm, the sequential pattern mining in the cloud-uniform distributed lexical sequence tree algorithm (SPAMC-UDLT), exploiting MapReduce and streaming processes. SPAMC-UDLT dramatically improves overall performance without launching multiple MapReduce rounds and provides perfect load balancing across machines in the cloud. The results show that SPAMC-UDLT can significantly reduce execution time, achieves extremely high scalability, and provides much better load balancing than existing algorithms in the cloud.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Fußnoten
3
If the bitmap vector is extremely sparse, the word-aligned hybrid code (WAH) [44] can serve for our goal. Specifically, WAH is a run-length encoding for compressing input data to words, where ANDs can be efficiently performed on any two words, and thus the bitmap representations can still work in this situation.
 
Literatur
3.
Zurück zum Zitat Agrawal R, Srikant R (1995) Mining sequential patterns. In: Proceedings of the 11th international conference on data engineering (ICDE’95), pp 3–14 Agrawal R, Srikant R (1995) Mining sequential patterns. In: Proceedings of the 11th international conference on data engineering (ICDE’95), pp 3–14
4.
Zurück zum Zitat Ayres J, Flannick J, Gehrke J et al (2002) Sequential pattern mining using a bitmap representation. In: Proceedings of the 8th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’02), pp 429–435 Ayres J, Flannick J, Gehrke J et al (2002) Sequential pattern mining using a bitmap representation. In: Proceedings of the 8th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’02), pp 429–435
5.
Zurück zum Zitat Batal I, Valizadegan H, Cooper GF et al (2013) A temporal pattern mining approach for classifying electronic health record data. Trans Intell Syst Technol (TIST’13) 63:1–22 Batal I, Valizadegan H, Cooper GF et al (2013) A temporal pattern mining approach for classifying electronic health record data. Trans Intell Syst Technol (TIST’13) 63:1–22
6.
Zurück zum Zitat Bu Y, Howe B, Balazinska M et al (2010) Haloop: efficient iterative data processing on large clusters. In: Proceedings of the VLDB endowment (PVLDB’10), pp 285–296 Bu Y, Howe B, Balazinska M et al (2010) Haloop: efficient iterative data processing on large clusters. In: Proceedings of the VLDB endowment (PVLDB’10), pp 285–296
7.
Zurück zum Zitat Chen CC, Tseng CY, Chen MS (2013) Highly scalable sequential pattern mining based on MapReduce model on the cloud. IEEE international congress on big data (BigData Congress’13), pp 310–317 Chen CC, Tseng CY, Chen MS (2013) Highly scalable sequential pattern mining based on MapReduce model on the cloud. IEEE international congress on big data (BigData Congress’13), pp 310–317
9.
Zurück zum Zitat Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM (CACM’08) 51:107–113 Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM (CACM’08) 51:107–113
10.
Zurück zum Zitat Ekanayake J, Li H, Zhang B et al (2010) Twister: a runtime for iterative MapReduce. In: Proceeding of the 19th ACM international symposium on high performance distributed computing (HPDC’10), pp 810–818 Ekanayake J, Li H, Zhang B et al (2010) Twister: a runtime for iterative MapReduce. In: Proceeding of the 19th ACM international symposium on high performance distributed computing (HPDC’10), pp 810–818
11.
Zurück zum Zitat Fang W, Lu M, Xiao X et al (2009) Frequent itemset mining on graphics processors. In: Proceedings of the 5th international workshop on data management on new hardware (DaMoN’09), pp 34–42 Fang W, Lu M, Xiao X et al (2009) Frequent itemset mining on graphics processors. In: Proceedings of the 5th international workshop on data management on new hardware (DaMoN’09), pp 34–42
12.
Zurück zum Zitat Gomariz A, Campos M, Marin R et al (2013) ClaSP: an efficient algorithm for mining frequent closed sequences. In: Proceedings of the 17th Pacific-Asia conference on knowledge discovery and data mining (PAKDD’13), pp 50–61 Gomariz A, Campos M, Marin R et al (2013) ClaSP: an efficient algorithm for mining frequent closed sequences. In: Proceedings of the 17th Pacific-Asia conference on knowledge discovery and data mining (PAKDD’13), pp 50–61
13.
Zurück zum Zitat Goodhope K, Koshy J, Kreps J et al (2012) Building LinkedIn’s real-time activity data pipeline. IEEE Data Eng Bull (Data Eng Bull’12) 35:33–45 Goodhope K, Koshy J, Kreps J et al (2012) Building LinkedIn’s real-time activity data pipeline. IEEE Data Eng Bull (Data Eng Bull’12) 35:33–45
14.
Zurück zum Zitat Guralnik V, Karypis G (2004) Parallel tree-projection-based sequence mining algorithms. Parallel Comput (PARALLEL COMPUT’04) 30:443–472CrossRef Guralnik V, Karypis G (2004) Parallel tree-projection-based sequence mining algorithms. Parallel Comput (PARALLEL COMPUT’04) 30:443–472CrossRef
15.
Zurück zum Zitat Han J, Pei J, Mortazavi-Asl B et al (2000) FreeSpan: frequent pattern-projected sequential pattern mining. In: Proceedings of the 6th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’00), pp 355–359 Han J, Pei J, Mortazavi-Asl B et al (2000) FreeSpan: frequent pattern-projected sequential pattern mining. In: Proceedings of the 6th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’00), pp 355–359
16.
Zurück zum Zitat Han J, Pei J, Yan X (2005) Sequential pattern mining by pattern-growth: principles and extension. Foundations and advances in data mining. Springer, BerlinMATH Han J, Pei J, Yan X (2005) Sequential pattern mining by pattern-growth: principles and extension. Foundations and advances in data mining. Springer, BerlinMATH
17.
Zurück zum Zitat Ho J, Lukov L, Chawla S (2005) Sequential pattern mining with constraints on large protein databases. In: Proceedings of the 12th international conference on management of data (COMAD’05), pp 89–100 Ho J, Lukov L, Chawla S (2005) Sequential pattern mining with constraints on large protein databases. In: Proceedings of the 12th international conference on management of data (COMAD’05), pp 89–100
18.
Zurück zum Zitat Huang JW, Tseng CY, Ou JC et al (2008) A general model for sequential pattern mining with a progressive database. IEEE Trans Knowl Data Eng (TKDE’08) 20:1153–1167CrossRef Huang JW, Tseng CY, Ou JC et al (2008) A general model for sequential pattern mining with a progressive database. IEEE Trans Knowl Data Eng (TKDE’08) 20:1153–1167CrossRef
19.
Zurück zum Zitat Huang JW, Lin SC, Chen MS (2010) DPSP: distributed progressive sequential pattern mining on the cloud. 14th Pacific–Asia conference on knowledge discovery and data mining (PAKDD’10), pp 27–34 Huang JW, Lin SC, Chen MS (2010) DPSP: distributed progressive sequential pattern mining on the cloud. 14th Pacific–Asia conference on knowledge discovery and data mining (PAKDD’10), pp 27–34
20.
Zurück zum Zitat Isard M, Budiu M, Yu Y et al (2007) Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper Syst Rev (SIGOPS’07) 41:59–72CrossRef Isard M, Budiu M, Yu Y et al (2007) Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper Syst Rev (SIGOPS’07) 41:59–72CrossRef
21.
Zurück zum Zitat Ji X, Bailey J, Dong G (2007) Mining minimal distinguishing subsequence patterns with gap constraints. Knowl Inf Syst (KAIS’07) 11:259–286CrossRef Ji X, Bailey J, Dong G (2007) Mining minimal distinguishing subsequence patterns with gap constraints. Knowl Inf Syst (KAIS’07) 11:259–286CrossRef
22.
Zurück zum Zitat Kreps J, Narkhede N, Rao J (2011) Kafka: a distributed messaging system for log processing. NetDB workshop Kreps J, Narkhede N, Rao J (2011) Kafka: a distributed messaging system for log processing. NetDB workshop
23.
Zurück zum Zitat Liao CC, Chen MS (2014) DFSP: a Depth-First SPelling algorithm for sequential pattern mining of biological sequences. Knowl Inf Syst (KAIS’14) 38:623–639CrossRef Liao CC, Chen MS (2014) DFSP: a Depth-First SPelling algorithm for sequential pattern mining of biological sequences. Knowl Inf Syst (KAIS’14) 38:623–639CrossRef
24.
Zurück zum Zitat Luo C, Chung S (2008) A scalable algorithm for mining maximal frequent sequences using a sample. Knowl Inf Syst (KAIS’08) 15:149–179CrossRef Luo C, Chung S (2008) A scalable algorithm for mining maximal frequent sequences using a sample. Knowl Inf Syst (KAIS’08) 15:149–179CrossRef
25.
Zurück zum Zitat Mabroukeh NR, Ezeife CI (2010) A taxonomy of sequential pattern mining algorithms. ACM Comput Surv (CSUR’10) 43:1–41CrossRef Mabroukeh NR, Ezeife CI (2010) A taxonomy of sequential pattern mining algorithms. ACM Comput Surv (CSUR’10) 43:1–41CrossRef
26.
Zurück zum Zitat Mane RV (2013) A comparative study of Spam and PrefixSpan sequential pattern mining algorithm for protein sequences. In: Proceedings of the 3rd international conference on advances in computing, communication, and control (ICAC3’13), pp 147–155 Mane RV (2013) A comparative study of Spam and PrefixSpan sequential pattern mining algorithm for protein sequences. In: Proceedings of the 3rd international conference on advances in computing, communication, and control (ICAC3’13), pp 147–155
27.
Zurück zum Zitat Miliaraki I, Berberich K, Gemulla R et al (2013) Mind the gap: large-scale frequent sequence mining. In: Proceedings of the 2013 ACM SIGMOD international conference on management of data (SIGMOD’13), pp 797–808 Miliaraki I, Berberich K, Gemulla R et al (2013) Mind the gap: large-scale frequent sequence mining. In: Proceedings of the 2013 ACM SIGMOD international conference on management of data (SIGMOD’13), pp 797–808
28.
Zurück zum Zitat Papapetrou P, Kollios G, Sclaroff S et al (2009) Mining frequent arrangements of temporal intervals. Knowl Inf Syst (KAIS’09) 21:133–171CrossRef Papapetrou P, Kollios G, Sclaroff S et al (2009) Mining frequent arrangements of temporal intervals. Knowl Inf Syst (KAIS’09) 21:133–171CrossRef
29.
Zurück zum Zitat Parimala M, Sathiyabama S (2012) SPMLS: an efficient sequential pattern mining algorithm with candidate generation and frequency testing. Int J Comput Sci Eng (IJCSE’12) 4:601–607 Parimala M, Sathiyabama S (2012) SPMLS: an efficient sequential pattern mining algorithm with candidate generation and frequency testing. Int J Comput Sci Eng (IJCSE’12) 4:601–607
30.
Zurück zum Zitat Pei J, Han J, Mortazavi-asl B et al (2001) PrefixSpan: mining sequential patterns efficiently by prefix-projected pattern growth. In: Proceedings of the 7th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’01), pp 215–224 Pei J, Han J, Mortazavi-asl B et al (2001) PrefixSpan: mining sequential patterns efficiently by prefix-projected pattern growth. In: Proceedings of the 7th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’01), pp 215–224
31.
Zurück zum Zitat Perer A, Wang F (2014) Frequence: interactive mining and visualization of temporal frequent event sequences. In: Proceedings of the 19th ACM international conference on intelligent user interfaces (IUI’14), pp 153–162 Perer A, Wang F (2014) Frequence: interactive mining and visualization of temporal frequent event sequences. In: Proceedings of the 19th ACM international conference on intelligent user interfaces (IUI’14), pp 153–162
32.
Zurück zum Zitat Sahli M, Mansour E, Kalnis P (2014) ACME: a scalable parallel system for extracting frequent patterns from a very long sequence. VLDB J (VLDBJ’14) 23:871–893CrossRef Sahli M, Mansour E, Kalnis P (2014) ACME: a scalable parallel system for extracting frequent patterns from a very long sequence. VLDB J (VLDBJ’14) 23:871–893CrossRef
33.
Zurück zum Zitat Shie BE, Hsiao HF, Tseng V (2013) Efficient algorithms for discovering high utility user behavior patterns in mobile commerce environments. Knowl Inf Syst (KAIS’13) 37:363–387CrossRef Shie BE, Hsiao HF, Tseng V (2013) Efficient algorithms for discovering high utility user behavior patterns in mobile commerce environments. Knowl Inf Syst (KAIS’13) 37:363–387CrossRef
34.
Zurück zum Zitat Srikant R, Agrawal R (1996) Mining sequential patterns: generalizations and performance improvements. In: Proceedings of the 5th international conference on extending database technology (EDBT’96), pp 3–17 Srikant R, Agrawal R (1996) Mining sequential patterns: generalizations and performance improvements. In: Proceedings of the 5th international conference on extending database technology (EDBT’96), pp 3–17
40.
Zurück zum Zitat White Tom (2009) Hadoop: the definitive guide. O’Reilly Media, Newton White Tom (2009) Hadoop: the definitive guide. O’Reilly Media, Newton
41.
Zurück zum Zitat Wang K, Xu Y, Yu JX (2004) Scalable sequential pattern mining for biological sequences. In: Proceedings of the 13th ACM international conference on information and knowledge management (CIKM’04), pp 178–187 Wang K, Xu Y, Yu JX (2004) Scalable sequential pattern mining for biological sequences. In: Proceedings of the 13th ACM international conference on information and knowledge management (CIKM’04), pp 178–187
42.
Zurück zum Zitat Wang X, Wang J, Wang T et al (2010) Parallel sequential pattern mining by transaction decomposition. International conference on fuzzy systems and knowledge discovery (FSKD’10), pp 1746–1750 Wang X, Wang J, Wang T et al (2010) Parallel sequential pattern mining by transaction decomposition. International conference on fuzzy systems and knowledge discovery (FSKD’10), pp 1746–1750
43.
44.
Zurück zum Zitat Wu K, Otoo EJ, Shoshani A (2002) Compressing bitmap indexes for faster search operations. In: Proceedings of 14th international conference on scientific and statistical database management (SSDBM’02), pp 99–108 Wu K, Otoo EJ, Shoshani A (2002) Compressing bitmap indexes for faster search operations. In: Proceedings of 14th international conference on scientific and statistical database management (SSDBM’02), pp 99–108
45.
Zurück zum Zitat Yu D, Wu W, Zheng S et al (2012) BIDE-based parallel mining of frequent closed sequences with MapReduce. In: Proceedings of the 12th international conference on algorithms and architectures for parallel processing (ICA3PP’12), pp 177–186 Yu D, Wu W, Zheng S et al (2012) BIDE-based parallel mining of frequent closed sequences with MapReduce. In: Proceedings of the 12th international conference on algorithms and architectures for parallel processing (ICA3PP’12), pp 177–186
46.
Zurück zum Zitat Yu D, Zhu Q, Shao J et al (2014) Parallel execution of data-intensive web services based on data-flow constructs and I/O operation ratio. Int J Database Theory Appl (IJDTA’14) 7:129–138CrossRef Yu D, Zhu Q, Shao J et al (2014) Parallel execution of data-intensive web services based on data-flow constructs and I/O operation ratio. Int J Database Theory Appl (IJDTA’14) 7:129–138CrossRef
47.
Zurück zum Zitat Zaharia M, Chowdhury M, Das T et al (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation (NSDI’12), p 2 Zaharia M, Chowdhury M, Das T et al (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation (NSDI’12), p 2
48.
Zurück zum Zitat Zaharia M, Chowdhury M, Das T et al (2012) Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters. In: Proceedings of the 4th USENIX conference on hot topics in cloud computing (HotCloud’12), pp 215–224 Zaharia M, Chowdhury M, Das T et al (2012) Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters. In: Proceedings of the 4th USENIX conference on hot topics in cloud computing (HotCloud’12), pp 215–224
49.
Zurück zum Zitat Zaki MJ (1998) Efficient enumeration of frequent sequences. In: Proceedings of the 7th ACM international conference on information and knowledge management (CIKM’98), pp 68–75 Zaki MJ (1998) Efficient enumeration of frequent sequences. In: Proceedings of the 7th ACM international conference on information and knowledge management (CIKM’98), pp 68–75
50.
Zurück zum Zitat Zaki MJ (2001) Parallel sequence mining on shared-memory machines. J Parallel Distrib Comput (JPDC’01) 61:401–426CrossRefMATH Zaki MJ (2001) Parallel sequence mining on shared-memory machines. J Parallel Distrib Comput (JPDC’01) 61:401–426CrossRefMATH
51.
Zurück zum Zitat Zhao Q, Bhowmick SS (2003) Sequential pattern matching: a survey. ITechnical report CAIS Nayang Technological University Singapore, pp 1–26 Zhao Q, Bhowmick SS (2003) Sequential pattern matching: a survey. ITechnical report CAIS Nayang Technological University Singapore, pp 1–26
Metadaten
Titel
Distributed and scalable sequential pattern mining through stream processing
verfasst von
Chun-Chieh Chen
Hong-Han Shuai
Ming-Syan Chen
Publikationsdatum
20.03.2017
Verlag
Springer London
Erschienen in
Knowledge and Information Systems / Ausgabe 2/2017
Print ISSN: 0219-1377
Elektronische ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-017-1037-1

Weitere Artikel der Ausgabe 2/2017

Knowledge and Information Systems 2/2017 Zur Ausgabe