Skip to main content
Erschienen in: World Wide Web 1/2019

10.05.2018

Scalable and parallel sequential pattern mining using spark

verfasst von: Xiao Yu, Qing Li, Jin Liu

Erschienen in: World Wide Web | Ausgabe 1/2019

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

The performance of the existing parallel sequential pattern mining algorithms is often unsatisfactory due to high IO overhead and imbalanced load among the computing nodes. To address such problems, this paper proposes two efficient parallel sequential pattern mining algorithms based on Spark, i.e., GSP-S (GSP algorithm based on Spark) and PrefixSpan-S (PrefixSpan algorithm based on Spark). For both algorithms, multiple MapReduce jobs are implemented to complete a mining task. To reduce IO overhead and take advantage of cluster memory, the first MapReduce job loads sequence database from the Hadoop Distributed File System (HDFS) into the Spark resilient distributed datasets (RDDs), and further MapReduce jobs read the database from the RDDs and store intermediate results back into the RDDs. Our findings suggest that a wise choice can be made between GSP-S and PrefixSpan-S, depending on the user-specified minimum support threshold. Moreover, theoretical analysis shows that GSP-S and PrefixSpan-S are sensitive to data distribution on the cluster. To further improve performance, we propose two database partition strategies to balance load among the computing nodes in a cluster. Experiment results demonstrate the high performance of GSP-S and PrefixSpan-S in terms of load-balancing, speedup and scalability.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Literatur
1.
Zurück zum Zitat Aggarwal, C.-C., Han, J.: Frequent pattern mining. Springer. Aggarwal, C.-C., Han, J.: Frequent pattern mining. Springer.
2.
Zurück zum Zitat Agrawal, R., Srikant, R.: Mining sequential pattern. In: 11th International Conference on Data Engineering, pp. 3–14. IEEE(1995) Agrawal, R., Srikant, R.: Mining sequential pattern. In: 11th International Conference on Data Engineering, pp. 3–14. IEEE(1995)
3.
Zurück zum Zitat Armbrust, M., Das, T., Davidson, A., Ghodsi, A., Or, A., Rosen, J., Zaharia, M.: Scaling spark in the real world: performance and usability. Proceedings of the VLDB Endowment. 8(12), 1840–1843 (2015)CrossRef Armbrust, M., Das, T., Davidson, A., Ghodsi, A., Or, A., Rosen, J., Zaharia, M.: Scaling spark in the real world: performance and usability. Proceedings of the VLDB Endowment. 8(12), 1840–1843 (2015)CrossRef
4.
Zurück zum Zitat Ayres, J., Gehrke, J., Yiu, T., et al: Sequential pattern mining using a bitmap representation. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 429–435(2002) Ayres, J., Gehrke, J., Yiu, T., et al: Sequential pattern mining using a bitmap representation. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 429–435(2002)
6.
Zurück zum Zitat Chen, C.-C., Tseng, C.-Y., Chen, M.-S.: Highly scalable sequential pattern mining based on mapreduce model on the cloud. In: 2013 I.E. International Congress on Big Data, pp. 310–317. IEEE (2013) Chen, C.-C., Tseng, C.-Y., Chen, M.-S.: Highly scalable sequential pattern mining based on mapreduce model on the cloud. In: 2013 I.E. International Congress on Big Data, pp. 310–317. IEEE (2013)
7.
Zurück zum Zitat Hu, Y., Cheng-Kui Huang, T.: Knowledge discovery of weighted RFM sequential patterns from customer sequence databases. J. Syst. Softw., vol. 86, no. 3, pp. 779–788(2013) Hu, Y., Cheng-Kui Huang, T.: Knowledge discovery of weighted RFM sequential patterns from customer sequence databases. J. Syst. Softw., vol. 86, no. 3, pp. 779–788(2013)
8.
Zurück zum Zitat Cong, S., Han, J., Padua, D.: Parallel mining of closed sequential patterns. In: KDD '05 Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pp. 562–567(2005) Cong, S., Han, J., Padua, D.: Parallel mining of closed sequential patterns. In: KDD '05 Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pp. 562–567(2005)
9.
Zurück zum Zitat Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM. 51(1), 107–113 (2008)CrossRef Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM. 51(1), 107–113 (2008)CrossRef
10.
Zurück zum Zitat Fournier-Viger, P., Wu, C.-W., Tseng, V.-S.: Mining maximal sequential patterns without candidate maintenance. In: International Conference on Advanced Data Mining and Applications, Springer, Berlin, Heidelberg, pp. 169–180(2013) Fournier-Viger, P., Wu, C.-W., Tseng, V.-S.: Mining maximal sequential patterns without candidate maintenance. In: International Conference on Advanced Data Mining and Applications, Springer, Berlin, Heidelberg, pp. 169–180(2013)
11.
Zurück zum Zitat Guan, E.-Z., Chang, X.-Y., Wang, Z., Zhou, C.-G.: Mining maximal sequential patterns.In: Proc of the Second Int’l Conf. Neural Networks and Brain, pp. 525–528(2005) Guan, E.-Z., Chang, X.-Y., Wang, Z., Zhou, C.-G.: Mining maximal sequential patterns.In: Proc of the Second Int’l Conf. Neural Networks and Brain, pp. 525–528(2005)
12.
Zurück zum Zitat Gurainik, V., Garg, N., Karypis, G.: Parallel tree projection algorithm for sequence mining. In: 7th International Euro-Par Conference on Parallel Processing, pp. 310–320(2001) Gurainik, V., Garg, N., Karypis, G.: Parallel tree projection algorithm for sequence mining. In: 7th International Euro-Par Conference on Parallel Processing, pp. 310–320(2001)
14.
Zurück zum Zitat Han, J., Pei, J., Mortazavi-Asl, B., et al.: FreeSpan: frequent pattern-projected sequential pattern mining. In: Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 355–359(2000) Han, J., Pei, J., Mortazavi-Asl, B., et al.: FreeSpan: frequent pattern-projected sequential pattern mining. In: Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 355–359(2000)
15.
Zurück zum Zitat Huang, J., Lin, S., Chen, M.: DPSP: distributed progressive sequential pattern mining on the cloud. Advances in Knowledge Discovery and Data Mining. 27–34 (2010) Huang, J., Lin, S., Chen, M.: DPSP: distributed progressive sequential pattern mining on the cloud. Advances in Knowledge Discovery and Data Mining. 27–34 (2010)
16.
Zurück zum Zitat Kessl, R.: Probabilistic static load-balancing of parallel mining of frequent sequences. IEEE Trans. Knowl. Data Eng. 28(5), 1299–1311 (2016)CrossRef Kessl, R.: Probabilistic static load-balancing of parallel mining of frequent sequences. IEEE Trans. Knowl. Data Eng. 28(5), 1299–1311 (2016)CrossRef
17.
Zurück zum Zitat Leung, C.-K.-S., MacKinnon, R.-K., Jiang, F.: Finding efficiencies in frequent pattern mining from big uncertain data. World Wide Web. 20(3), 571–594 (2017)CrossRef Leung, C.-K.-S., MacKinnon, R.-K., Jiang, F.: Finding efficiencies in frequent pattern mining from big uncertain data. World Wide Web. 20(3), 571–594 (2017)CrossRef
18.
Zurück zum Zitat Li, C., Yang, Q., Wang, J., Li, M.: Efficient mining of gap-constrained subsequences and its various applications. ACM Trans. Knowl. Discov. Data. 6(1), 2:1–2:39 (2012)CrossRef Li, C., Yang, Q., Wang, J., Li, M.: Efficient mining of gap-constrained subsequences and its various applications. ACM Trans. Knowl. Discov. Data. 6(1), 2:1–2:39 (2012)CrossRef
19.
Zurück zum Zitat Liao, V.-C.-C., Chen, M.-S.: DFSP: a depth-first SPelling algorithm for sequential pattern mining of biological sequences. Knowl. Inf. Syst. 38(3), 623–639 (2014)CrossRef Liao, V.-C.-C., Chen, M.-S.: DFSP: a depth-first SPelling algorithm for sequential pattern mining of biological sequences. Knowl. Inf. Syst. 38(3), 623–639 (2014)CrossRef
20.
Zurück zum Zitat Liu, C., Yao, L., Li, J., Zhou, R., He, Z.: Finding smallest k-compact tree set for keyword queries on graphs using mapreduce. World Wide Web. 19(3), 499–518 (2016)CrossRef Liu, C., Yao, L., Li, J., Zhou, R., He, Z.: Finding smallest k-compact tree set for keyword queries on graphs using mapreduce. World Wide Web. 19(3), 499–518 (2016)CrossRef
21.
Zurück zum Zitat Lu, S., Li, C.: AprioriAdjust: an efficient algorithm for discovering the maximum sequential patterns. In: Proc. 2nd Int’l Workshop Knowl. Grid and Grid Intell(2004) Lu, S., Li, C.: AprioriAdjust: an efficient algorithm for discovering the maximum sequential patterns. In: Proc. 2nd Int’l Workshop Knowl. Grid and Grid Intell(2004)
22.
Zurück zum Zitat Luo, C., Chung, S. M.: Efficient mining of maximal sequential patterns using multiple samples. In: Proceedings of the 2005 SIAM International Conference on Data Mining, Society for Industrial and Applied Mathematics, pp. 415–426(2005) Luo, C., Chung, S. M.: Efficient mining of maximal sequential patterns using multiple samples. In: Proceedings of the 2005 SIAM International Conference on Data Mining, Society for Industrial and Applied Mathematics, pp. 415–426(2005)
23.
Zurück zum Zitat Pei, J.: Mining sequential patterns by pattern-growth: the PrefixSpan approach. IEEE Computer Society. 16(11), 1424–1440 (2004) Pei, J.: Mining sequential patterns by pattern-growth: the PrefixSpan approach. IEEE Computer Society. 16(11), 1424–1440 (2004)
24.
Zurück zum Zitat Pei, J., Han, J., Pinto, H.: PrefixSpan: mining sequential pattern efficiently by prefix-projected pattern growth. In: 17th international conference on data. Engineering. 215–224 (2001) Pei, J., Han, J., Pinto, H.: PrefixSpan: mining sequential pattern efficiently by prefix-projected pattern growth. In: 17th international conference on data. Engineering. 215–224 (2001)
25.
Zurück zum Zitat Pinto, H., Han, J., Pei, J., Wang, K., Chen, Q., Dayal, U.: Multi-dimensional sequential pattern mining. In CIKM Conference, pp. 81–88(2001) Pinto, H., Han, J., Pei, J., Wang, K., Chen, Q., Dayal, U.: Multi-dimensional sequential pattern mining. In CIKM Conference, pp. 81–88(2001)
26.
Zurück zum Zitat Sabrina, P.-N.: Miltiple MapReduce and derivative projected database: new approach for supporting prefixspan scalability. In: 2015 I.E. International Conference on Data and Software Engineering, pp. 148–153. IEEE (2015) Sabrina, P.-N.: Miltiple MapReduce and derivative projected database: new approach for supporting prefixspan scalability. In: 2015 I.E. International Conference on Data and Software Engineering, pp. 148–153. IEEE (2015)
27.
Zurück zum Zitat Shintani, T., Kitsuregawa, M.: Mining algorithms for sequential patterns in parallel: hash based approach. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, Berlin, Heidelberg, pp. 283–294(1998) Shintani, T., Kitsuregawa, M.: Mining algorithms for sequential patterns in parallel: hash based approach. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, Berlin, Heidelberg, pp. 283–294(1998)
29.
Zurück zum Zitat Srikant, R., Agrawal, R.: Mining sequential patterns: generalizations and performance improvements. Advances in Database Technology — EDBT '96. 1057, 1–17 (1996)CrossRef Srikant, R., Agrawal, R.: Mining sequential patterns: generalizations and performance improvements. Advances in Database Technology — EDBT '96. 1057, 1–17 (1996)CrossRef
30.
Zurück zum Zitat Wang, X.: Parallel sequential pattern mining by transcation decompostion. The International Conference on Fuzzy Systems and Knowledge Discovery. 4, 1746–1750 (2010) Wang, X.: Parallel sequential pattern mining by transcation decompostion. The International Conference on Fuzzy Systems and Knowledge Discovery. 4, 1746–1750 (2010)
31.
Zurück zum Zitat Wang, J., Han, J.: Bide:Efficientminingoffrequentclosedsequences. In: 20th International Conference on Data Engineering, pp. 79–90. IEEE (2004) Wang, J., Han, J.: Bide:Efficientminingoffrequentclosedsequences. In: 20th International Conference on Data Engineering, pp. 79–90. IEEE (2004)
32.
Zurück zum Zitat Wang, J., Han, J., Li, C.: Frequent closed sequence mining without candidate maintenance. TKDE. 19(8), 1042–1056 (2007) Wang, J., Han, J., Li, C.: Frequent closed sequence mining without candidate maintenance. TKDE. 19(8), 1042–1056 (2007)
33.
Zurück zum Zitat Wang, T., Zhang, D., Zhou, X., et al.: Mining personal frequent routes via road corner detection. IEEE Trans. Syst. 46(4), 445–458 (2016) Wang, T., Zhang, D., Zhou, X., et al.: Mining personal frequent routes via road corner detection. IEEE Trans. Syst. 46(4), 445–458 (2016)
34.
Zurück zum Zitat Wei, Q.-Y., Liu, D., Duan, S.-L.: Distributed PrefixSpan algorithm based on MapReduce. In: 2012 International Symposium on Information Technology in Medicine and Education, pp. 901–904(2012) Wei, Q.-Y., Liu, D., Duan, S.-L.: Distributed PrefixSpan algorithm based on MapReduce. In: 2012 International Symposium on Information Technology in Medicine and Education, pp. 901–904(2012)
35.
Zurück zum Zitat Wu, C., Lai, C., Lo, Y.: An empirical study on mining sequential patterns in a grid computing environment. Expert Syst. Appl. 39(5), 5748–5757 (2012)CrossRef Wu, C., Lai, C., Lo, Y.: An empirical study on mining sequential patterns in a grid computing environment. Expert Syst. Appl. 39(5), 5748–5757 (2012)CrossRef
36.
Zurück zum Zitat Xin, J., Wang, Z., Chen, C., Ding, L., Wang, G., Zhao, Y.: ELM∗: distributed extreme learning machine with MapReduce. World Wide Web. 17(5), 1189–1204 (2014)CrossRef Xin, J., Wang, Z., Chen, C., Ding, L., Wang, G., Zhao, Y.: ELM∗: distributed extreme learning machine with MapReduce. World Wide Web. 17(5), 1189–1204 (2014)CrossRef
37.
Zurück zum Zitat Xun, Y., Zhang, J., Qin, X.: FiDoop: parallel Mining of Frequent Itemsets Using MapReduce. IEEE Transactions on Systems, Man, and Cybernetics: Systems. 46(3), 313–325 (2016)CrossRef Xun, Y., Zhang, J., Qin, X.: FiDoop: parallel Mining of Frequent Itemsets Using MapReduce. IEEE Transactions on Systems, Man, and Cybernetics: Systems. 46(3), 313–325 (2016)CrossRef
38.
Zurück zum Zitat Yan, X., Han, J., Afshar, R.: Clospan:Mining closed sequential patterns in large datasets. In: SDM Conference, pp. 166–177(2003) Yan, X., Han, J., Afshar, R.: Clospan:Mining closed sequential patterns in large datasets. In: SDM Conference, pp. 166–177(2003)
39.
Zurück zum Zitat Yu, C.-C., Chen, Y.-L.: Mining sequential patterns from multidimensional sequence data. IEEE Trans. Knowl. Data Eng. 17(1), 136–140 (2005)MathSciNetCrossRef Yu, C.-C., Chen, Y.-L.: Mining sequential patterns from multidimensional sequence data. IEEE Trans. Knowl. Data Eng. 17(1), 136–140 (2005)MathSciNetCrossRef
40.
Zurück zum Zitat Yu, D., Wu, W., Zheng, S., Zhu, Z.: BIDE-based ParallelMining of frequent closed sequences with MapReduce. In: Proceedings of the 12th International Conference on Algorithms and Architecturesfor Parallel Processing, pp.177–186(2012) Yu, D., Wu, W., Zheng, S., Zhu, Z.: BIDE-based ParallelMining of frequent closed sequences with MapReduce. In: Proceedings of the 12th International Conference on Algorithms and Architecturesfor Parallel Processing, pp.177–186(2012)
41.
Zurück zum Zitat Yu, X., Liu, J., Ma, C., Li, B.: A MapReduc reinforeced distirbuted sequential pattern mining algorithm. Algorithms and Architectures for Parallel Processing. 9529, 183–197 (2015)CrossRef Yu, X., Liu, J., Ma, C., Li, B.: A MapReduc reinforeced distirbuted sequential pattern mining algorithm. Algorithms and Architectures for Parallel Processing. 9529, 183–197 (2015)CrossRef
42.
Zurück zum Zitat Zaharia, M., et al.: Spark: cluster computing with working sets. HotCloud, pp. 10–10(2010) Zaharia, M., et al.: Spark: cluster computing with working sets. HotCloud, pp. 10–10(2010)
43.
Zurück zum Zitat Zaharia, M., et al: Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, USENIX Association(2012) Zaharia, M., et al: Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, USENIX Association(2012)
44.
Zurück zum Zitat Zaki, M.: SPADE: an efficient algorithm for mining frequent sequences. Mach. Learn. 41(2), 31–60 (2001)CrossRefMATH Zaki, M.: SPADE: an efficient algorithm for mining frequent sequences. Mach. Learn. 41(2), 31–60 (2001)CrossRefMATH
45.
Zurück zum Zitat Zaki, M.J.: Parallel sequence mining on shared-memory machines. J. Parallel Distrib. Comput. 61(3), 401–426 (2001)CrossRefMATH Zaki, M.J.: Parallel sequence mining on shared-memory machines. J. Parallel Distrib. Comput. 61(3), 401–426 (2001)CrossRefMATH
46.
Zurück zum Zitat Zhang, C., Hu, K., Liu, H.: FMGSP: an efficient method of mining global sequential pattern. In: 4th International Conference on Fuzzy Systems and Knowledge Discovery, pp. 761–765(2007) Zhang, C., Hu, K., Liu, H.: FMGSP: an efficient method of mining global sequential pattern. In: 4th International Conference on Fuzzy Systems and Knowledge Discovery, pp. 761–765(2007)
47.
Zurück zum Zitat Zheng, Z., Wei, W., Liu, C., et al.: An effective contrast sequential pattern mining approach to taxpayer behavior analysis. World Wide Web-internet & Web Information Systems. 19(4), 633–651 (2016) Zheng, Z., Wei, W., Liu, C., et al.: An effective contrast sequential pattern mining approach to taxpayer behavior analysis. World Wide Web-internet & Web Information Systems. 19(4), 633–651 (2016)
Metadaten
Titel
Scalable and parallel sequential pattern mining using spark
verfasst von
Xiao Yu
Qing Li
Jin Liu
Publikationsdatum
10.05.2018
Verlag
Springer US
Erschienen in
World Wide Web / Ausgabe 1/2019
Print ISSN: 1386-145X
Elektronische ISSN: 1573-1413
DOI
https://doi.org/10.1007/s11280-018-0566-1

Weitere Artikel der Ausgabe 1/2019

World Wide Web 1/2019 Zur Ausgabe