2017 | OriginalPaper | Book chapter

MiSeRe-Hadoop: A Large-Scale Robust Sequential Classification Rules Mining Framework

Written by: Elias Egho, Dominique Gay, Romain Trinquart, Marc Boullé, Nicolas Voisine, Fabrice Clérot

Published in: Big Data Analytics and Knowledge Discovery

Publisher: Springer International Publishing

Abstract

Sequence classification has become a fundamental problem in data mining and machine learning. Feature-based classification is one of the most widely used techniques for sequence classification, and mining sequential classification rules plays an important role in it. Despite the abundant literature in this area, mining sequential classification rules remains a challenge: few of the available methods scale to large datasets. MapReduce is an ideal framework for distributed computing over large datasets on clusters of computers. In this paper, we propose a distributed version of the MiSeRe algorithm on MapReduce, called MiSeRe-Hadoop. MiSeRe-Hadoop retains the valuable properties of MiSeRe: (i) it is a robust, user-parameter-free, anytime algorithm, and (ii) it employs an instance-based randomized strategy to promote diversity in the mined rules. We applied our method to two large real-world datasets: a marketing dataset and a text dataset. Our results confirm that the method scales to large sequential datasets.
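
To make the two ideas in the abstract concrete, the following is a minimal in-memory sketch, not the authors' implementation and not Hadoop code. It assumes sequences are plain tuples of items, draws candidate patterns as random subsequences of individual instances (one possible reading of "instance-based randomized strategy"), and counts per-class support of each candidate in map/reduce style. All function names (`is_subsequence`, `sample_candidate`, `map_phase`, `reduce_phase`, `run_job`) and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def is_subsequence(pattern, sequence):
    """Return True if the items of pattern occur in sequence in order (gaps allowed)."""
    it = iter(sequence)
    # "item in it" advances the iterator, so order is enforced.
    return all(item in it for item in pattern)


def sample_candidate(sequence, rng):
    """Assumed sampling scheme: keep a random non-empty subset of positions of one instance."""
    k = rng.randint(1, len(sequence))
    positions = sorted(rng.sample(range(len(sequence)), k))
    return tuple(sequence[i] for i in positions)


def map_phase(record, candidates):
    """Emit ((candidate, class_label), 1) for every candidate contained in the record."""
    sequence, label = record
    for cand in candidates:
        if is_subsequence(cand, sequence):
            yield (cand, label), 1


def reduce_phase(key, values):
    """Sum the partial counts for one (candidate, class_label) key."""
    return key, sum(values)


def run_job(records, candidates):
    """Simulate map, shuffle (group by key), and reduce in a single process."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_phase(record, candidates):
            groups[key].append(value)
    return dict(reduce_phase(key, values) for key, values in groups.items())


# Toy labeled sequences (illustrative only).
records = [
    (("a", "b", "c"), "pos"),
    (("a", "c"), "pos"),
    (("b", "a"), "neg"),
    (("c", "b", "a"), "neg"),
]

rng = random.Random(0)
candidates = {sample_candidate(seq, rng) for seq, _ in records}
candidates.add(("a", "c"))  # one fixed candidate so the result is easy to inspect

counts = run_job(records, candidates)
```

In this toy run, the candidate `("a", "c")` is contained in both `pos` sequences and in neither `neg` sequence, so `counts` holds a support of 2 for the `pos` class and no entry for `neg`. In an actual MapReduce deployment the map and reduce functions would run on different machines and the grouping step would be handled by the framework's shuffle.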

Footnotes
1
This file keeps a copy of all the candidate sequences generated by the Generating Candidates job in each iteration.

2
Orange Livebox is an ADSL wireless router available to customers of Orange's broadband services in several countries.

Metadata
Title
MiSeRe-Hadoop: A Large-Scale Robust Sequential Classification Rules Mining Framework
Written by
Elias Egho
Dominique Gay
Romain Trinquart
Marc Boullé
Nicolas Voisine
Fabrice Clérot
Copyright year
2017
DOI
https://doi.org/10.1007/978-3-319-64283-3_8