Skip to main content
Erschienen in: Data Mining and Knowledge Discovery 6/2015

01.11.2015

Learning sequential classifiers from long and noisy discrete-event sequences efficiently

verfasst von: Gessé Dafé, Adriano Veloso, Mohammed Zaki, Wagner Meira Jr.

Erschienen in: Data Mining and Knowledge Discovery | Ausgabe 6/2015

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

A variety of applications, such as information extraction, intrusion detection and protein fold recognition, can be expressed as sequences of discrete events or elements (rather than unordered sets of features), that is, there is an order dependence among the elements composing each data instance. These applications may be modeled as classification problems, and in this case the classifier should exploit sequential interactions among the elements, so that the ordering relationship among them is properly captured. Dominant approaches to this problem include: (i) learning Hidden Markov Models, (ii) exploiting frequent sequences extracted from the data and (iii) computing string kernels. Such approaches, however, are computationally hard and vulnerable to noise, especially if the data shows long range dependencies (i.e., long subsequences are necessary in order to model the data). In this paper we provide simple algorithms that build highly effective sequential classifiers. Our algorithms are based on enumerating approximately contiguous subsequences from the training set on a demand-driven basis, exploiting a lightweight and flexible subsequence matching function and an innovative subsequence enumeration strategy called pattern silhouettes, making our learning algorithms fast and the corresponding classifiers robust to noisy data. Our empirical results on a variety of datasets indicate that the best trade-off between accuracy and learning time is usually obtained by limiting the length of the subsequences by a factor of \(\log {n}\), which leads to a \(O(n\log {n})\) learning cost (where \(n\) is the length of the sequence being classified). Finally, we show that, in most of the cases, our classifiers are faster than existing solutions (sometimes, by orders of magnitude), also providing significant accuracy improvements in most of the evaluated cases.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Fußnoten
1
A similar trend is observed for SC-SC algorithm.
 
2
The reason we include the LAC algorithm (which does not exploit adjacency information) as a baseline, is to evaluate the possible benefits of exploiting adjacency information while producing the classifier.
 
Literatur
Zurück zum Zitat Agrawal R, Srikant R (1995) Mining sequential patterns. In: ICDE, pp 3–14 Agrawal R, Srikant R (1995) Mining sequential patterns. In: ICDE, pp 3–14
Zurück zum Zitat Bannister W (2007) Associative and sequential classification with adaptive constrained regression methods. PhD thesis, Tempe, AZ, USA Bannister W (2007) Associative and sequential classification with adaptive constrained regression methods. PhD thesis, Tempe, AZ, USA
Zurück zum Zitat Baum L, Petrie T (1966) Statistical inference for probabilistic functions of finite state Markov chains. Ann Math Stat 37(6):1554–1563MathSciNetCrossRefMATH Baum L, Petrie T (1966) Statistical inference for probabilistic functions of finite state Markov chains. Ann Math Stat 37(6):1554–1563MathSciNetCrossRefMATH
Zurück zum Zitat Baum L, Petrie T, Soules G, Weiss N (1970) A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann Math Stat 41:164MathSciNetCrossRefMATH Baum L, Petrie T, Soules G, Weiss N (1970) A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann Math Stat 41:164MathSciNetCrossRefMATH
Zurück zum Zitat Bicego M, Murino V, Figueiredo M (2003a) A sequential pruning strategy for the selection of the number of states in hidden Markov models. Pattern Recognit Lett 24(9–10):1395–1407CrossRefMATH Bicego M, Murino V, Figueiredo M (2003a) A sequential pruning strategy for the selection of the number of states in hidden Markov models. Pattern Recognit Lett 24(9–10):1395–1407CrossRefMATH
Zurück zum Zitat Bicego M, Murino V, Figueiredo M (2004) Similarity-based classification of sequences using hidden Markov models. Pattern Recognit 37(12):2281–2291CrossRef Bicego M, Murino V, Figueiredo M (2004) Similarity-based classification of sequences using hidden Markov models. Pattern Recognit 37(12):2281–2291CrossRef
Zurück zum Zitat Bicego M, Murino V, Pelillo M, Torsello A (2006) Similarity-based pattern recognition. Pattern Recognit 39(10):1813–1814CrossRef Bicego M, Murino V, Pelillo M, Torsello A (2006) Similarity-based pattern recognition. Pattern Recognit 39(10):1813–1814CrossRef
Zurück zum Zitat Davis A, Veloso A, da Silva A, Laender A, Meira W Jr (2012) Named entity disambiguation in streaming data. In: ACL, pp 815–824 Davis A, Veloso A, da Silva A, Laender A, Meira W Jr (2012) Named entity disambiguation in streaming data. In: ACL, pp 815–824
Zurück zum Zitat Deshpande M, Karypis G (2004) Selective Markov models for predicting web page accesses. ACM Trans Internet Technol 4(2):163–184CrossRef Deshpande M, Karypis G (2004) Selective Markov models for predicting web page accesses. ACM Trans Internet Technol 4(2):163–184CrossRef
Zurück zum Zitat Durbin R, Eddy AKS, Mitchison G (1998) Biological sequence analysis. Cambridge University Press Durbin R, Eddy AKS, Mitchison G (1998) Biological sequence analysis. Cambridge University Press
Zurück zum Zitat Durbin R, Eddy S, Krogh A, Mitchison G (1998) Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press Durbin R, Eddy S, Krogh A, Mitchison G (1998) Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press
Zurück zum Zitat Du Preez J (1998) Efficient training of high-order hidden Markov models using first-order representations. Comput Speech Lang 12(1):23–39CrossRef Du Preez J (1998) Efficient training of high-order hidden Markov models using first-order representations. Comput Speech Lang 12(1):23–39CrossRef
Zurück zum Zitat Eddy S (1998) Profile hidden Markov models. Bioinformatics 14(9):755–763CrossRef Eddy S (1998) Profile hidden Markov models. Bioinformatics 14(9):755–763CrossRef
Zurück zum Zitat Galassi U, Giordana A, Saitta L (2007) Incremental construction of structured hidden Markov models. In: IJCAI, pp 798–803 Galassi U, Giordana A, Saitta L (2007) Incremental construction of structured hidden Markov models. In: IJCAI, pp 798–803
Zurück zum Zitat Golding A, Roth D (1996) Applying winnow to context-sensitive spelling correction. CoRR Golding A, Roth D (1996) Applying winnow to context-sensitive spelling correction. CoRR
Zurück zum Zitat Han H, Giles C, Zha H, Li C, Tsioutsiouliklis K (2004) Two supervised learning approaches for name disambiguation in author citations. In: JCDL, pp 296–305 Han H, Giles C, Zha H, Li C, Tsioutsiouliklis K (2004) Two supervised learning approaches for name disambiguation in author citations. In: JCDL, pp 296–305
Zurück zum Zitat Han H, Zha H, Giles C (2005) Name disambiguation in author citations using a k-way spectral clustering method. In: JCDL, pp 334–343 Han H, Zha H, Giles C (2005) Name disambiguation in author citations using a k-way spectral clustering method. In: JCDL, pp 334–343
Zurück zum Zitat Haussler D (1999) Convolution kernels on discrete structures. Tech. rep,Technical report, UC Santa Cruz Haussler D (1999) Convolution kernels on discrete structures. Tech. rep,Technical report, UC Santa Cruz
Zurück zum Zitat Hu J, Brown M, Turin W (1996) Hmm based on-line handwriting recognition. IEEE Trans Pattern Anal Mach Intell 18(10):1039–1045CrossRef Hu J, Brown M, Turin W (1996) Hmm based on-line handwriting recognition. IEEE Trans Pattern Anal Mach Intell 18(10):1039–1045CrossRef
Zurück zum Zitat Hughey R, Krogh A (1996) Hidden Markov models for sequence analysis: extension and analysis of the basic method. Comput Appl Biosci 12(2):95–107 Hughey R, Krogh A (1996) Hidden Markov models for sequence analysis: extension and analysis of the basic method. Comput Appl Biosci 12(2):95–107
Zurück zum Zitat Kriouile A, Mari J, Haon J (1990) Some improvements in speech recognition algorithms based on HMM. In: ICASSP, pp 545–548 Kriouile A, Mari J, Haon J (1990) Some improvements in speech recognition algorithms based on HMM. In: ICASSP, pp 545–548
Zurück zum Zitat Kuksa P, Huang PH, Pavlovic V (2008) A fast, large-scale learning method for protein sequence classification. In: 8th international workshop on data mining in bioinformatics, pp 29–37 Kuksa P, Huang PH, Pavlovic V (2008) A fast, large-scale learning method for protein sequence classification. In: 8th international workshop on data mining in bioinformatics, pp 29–37
Zurück zum Zitat Lane T, Brodley C (1999) Temporal sequence learning and data reduction for anomaly detection. ACM Trans Inf Syst Secur 2(3):295–331CrossRef Lane T, Brodley C (1999) Temporal sequence learning and data reduction for anomaly detection. ACM Trans Inf Syst Secur 2(3):295–331CrossRef
Zurück zum Zitat Law H, Chan C (1996) N-th order ergodic multigram hmm for modeling of languages without marked word boundaries. In: COLING, pp 204–209 Law H, Chan C (1996) N-th order ergodic multigram hmm for modeling of languages without marked word boundaries. In: COLING, pp 204–209
Zurück zum Zitat Lesh N, Zaki M, Ogihara M (1999) Mining features for sequence classification. In: KDD, pp 342–346 Lesh N, Zaki M, Ogihara M (1999) Mining features for sequence classification. In: KDD, pp 342–346
Zurück zum Zitat Leslie C, Kuang R (2003) Fast kernels for inexact string matching. In: COLT, pp 114–128 Leslie C, Kuang R (2003) Fast kernels for inexact string matching. In: COLT, pp 114–128
Zurück zum Zitat Leslie C, Kuang R (2004) Fast string kernels using inexact matching for protein sequences. J Mach Learn Res 5:1435–1455MathSciNetMATH Leslie C, Kuang R (2004) Fast string kernels using inexact matching for protein sequences. J Mach Learn Res 5:1435–1455MathSciNetMATH
Zurück zum Zitat Leslie C, Eskin E, Noble WS (2002a) The spectrum kernel: a string kernel for SVM protein classification. Pac Symp Biocomput 7:566–575 Leslie C, Eskin E, Noble WS (2002a) The spectrum kernel: a string kernel for SVM protein classification. Pac Symp Biocomput 7:566–575
Zurück zum Zitat Leslie C, Eskin E, Weston J, Noble W (2002b) Mismatch string kernels for SVM protein classification. In: NIPS, pp 1417–1424 Leslie C, Eskin E, Weston J, Noble W (2002b) Mismatch string kernels for SVM protein classification. In: NIPS, pp 1417–1424
Zurück zum Zitat Leslie C, Eskin E, Cohen A, Weston J, Noble W (2004) Mismatch string kernels for discriminative protein classification. Bioinformatics 20(4):467–476CrossRef Leslie C, Eskin E, Cohen A, Weston J, Noble W (2004) Mismatch string kernels for discriminative protein classification. Bioinformatics 20(4):467–476CrossRef
Zurück zum Zitat Lin M, Hsueh S, Chen M, Hsu H (2009) Mining sequential patterns for image classification in ubiquitous multimedia systems. In: IIH-MSP, pp 303–306 Lin M, Hsueh S, Chen M, Hsu H (2009) Mining sequential patterns for image classification in ubiquitous multimedia systems. In: IIH-MSP, pp 303–306
Zurück zum Zitat Lodhi H, Saunders C, Shawe-Taylor J, Cristianini N, Watkins C (2002) Text classification using string kernels. J Mach Learn Res 2:419–444MATH Lodhi H, Saunders C, Shawe-Taylor J, Cristianini N, Watkins C (2002) Text classification using string kernels. J Mach Learn Res 2:419–444MATH
Zurück zum Zitat Lodhi H, Muggleton S, Sternberg M (2009) Multi-class protein fold recognition using large margin logic based divide and conquer learning. SIGKDD Explor 11(2):117–122CrossRef Lodhi H, Muggleton S, Sternberg M (2009) Multi-class protein fold recognition using large margin logic based divide and conquer learning. SIGKDD Explor 11(2):117–122CrossRef
Zurück zum Zitat Malik H, Kender J (2008) Classifying high-dimensional text and web data using very short patterns. In: ICDM, pp 923–928 Malik H, Kender J (2008) Classifying high-dimensional text and web data using very short patterns. In: ICDM, pp 923–928
Zurück zum Zitat Müller S, Eickeler S, Rigoll G (2000) Crane gesture recognition using pseudo 3-d hidden Markov models. In: FG (Conf. on Automatic Face and Gesture Recognition), pp 398–402 Müller S, Eickeler S, Rigoll G (2000) Crane gesture recognition using pseudo 3-d hidden Markov models. In: FG (Conf. on Automatic Face and Gesture Recognition), pp 398–402
Zurück zum Zitat Murzin A, Brenner S, Hubbard T, Chothia C (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 247(4):536–540 Murzin A, Brenner S, Hubbard T, Chothia C (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 247(4):536–540
Zurück zum Zitat Pitkow J, Pirolli P (1999) Mining longest repeating subsequences to predict world wide web surfing. In: USENIX symposium on Internet technologies and systems Pitkow J, Pirolli P (1999) Mining longest repeating subsequences to predict world wide web surfing. In: USENIX symposium on Internet technologies and systems
Zurück zum Zitat Rabiner L (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE 77(2):257–286CrossRef Rabiner L (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE 77(2):257–286CrossRef
Zurück zum Zitat Rieck K, Laskov P (2008) Linear-time computation of similarity measures for sequential data. J Mach Learn Res 9:23–48MATH Rieck K, Laskov P (2008) Linear-time computation of similarity measures for sequential data. J Mach Learn Res 9:23–48MATH
Zurück zum Zitat Rousu J, Shawe-Taylor J (2005) Efficient computation of gapped substring kernels on large alphabets. J Mach Learn Res 6(1):1323–1344MathSciNetMATH Rousu J, Shawe-Taylor J (2005) Efficient computation of gapped substring kernels on large alphabets. J Mach Learn Res 6(1):1323–1344MathSciNetMATH
Zurück zum Zitat Saul L, Jordan M (1999) Mixed memory Markov models: decomposing complex stochastic processes as mixtures of simpler ones. Mach Learn 37(1):75–87CrossRefMATH Saul L, Jordan M (1999) Mixed memory Markov models: decomposing complex stochastic processes as mixtures of simpler ones. Mach Learn 37(1):75–87CrossRefMATH
Zurück zum Zitat Schwardt L, Preez JD (2000) Efficient mixed-order hidden Markov model inference. In: ICSLP, pp 238–241 Schwardt L, Preez JD (2000) Efficient mixed-order hidden Markov model inference. In: ICSLP, pp 238–241
Zurück zum Zitat Sha F, Saul L (2006) Large margin hidden Markov models for automatic speech recognition. In: NIPS, pp 1249–1256 Sha F, Saul L (2006) Large margin hidden Markov models for automatic speech recognition. In: NIPS, pp 1249–1256
Zurück zum Zitat Silva I, Gomide J, Veloso A, Meira Jr W, Ferreira R (2011) Effective sentiment stream analysis with self-augmenting training and demand-driven projection. In: SIGIR, pp 475–484 Silva I, Gomide J, Veloso A, Meira Jr W, Ferreira R (2011) Effective sentiment stream analysis with self-augmenting training and demand-driven projection. In: SIGIR, pp 475–484
Zurück zum Zitat Srikant R, Agrawal R (1996) Mining sequential patterns: generalizations and performance improvements. In: EDBT, pp 3–17 Srikant R, Agrawal R (1996) Mining sequential patterns: generalizations and performance improvements. In: EDBT, pp 3–17
Zurück zum Zitat Srivatsan L, Sastry P, Unnikrishnan K (2005) Discovering frequent episodes and learning hidden Markov models: a formal connection. IEEE TKDE 17:1505–1517 Srivatsan L, Sastry P, Unnikrishnan K (2005) Discovering frequent episodes and learning hidden Markov models: a formal connection. IEEE TKDE 17:1505–1517
Zurück zum Zitat Syed Z, Indyk P, Guttag J (2009) Learning approximate sequential patterns for classification. J Mach Learn Res 10:1913–1936MathSciNetMATH Syed Z, Indyk P, Guttag J (2009) Learning approximate sequential patterns for classification. J Mach Learn Res 10:1913–1936MathSciNetMATH
Zurück zum Zitat Szymanski B (2004) Recursive data mining for masquerade detection and author identification. Workshop on Information Assurance, pp 424–431 Szymanski B (2004) Recursive data mining for masquerade detection and author identification. Workshop on Information Assurance, pp 424–431
Zurück zum Zitat Tseng V, Lee C (2005) CBS: a new classification method by using sequential patterns. In: SDM Tseng V, Lee C (2005) CBS: a new classification method by using sequential patterns. In: SDM
Zurück zum Zitat Vapnik V (1979) Estimation of dependences based on empirical data (in Russian). Nauka Vapnik V (1979) Estimation of dependences based on empirical data (in Russian). Nauka
Zurück zum Zitat Veloso A, Meira W Jr (2011) Demand-driven associative classification. Springer Veloso A, Meira W Jr (2011) Demand-driven associative classification. Springer
Zurück zum Zitat Wang Y, Zhou L, Feng J, Wang J, Liu Z (2006) Mining complex time-series data by learning Markovian models. In: ICDM, pp 1136–1140 Wang Y, Zhou L, Feng J, Wang J, Liu Z (2006) Mining complex time-series data by learning Markovian models. In: ICDM, pp 1136–1140
Zurück zum Zitat Watkins C (1999) Dynamic alignment kernels. Advances in neural information processing systems, pp 39–50 Watkins C (1999) Dynamic alignment kernels. Advances in neural information processing systems, pp 39–50
Zurück zum Zitat Ye L, Keogh E (2009) Time series shapelets: a new primitive for data mining. In: KDD, pp 947–956 Ye L, Keogh E (2009) Time series shapelets: a new primitive for data mining. In: KDD, pp 947–956
Zurück zum Zitat Ye L, Keogh E (2011) Time series shapelets: a novel technique that allows accurate, interpretable and fast classification. Data Min Knowl Discov 22(1–2):149–182MathSciNetCrossRefMATH Ye L, Keogh E (2011) Time series shapelets: a novel technique that allows accurate, interpretable and fast classification. Data Min Knowl Discov 22(1–2):149–182MathSciNetCrossRefMATH
Zurück zum Zitat Zaki M (2000) Sequence mining in categorical domains: Incorporating constraints. In: CIKM, pp 422–429 Zaki M (2000) Sequence mining in categorical domains: Incorporating constraints. In: CIKM, pp 422–429
Zurück zum Zitat Zaki M, Carothers C, Szymanski B (2010) Vogue: a variable order hidden Markov model with duration based on frequent sequence mining. TKDD 4(1):1–31 Zaki M, Carothers C, Szymanski B (2010) Vogue: a variable order hidden Markov model with duration based on frequent sequence mining. TKDD 4(1):1–31
Metadaten
Titel
Learning sequential classifiers from long and noisy discrete-event sequences efficiently
verfasst von
Gessé Dafé
Adriano Veloso
Mohammed Zaki
Wagner Meira Jr.
Publikationsdatum
01.11.2015
Verlag
Springer US
Erschienen in
Data Mining and Knowledge Discovery / Ausgabe 6/2015
Print ISSN: 1384-5810
Elektronische ISSN: 1573-756X
DOI
https://doi.org/10.1007/s10618-014-0391-9

Weitere Artikel der Ausgabe 6/2015

Data Mining and Knowledge Discovery 6/2015 Zur Ausgabe

Premium Partner