Skip to main content
Top
Published in: International Journal of Machine Learning and Cybernetics 4/2019

15-11-2017 | Original Article

Two-stage pruning method for gram-based categorical sequence clustering

Authors: Liang Yuan, Wenjian Wang, Lifei Chen

Published in: International Journal of Machine Learning and Cybernetics | Issue 4/2019

Log in

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

Gram-based vector space model has been extensively applied to categorical sequence clustering. However, there is a general lack of an efficient method to determine the length of grams and to identify redundant and non-significant grams involved in the model. In this paper, a variable-length gram model is proposed, different from previous studies mainly focused on the fixed-length grams of sequences. The variable-length grams are obtained using a two-stage pruning method aimed at selecting the irredundant and significant subsequences from the prefix trees, created from the fixed-length grams with an initially large length. A robust partitioning algorithm is then defined for categorical sequence clustering on the normalized representation model using variable-length grams collected from the pruned trees. Experimental results on real-world sequence sets from various domains are given to demonstrate the performance of the proposed methods.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Show more products
Literature
1.
go back to reference Xing Z, Pei J, Keogh E (2010) A brief survey on sequence classification. ACM SIGKDD Explor 12(1):40–48CrossRef Xing Z, Pei J, Keogh E (2010) A brief survey on sequence classification. ACM SIGKDD Explor 12(1):40–48CrossRef
2.
go back to reference Kelil A, Wang S (2008) SCS: a new similarity measure for categorical sequences. In: Proceedings of the IEEE ICDM, pp 343–352 Kelil A, Wang S (2008) SCS: a new similarity measure for categorical sequences. In: Proceedings of the IEEE ICDM, pp 343–352
3.
go back to reference Aggarwal CC (2015) Data mining: the textbook. Springer, BerlinMATH Aggarwal CC (2015) Data mining: the textbook. Springer, BerlinMATH
5.
go back to reference Cao F, Yu L, Huang J, Liang J (2017) K-mw-modes: an algorithm for clustering categorical matrix-object data. Appl Soft Comput 57:605–614CrossRef Cao F, Yu L, Huang J, Liang J (2017) K-mw-modes: an algorithm for clustering categorical matrix-object data. Appl Soft Comput 57:605–614CrossRef
7.
go back to reference Chen L (2014) EM-type method for measuring graph dissimilarity. Int J Mach Learn Cybern 5:625–633CrossRef Chen L (2014) EM-type method for measuring graph dissimilarity. Int J Mach Learn Cybern 5:625–633CrossRef
8.
go back to reference Herranz J, Nin J, Sol\(\acute{e}\) M (2011) Optimal symbol alignment distance: a new distance for sequences of symbols. IEEE Trans Knowl Data Eng 23:1541–1554 Herranz J, Nin J, Sol\(\acute{e}\) M (2011) Optimal symbol alignment distance: a new distance for sequences of symbols. IEEE Trans Knowl Data Eng 23:1541–1554
9.
go back to reference Song K, Ren J, Reinert G, Deng M, Waterman MS, Sun F (2014) New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing. Brief Bioinf 15(3):343–353CrossRef Song K, Ren J, Reinert G, Deng M, Waterman MS, Sun F (2014) New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing. Brief Bioinf 15(3):343–353CrossRef
10.
go back to reference Wei D, Jiang Q, Wei Y, Wang S (2012) A novel hierarchical clustering algorithm for gene sequences. BMC Bioinf 13:174CrossRef Wei D, Jiang Q, Wei Y, Wang S (2012) A novel hierarchical clustering algorithm for gene sequences. BMC Bioinf 13:174CrossRef
11.
go back to reference Yang J, Wang W (2003) CLUSEQ: efficient and effective sequence clustering. In: Proceedings of the IEEE ICDE, pp 101–112 Yang J, Wang W (2003) CLUSEQ: efficient and effective sequence clustering. In: Proceedings of the IEEE ICDE, pp 101–112
12.
go back to reference Xiong T, Wang S, Jiang Q, Huang JZ (2014) A novel variable-order Markov model for clustering categorical sequences. IEEE Trans Knowl Data Eng 26:2339–2353CrossRef Xiong T, Wang S, Jiang Q, Huang JZ (2014) A novel variable-order Markov model for clustering categorical sequences. IEEE Trans Knowl Data Eng 26:2339–2353CrossRef
13.
go back to reference Sbakan YC, Kurt B, Cemgil AT, Sankurc B (2014) Probabilistic sequence clustering with spectral learning. Digital Signal Process 29:1–19MathSciNetCrossRef Sbakan YC, Kurt B, Cemgil AT, Sankurc B (2014) Probabilistic sequence clustering with spectral learning. Digital Signal Process 29:1–19MathSciNetCrossRef
14.
go back to reference Fink GA (2008) Markov models for pattern recognition: from theory to applications. Springer, New York, Berlin HeidelbergMATH Fink GA (2008) Markov models for pattern recognition: from theory to applications. Springer, New York, Berlin HeidelbergMATH
15.
go back to reference Namiki Y, Ishida T, Akiyama Y (2013) Acceleration of sequence clustering using longest common subsequence filtering. BMC Bioinf 14(Suppl 8):S7CrossRef Namiki Y, Ishida T, Akiyama Y (2013) Acceleration of sequence clustering using longest common subsequence filtering. BMC Bioinf 14(Suppl 8):S7CrossRef
16.
go back to reference Basu T, Murthy CA (2016) A supervised term selection technique for effective text categorization. Int J Mach Learn Cybern 7(5):877–892CrossRef Basu T, Murthy CA (2016) A supervised term selection technique for effective text categorization. Int J Mach Learn Cybern 7(5):877–892CrossRef
17.
go back to reference Domeniconi C, Gunopulos S, Ma S, Yan B, Razgan MA, Papadopoulos D (2007) Locally adaptive metrics for clustering high dimensional data. Data Min Knowl Disc 14(1):63–97MathSciNetCrossRef Domeniconi C, Gunopulos S, Ma S, Yan B, Razgan MA, Papadopoulos D (2007) Locally adaptive metrics for clustering high dimensional data. Data Min Knowl Disc 14(1):63–97MathSciNetCrossRef
18.
go back to reference Yuan L, Hong Z, Chen L, Cai Q (2016) Clustering categorical sequences with variable-length tuples representation, In: Proceedings of the KSEM, pp 15–27 Yuan L, Hong Z, Chen L, Cai Q (2016) Clustering categorical sequences with variable-length tuples representation, In: Proceedings of the KSEM, pp 15–27
19.
go back to reference Bezdek JC (1998) Pattern recognition in handbook of fuzzy computation. IOP Publishing Ltd, Bristol Bezdek JC (1998) Pattern recognition in handbook of fuzzy computation. IOP Publishing Ltd, Bristol
20.
go back to reference Wu D, Ren J (2017) Sequence clustering algorithm based on weighted vector identification. Int J Mach Learn Cybern 8(3):731–738CrossRef Wu D, Ren J (2017) Sequence clustering algorithm based on weighted vector identification. Int J Mach Learn Cybern 8(3):731–738CrossRef
21.
go back to reference Loiselle S, Rouat J, Pressnitzer D, Thorpe S (2005) Exploration of rank order coding with spiking neural networks for speech recognition. Proc IEEE IJCNN 4:2076–2080 Loiselle S, Rouat J, Pressnitzer D, Thorpe S (2005) Exploration of rank order coding with spiking neural networks for speech recognition. Proc IEEE IJCNN 4:2076–2080
Metadata
Title
Two-stage pruning method for gram-based categorical sequence clustering
Authors
Liang Yuan
Wenjian Wang
Lifei Chen
Publication date
15-11-2017
Publisher
Springer Berlin Heidelberg
Published in
International Journal of Machine Learning and Cybernetics / Issue 4/2019
Print ISSN: 1868-8071
Electronic ISSN: 1868-808X
DOI
https://doi.org/10.1007/s13042-017-0744-y

Other articles of this Issue 4/2019

International Journal of Machine Learning and Cybernetics 4/2019 Go to the issue