Skip to main content
Erschienen in: Pattern Analysis and Applications 4/2018

28.02.2017 | Theoretical Advances

Co-occurrence pattern mining based on a biological approximation scoring matrix

Erschienen in: Pattern Analysis and Applications | Ausgabe 4/2018

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Mining co-occurrence frequency patterns from multiple sequences is a hot topic in bioinformatics. Many seemingly disorganized constituents repetitively appear under different biological matrices, such as PAM250 and BLOSUM62, which are considered hidden frequent patterns (FPs). A hidden FP with both gap and flexible approximation operations (replacement, deletion or insertion) deepens the difficulty in discovering its true occurrences. To effectively discover co-occurrence FPs (Co-FPs) under these conditions, we design a mining algorithm (co-fp-miner) using the following steps: (1) a biological approximation scoring matrix is designed to discover various deformations of a single FP pattern; (2) a data-driven intersection tactic is used to generate candidate Co-FPs; (3) a deterministic Apriori-like rule is proposed to prune unnecessary Co-FPs; and (4) finally, we employ a backtracking matching scheme to validate true Co-FPs. The co-fp-miner algorithm is an unified framework for both exact and approximate mining on multiple sequences. Experiments on DNA and protein sequences demonstrate that co-fp-miner is more efficient on solutions, time and memory consumption than that of other peers.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Anhänge
Nur mit Berechtigung zugänglich
Fußnoten
1
The paper obeys the deletion rule that while deleting a sub-pattern of pattern P, its previous gap usually has no meaning for constraint preservation and we delete the gap with no operation cost. But if the first sub-pattern disappears, its following gap becomes no meaning. We discard the gap with no operation cost. Deleting a sub-pattern is taken as an approximate cost.
 
Literatur
1.
Zurück zum Zitat Han J, Cheng H, Xin D, Yan X (2007) Frequent pattern mining: current status and future directions. Data Min Knowl Discov 15:55–86MathSciNetCrossRef Han J, Cheng H, Xin D, Yan X (2007) Frequent pattern mining: current status and future directions. Data Min Knowl Discov 15:55–86MathSciNetCrossRef
2.
Zurück zum Zitat Chen G, Wu XD, Zhu XQ, Arslan AN, He Y (2006) Efficient string matching with wildcards and length constraints. Knowl Inf Syst 10:399–419CrossRef Chen G, Wu XD, Zhu XQ, Arslan AN, He Y (2006) Efficient string matching with wildcards and length constraints. Knowl Inf Syst 10:399–419CrossRef
3.
Zurück zum Zitat Ding B, Lo D, Han J, Khoo S (2005) Efficient mining of closed repetitive gapped subsequences from a sequence database. In: IEEE 25th international conference on data engineering, pp 1024–1035 Ding B, Lo D, Han J, Khoo S (2005) Efficient mining of closed repetitive gapped subsequences from a sequence database. In: IEEE 25th international conference on data engineering, pp 1024–1035
4.
Zurück zum Zitat Xie F, Wu XD, Hu XG, Gao J, Guo D, Fei Y, Hua E (2010) Sequential pattern mining with wildcards. In: 22nd IEEE international conference on tools with artificial intelligence, pp 241–247 Xie F, Wu XD, Hu XG, Gao J, Guo D, Fei Y, Hua E (2010) Sequential pattern mining with wildcards. In: 22nd IEEE international conference on tools with artificial intelligence, pp 241–247
5.
Zurück zum Zitat Yang QX, Yuan SS, Zhao L et al (2003) Faster algorithm of string comparison. Pattern Anal Appl 6(2):122–133MathSciNetCrossRef Yang QX, Yuan SS, Zhao L et al (2003) Faster algorithm of string comparison. Pattern Anal Appl 6(2):122–133MathSciNetCrossRef
6.
Zurück zum Zitat Chen YC, Weng JTY, Hui LA (2016) A novel algorithm for mining closed temporal patterns from interval-based data[J]. Knowl Inf Syst 46(1):151–183CrossRef Chen YC, Weng JTY, Hui LA (2016) A novel algorithm for mining closed temporal patterns from interval-based data[J]. Knowl Inf Syst 46(1):151–183CrossRef
7.
Zurück zum Zitat Silva A, Antunes C (2016) Constrained pattern mining in the new era[J]. Knowl Inf Syst 47(3):489–516CrossRef Silva A, Antunes C (2016) Constrained pattern mining in the new era[J]. Knowl Inf Syst 47(3):489–516CrossRef
8.
Zurück zum Zitat Oates T, Cohen PR (1996) Searching for structure in multiple streams of data. In: Proceeding of 13th international conference on machine learning, pp 346–354 Oates T, Cohen PR (1996) Searching for structure in multiple streams of data. In: Proceeding of 13th international conference on machine learning, pp 346–354
9.
Zurück zum Zitat Notredame C (2002) Recent progress in multiple sequence alignment: a survey. Pharmacogenomics 3(1):131–144CrossRef Notredame C (2002) Recent progress in multiple sequence alignment: a survey. Pharmacogenomics 3(1):131–144CrossRef
10.
Zurück zum Zitat Mathkour H, Ahmad M (2009) A pattern matching technique for multiple sequences alignment with GAP consideration. In: International conference on signal acquisition and processing, pp 123–127 Mathkour H, Ahmad M (2009) A pattern matching technique for multiple sequences alignment with GAP consideration. In: International conference on signal acquisition and processing, pp 123–127
11.
Zurück zum Zitat Yao D, Jiang M, You X et al (2015) An algorithm of multiple sequence alignment based on consensus sequence searched by simulated annealing and star alignment. In: International symposium on bioelectronics and bioinformatics, pp 3–6 Yao D, Jiang M, You X et al (2015) An algorithm of multiple sequence alignment based on consensus sequence searched by simulated annealing and star alignment. In: International symposium on bioelectronics and bioinformatics, pp 3–6
12.
Zurück zum Zitat Ni B, Wong MH, Lam CFD et al (2014) Applying Agrep to r-NSA to solve multiple sequences approximate matching. Int J Data Min Bioinform 9(4):358–385CrossRef Ni B, Wong MH, Lam CFD et al (2014) Applying Agrep to r-NSA to solve multiple sequences approximate matching. Int J Data Min Bioinform 9(4):358–385CrossRef
13.
Zurück zum Zitat Kouzinopoulos CS, Michailidis PD, Margaritis KG (2011) Experimental results on multiple pattern matching algorithms for biological sequences. Bioinformatics 274–277 Kouzinopoulos CS, Michailidis PD, Margaritis KG (2011) Experimental results on multiple pattern matching algorithms for biological sequences. Bioinformatics 274–277
14.
Zurück zum Zitat Li Y, Patel JM, Terrell A (2012) WHAM: a high-throughput sequence alignment method. ACM Trans Database Syst 37(4):28 Li Y, Patel JM, Terrell A (2012) WHAM: a high-throughput sequence alignment method. ACM Trans Database Syst 37(4):28
15.
Zurück zum Zitat Besharati A et al (2014) Multiple sequence alignment using biological features classification. In: International congress on technology, communication and knowledge, pp 1–5 Besharati A et al (2014) Multiple sequence alignment using biological features classification. In: International congress on technology, communication and knowledge, pp 1–5
16.
Zurück zum Zitat Zhan Q, Ye Y, Lam TW et al (2015) Improving multiple sequence alignment by using better guide trees. BMC Bioinform 16(5):1 Zhan Q, Ye Y, Lam TW et al (2015) Improving multiple sequence alignment by using better guide trees. BMC Bioinform 16(5):1
17.
Zurück zum Zitat Altschul SF, Gish W, Miller W et al (1990) Basic local alignment search tool. J Mol Biol 215(3):403–410CrossRef Altschul SF, Gish W, Miller W et al (1990) Basic local alignment search tool. J Mol Biol 215(3):403–410CrossRef
18.
Zurück zum Zitat Han J, Pei J, Mortazavi-Asl B, Chen Q, Dayal U, Hsu MC (2000) Freespan: frequent pattern-projected sequential pattern mining. In: Proceedings of the sixth ACM SIGKDD international conference on knowledge discovery and data mining, pp 355–359 Han J, Pei J, Mortazavi-Asl B, Chen Q, Dayal U, Hsu MC (2000) Freespan: frequent pattern-projected sequential pattern mining. In: Proceedings of the sixth ACM SIGKDD international conference on knowledge discovery and data mining, pp 355–359
19.
Zurück zum Zitat He D, Zhu XQ, Wu XD (2011) Mining approximate repeating patterns from sequence data with gap constraints. Comput Intell 27(3):336–362MathSciNetCrossRef He D, Zhu XQ, Wu XD (2011) Mining approximate repeating patterns from sequence data with gap constraints. Comput Intell 27(3):336–362MathSciNetCrossRef
20.
Zurück zum Zitat Boeva V, Regnier M, Papatsenko D et al (2006) Short fuzzy tandem repeats in genomic sequences, identification, and possible role in regulation of gene expression. Bioinformatics 22(6):676–684CrossRef Boeva V, Regnier M, Papatsenko D et al (2006) Short fuzzy tandem repeats in genomic sequences, identification, and possible role in regulation of gene expression. Bioinformatics 22(6):676–684CrossRef
21.
Zurück zum Zitat Navarro G, Raffinot M (2002) Flexible pattern matching in strings practical on-line search algorithms for texts and Biological Sequences. Cambridge University Press, CambridgeCrossRef Navarro G, Raffinot M (2002) Flexible pattern matching in strings practical on-line search algorithms for texts and Biological Sequences. Cambridge University Press, CambridgeCrossRef
22.
Zurück zum Zitat Zhang M, Kao B, Cheung DW et al (2007) Mining periodic patterns with gap requirement from sequences. ACM Trans Knowl Discov Data 1(2):7CrossRef Zhang M, Kao B, Cheung DW et al (2007) Mining periodic patterns with gap requirement from sequences. ACM Trans Knowl Discov Data 1(2):7CrossRef
23.
Zurück zum Zitat Bille P, Gortz I, Vildhoj H, Wind D (2012) String matching with variable length gaps. Theor Comput Sci 443:25–34MathSciNetCrossRef Bille P, Gortz I, Vildhoj H, Wind D (2012) String matching with variable length gaps. Theor Comput Sci 443:25–34MathSciNetCrossRef
24.
Zurück zum Zitat Zhang JY, Yang CH (2013) Pattern matching with wildcard gaps based on cross list. In: Proceedings of 6th international symposium on computational intelligence and design, pp 154–156 Zhang JY, Yang CH (2013) Pattern matching with wildcard gaps based on cross list. In: Proceedings of 6th international symposium on computational intelligence and design, pp 154–156
25.
Zurück zum Zitat Pasquier C, Sanhes J, Flouvat F et al. (2016) Frequent pattern mining in attributed trees: algorithms and applications[J]. Knowl Inf Syst 46(3):491–514CrossRef Pasquier C, Sanhes J, Flouvat F et al. (2016) Frequent pattern mining in attributed trees: algorithms and applications[J]. Knowl Inf Syst 46(3):491–514CrossRef
26.
Zurück zum Zitat Wang JZ, Huang JL, Chen YC (2016) On efficiently mining high utility sequential patterns[J]. Knowl Inf Syst 49(2):597–627CrossRef Wang JZ, Huang JL, Chen YC (2016) On efficiently mining high utility sequential patterns[J]. Knowl Inf Syst 49(2):597–627CrossRef
27.
Zurück zum Zitat Gouda K, Zaki M (2001) Efficiently mining maximal frequent itemsets. ICDM. In: Proceedings IEEE international conference on IEEE, pp 163–170 Gouda K, Zaki M (2001) Efficiently mining maximal frequent itemsets. ICDM. In: Proceedings IEEE international conference on IEEE, pp 163–170
28.
Zurück zum Zitat Hong XL, Wu XD, Hu XG, Liu YL, Gao J, Wu GQ (2009) BPBM: an algorithm for string matching with wildcards and length constraints. In: International conference on rough sets. Fuzzy sets, data mining and granular computing, pp 518–525 Hong XL, Wu XD, Hu XG, Liu YL, Gao J, Wu GQ (2009) BPBM: an algorithm for string matching with wildcards and length constraints. In: International conference on rough sets. Fuzzy sets, data mining and granular computing, pp 518–525
29.
Zurück zum Zitat Hu H, Wang H, Li J et al. (2016) An efficient pruning strategy for approximate string matching over suffix tree[J]. Knowl Inf Syst 49(1):121–141CrossRef Hu H, Wang H, Li J et al. (2016) An efficient pruning strategy for approximate string matching over suffix tree[J]. Knowl Inf Syst 49(1):121–141CrossRef
30.
Zurück zum Zitat Kum HC, Pei J, Wang W et al (2003) ApproxMAP: approximate mining of consensus sequential patterns. In: Proceedings of the 2003 SIAM international conference on data mining. Society for industrial and applied mathematics, pp 311–315 Kum HC, Pei J, Wang W et al (2003) ApproxMAP: approximate mining of consensus sequential patterns. In: Proceedings of the 2003 SIAM international conference on data mining. Society for industrial and applied mathematics, pp 311–315
31.
Zurück zum Zitat Chen C, Yan X, Zhu F et al (2007) gapprox: mining frequent approximate patterns from a massive network. In: Seventh IEEE international conference on data mining. IEEE, pp 445–450 Chen C, Yan X, Zhu F et al (2007) gapprox: mining frequent approximate patterns from a massive network. In: Seventh IEEE international conference on data mining. IEEE, pp 445–450
32.
Zurück zum Zitat Manber U, Baeza-Yates R (1991) An algorithm for string matching with a sequence of don’t cares. Inf Process Lett 37(3):133–136MathSciNetCrossRef Manber U, Baeza-Yates R (1991) An algorithm for string matching with a sequence of don’t cares. Inf Process Lett 37(3):133–136MathSciNetCrossRef
33.
Zurück zum Zitat Huang CW, Lee WS, Hsieh SY (2011) An improved heuristic algorithm for finding motif signals in dna sequences. IEEE/ACM Trans Comput Biol Bioinform 8(4):959–975CrossRef Huang CW, Lee WS, Hsieh SY (2011) An improved heuristic algorithm for finding motif signals in dna sequences. IEEE/ACM Trans Comput Biol Bioinform 8(4):959–975CrossRef
34.
Zurück zum Zitat Machanick P, Bailey TL (2011) Meme-chip: motif analysis of large DNA datasets. Bioinformatics 27(12):1696–1697CrossRef Machanick P, Bailey TL (2011) Meme-chip: motif analysis of large DNA datasets. Bioinformatics 27(12):1696–1697CrossRef
35.
Zurück zum Zitat Felicioli C, Marangoni R (2012) Bpmatch: an efficient algorithm for a segmental analysis of genomic sequences. IEEE/ACM Trans Comput Biol Bioinform 9(4):1120–1127CrossRef Felicioli C, Marangoni R (2012) Bpmatch: an efficient algorithm for a segmental analysis of genomic sequences. IEEE/ACM Trans Comput Biol Bioinform 9(4):1120–1127CrossRef
36.
Zurück zum Zitat Wong AK, Lee ESA (2014) Aligning and clustering patterns to reveal the protein functionality of sequences. IEEE/ACM Trans Comput Biol Bioinform 11(3):548–560CrossRef Wong AK, Lee ESA (2014) Aligning and clustering patterns to reveal the protein functionality of sequences. IEEE/ACM Trans Comput Biol Bioinform 11(3):548–560CrossRef
37.
Zurück zum Zitat Freire JM, Dias SA, Flores L, Veiga AS, Castanho MA (2015) Mining viral proteins for antimicrobial and cell-penetrating drug delivery peptides. Bioinformatics 31(14):2252–2256CrossRef Freire JM, Dias SA, Flores L, Veiga AS, Castanho MA (2015) Mining viral proteins for antimicrobial and cell-penetrating drug delivery peptides. Bioinformatics 31(14):2252–2256CrossRef
38.
Zurück zum Zitat Vijaya PA, Murty MN, Subramanian DK (2006) Efficient median based clustering and classification techniques for protein sequences. Pattern Anal Appl 9(2):243–255MathSciNetCrossRef Vijaya PA, Murty MN, Subramanian DK (2006) Efficient median based clustering and classification techniques for protein sequences. Pattern Anal Appl 9(2):243–255MathSciNetCrossRef
39.
Zurück zum Zitat Floratou A, Tata S, Patel JM (2011) Efficient and accurate discovery of patterns in sequence data sets. IEEE Trans Knowl Data Eng 23(8):1154–1168CrossRef Floratou A, Tata S, Patel JM (2011) Efficient and accurate discovery of patterns in sequence data sets. IEEE Trans Knowl Data Eng 23(8):1154–1168CrossRef
40.
Zurück zum Zitat Wang K, Xu Y, Yu JX (2004) Scalable sequential pattern mining for biological sequences. In: Proceedings of the thirteenth ACM international conference on Information and knowledge management. ACM, pp 178–187 Wang K, Xu Y, Yu JX (2004) Scalable sequential pattern mining for biological sequences. In: Proceedings of the thirteenth ACM international conference on Information and knowledge management. ACM, pp 178–187
41.
Zurück zum Zitat Zhang J, Wang Y, Zhang C et al (2016) Mining contiguous sequential generators in biological sequences. IEEE/ACM Trans Comput Biol Bioinform 13(5):855–867CrossRef Zhang J, Wang Y, Zhang C et al (2016) Mining contiguous sequential generators in biological sequences. IEEE/ACM Trans Comput Biol Bioinform 13(5):855–867CrossRef
42.
Zurück zum Zitat Durian B, Holub J, Peltola H, Tarhio J (2009) Tuning BNDM with q-grams. In: Proceedings of the meeting on algorithm engineering and experiments, pp 29–37 Durian B, Holub J, Peltola H, Tarhio J (2009) Tuning BNDM with q-grams. In: Proceedings of the meeting on algorithm engineering and experiments, pp 29–37
43.
Zurück zum Zitat Prasad R, Agarwal S (2007) Optimal shift-or string matching algorithm for multiple patterns. In: Proceedings of international conference on computer science and applications, pp 263–266 Prasad R, Agarwal S (2007) Optimal shift-or string matching algorithm for multiple patterns. In: Proceedings of international conference on computer science and applications, pp 263–266
44.
Zurück zum Zitat Kandhan R, Teletia N, Patel JM (2010) SigMatch: fast and scalable multi-pattern matching. Proc VLDB Endow 3(1–2):1173–1184CrossRef Kandhan R, Teletia N, Patel JM (2010) SigMatch: fast and scalable multi-pattern matching. Proc VLDB Endow 3(1–2):1173–1184CrossRef
45.
Zurück zum Zitat Wang XD, Liu JX, Xu Y et al (2015) A survey of multiple sequence alignment techniques. In: International conference on intelligent computing. Springer International Publishing, pp 529–538 Wang XD, Liu JX, Xu Y et al (2015) A survey of multiple sequence alignment techniques. In: International conference on intelligent computing. Springer International Publishing, pp 529–538
46.
Zurück zum Zitat Prasad R, Agarwal S, Yadav I et al (2010) A fast bit-parallel multi-patterns string matching algorithm for biological sequences. In: Proceedings of the international symposium on biocomputing, pp 46 Prasad R, Agarwal S, Yadav I et al (2010) A fast bit-parallel multi-patterns string matching algorithm for biological sequences. In: Proceedings of the international symposium on biocomputing, pp 46
47.
Zurück zum Zitat Zhu H, He Z, Jia Y (2015) A novel approach to multiple sequence alignment using multi-objective evolutionary algorithm based on decomposition. IEEE J Biomed Health Inform 20(2):717–727CrossRef Zhu H, He Z, Jia Y (2015) A novel approach to multiple sequence alignment using multi-objective evolutionary algorithm based on decomposition. IEEE J Biomed Health Inform 20(2):717–727CrossRef
Metadaten
Titel
Co-occurrence pattern mining based on a biological approximation scoring matrix
Publikationsdatum
28.02.2017
Erschienen in
Pattern Analysis and Applications / Ausgabe 4/2018
Print ISSN: 1433-7541
Elektronische ISSN: 1433-755X
DOI
https://doi.org/10.1007/s10044-017-0609-8

Weitere Artikel der Ausgabe 4/2018

Pattern Analysis and Applications 4/2018 Zur Ausgabe