2016 | OriginalPaper | Chapter

Protein Sequence Classification Based on N-Gram and K-Nearest Neighbor Algorithm

Authors: Jyotshna Dongardive, Siby Abraham

Published in: Computational Intelligence in Data Mining—Volume 2

Publisher: Springer India

Abstract

The paper proposes classifying protein sequences with the K-Nearest Neighbor (KNN) algorithm. The N-gram motif extraction method encodes biological sequences into feature vectors, and the generated N-grams are represented with a Boolean data representation. The experiments are conducted on a dataset of 717 sequences, unequally distributed across seven classes, with a sequence identity of 25%. The number of neighbors in the KNN classifier is set to 3, 5, 7, 9, 11, 13, and 15. Euclidean distance and the cosine coefficient similarity measure are used to determine the nearest neighbors. The experimental results show that the cosine measure with 15 neighbors gave the highest accuracy, 84%. The effectiveness of the proposed method is also shown by comparing the experimental results with those of other related methods on the same dataset.
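The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the N-gram length (`n = 2`), the toy sequences and labels, the function names, and the conversion of cosine similarity into a distance (`1 - similarity`) are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
from collections import Counter


def ngrams(sequence, n=2):
    """Return the set of contiguous n-grams occurring in a sequence."""
    return {sequence[i:i + n] for i in range(len(sequence) - n + 1)}


def boolean_vector(sequence, vocabulary, n=2):
    """Boolean encoding: 1 if the n-gram occurs in the sequence, else 0."""
    present = ngrams(sequence, n)
    return [1 if gram in present else 0 for gram in vocabulary]


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity, so that smaller means closer (assumed convention)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def knn_predict(train_vectors, train_labels, query, k, distance):
    """Majority vote among the k training vectors closest to the query."""
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: distance(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data invented for illustration only (not the paper's 717-sequence dataset).
train_sequences = ["ACDAC", "ACDCA", "CACDA", "WYWYW", "YWYWW", "WWYWY"]
train_labels = ["A", "A", "A", "B", "B", "B"]
n = 2  # assumed N-gram length

# The vocabulary is every n-gram seen in the training sequences.
vocabulary = sorted(set().union(*(ngrams(s, n) for s in train_sequences)))
train_vectors = [boolean_vector(s, vocabulary, n) for s in train_sequences]

query = boolean_vector("DACDA", vocabulary, n)
for name, distance in [("euclidean", euclidean_distance), ("cosine", cosine_distance)]:
    print(name, knn_predict(train_vectors, train_labels, query, k=3, distance=distance))
```

Passing the distance function as a parameter makes it straightforward to repeat the same vote for each neighbor count (3 to 15) and each measure, which is the kind of comparison the abstract reports.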

Metadata
Title
Protein Sequence Classification Based on N-Gram and K-Nearest Neighbor Algorithm
Authors
Jyotshna Dongardive
Siby Abraham
Copyright Year
2016
Publisher
Springer India
DOI
https://doi.org/10.1007/978-81-322-2731-1_15