Skip to main content
Erschienen in: Pattern Analysis and Applications 2-3/2006

01.10.2006 | Theoretical Advances

Efficient median based clustering and classification techniques for protein sequences

verfasst von: P. A. Vijaya, M. Narasimha Murty, D. K. Subramanian

Erschienen in: Pattern Analysis and Applications | Ausgabe 2-3/2006

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

In this paper, an efficient K-medians clustering (unsupervised) algorithm for prototype selection and Supervised K-medians (SKM) classification technique for protein sequences are presented. For sequence data sets, a median string/sequence can be used as the cluster/group representative. In K-medians clustering technique, a desired number of clusters, K, each represented by a median string/sequence, is generated and these median sequences are used as prototypes for classifying the new/test sequence whereas in SKM classification technique, median sequence in each group/class of labelled protein sequences is determined and the set of median sequences is used as prototypes for classification purpose. It is found that the K-medians clustering technique outperforms the leader based technique and also SKM classification technique performs better than that of motifs based approach for the data sets used. We further use a simple technique to reduce time and space requirements during protein sequence clustering and classification. During training and testing phase, the similarity score value between a pair of sequences is determined by selecting a portion of the sequence instead of the entire sequence. It is like selecting a subset of features for sequence data sets. The experimental results of the proposed method on K-medians, SKM and Nearest Neighbour Classifier (NNC) techniques show that the Classification Accuracy (CA) using the prototypes generated/used does not degrade much but the training and testing time are reduced significantly. Thus the experimental results indicate that the similarity score does not need to be calculated by considering the entire length of the sequence for achieving a good CA. Even space requirement is reduced during both training and classification.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literatur
1.
Zurück zum Zitat Bandyopadhyay S (2005) An efficient technique for superfamily classification of amino acid sequences: feature extraction, fuzzy clustering and prototype selection. Fuzzy Sets Syst 152(1):5–16MATHMathSciNetCrossRef Bandyopadhyay S (2005) An efficient technique for superfamily classification of amino acid sequences: feature extraction, fuzzy clustering and prototype selection. Fuzzy Sets Syst 152(1):5–16MATHMathSciNetCrossRef
2.
Zurück zum Zitat Bolten E, Schliep A, Schneckener S, Schomburg D, Schrader R (2001) Clustering protein sequences-structure prediction by transitive homology. Bioinformatics 17(10):935–941CrossRef Bolten E, Schliep A, Schneckener S, Schomburg D, Schrader R (2001) Clustering protein sequences-structure prediction by transitive homology. Bioinformatics 17(10):935–941CrossRef
3.
Zurück zum Zitat Conte LL, Ailey B, Hubbard TJP, Brenner SE, Murzin AG, Chotia C (2000) SCOP: a structural classification of protein database. Nucleic Acids Res 28(1):257–259CrossRef Conte LL, Ailey B, Hubbard TJP, Brenner SE, Murzin AG, Chotia C (2000) SCOP: a structural classification of protein database. Nucleic Acids Res 28(1):257–259CrossRef
4.
Zurück zum Zitat Cover T, Hart P (1967) Nearest neighbour pattern classification. IEEE Trans Inform Theory 13(1):21–27MATHCrossRef Cover T, Hart P (1967) Nearest neighbour pattern classification. IEEE Trans Inform Theory 13(1):21–27MATHCrossRef
5.
Zurück zum Zitat Duda RO, Hart PE, Stork DG (2000) Pattern classification, 2nd edn. Wiley, New York Duda RO, Hart PE, Stork DG (2000) Pattern classification, 2nd edn. Wiley, New York
6.
Zurück zum Zitat Durbin R, Eddy S, Krogh A, Mitchison G (1998) Biological sequence analysis. Cambridge University Press, Cambridge Durbin R, Eddy S, Krogh A, Mitchison G (1998) Biological sequence analysis. Cambridge University Press, Cambridge
7.
Zurück zum Zitat Guha S, Meyerson A, Mishra N, Motwani R, O’Callaghan L (2003) Clustering data streams: theory and practice. IEEE Trans Knowl Data Eng 153:515–528CrossRef Guha S, Meyerson A, Mishra N, Motwani R, O’Callaghan L (2003) Clustering data streams: theory and practice. IEEE Trans Knowl Data Eng 153:515–528CrossRef
8.
Zurück zum Zitat Guralnik V, Karypis G (2001) A scalable algorithm for clustering sequential data. In: Proceedings of I IEEE conference on data mining, pp 179–186 Guralnik V, Karypis G (2001) A scalable algorithm for clustering sequential data. In: Proceedings of I IEEE conference on data mining, pp 179–186
9.
Zurück zum Zitat Hamamoto Y, Uchimura S, Tomita S (1996) On the behavior of artificial neural network classifiers in high-dimensional spaces. IEEE Trans Pattern Anal Mach Intell 18(5):571–574CrossRef Hamamoto Y, Uchimura S, Tomita S (1996) On the behavior of artificial neural network classifiers in high-dimensional spaces. IEEE Trans Pattern Anal Mach Intell 18(5):571–574CrossRef
10.
Zurück zum Zitat Han E, Karypis G, Kumar V, Mobasher B (1997) Clustering in a high dimensional space using hypergraph models. In: Proceedings of data mining and knowledge discovery Han E, Karypis G, Kumar V, Mobasher B (1997) Clustering in a high dimensional space using hypergraph models. In: Proceedings of data mining and knowledge discovery
11.
Zurück zum Zitat Henikoff S, Henikoff JG (1994) Protein family classification based on searching a database of blocks. Genomics 19:97–107CrossRef Henikoff S, Henikoff JG (1994) Protein family classification based on searching a database of blocks. Genomics 19:97–107CrossRef
12.
Zurück zum Zitat Huang X, Webb M (1991) A time-efficient, linear-space local similarity algorithm. Adv Appl Math 12:337–357MATHCrossRef Huang X, Webb M (1991) A time-efficient, linear-space local similarity algorithm. Adv Appl Math 12:337–357MATHCrossRef
13.
Zurück zum Zitat Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice-Hall, Upper Saddle RiverMATH Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice-Hall, Upper Saddle RiverMATH
14.
Zurück zum Zitat Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323CrossRef Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323CrossRef
15.
Zurück zum Zitat Kaufman L, Rousseeuw P (1990) Finding groups in data: an introduction to cluster analysis. Wiley, New York Kaufman L, Rousseeuw P (1990) Finding groups in data: an introduction to cluster analysis. Wiley, New York
16.
Zurück zum Zitat Knuth DE (1998) Art of computer programming, 2nd edn, vol 3. Addison- Wesley, Reading Knuth DE (1998) Art of computer programming, 2nd edn, vol 3. Addison- Wesley, Reading
17.
Zurück zum Zitat Kohonen T (1985) Median strings. Pattern Recogn Lett (3):309–313 Kohonen T (1985) Median strings. Pattern Recogn Lett (3):309–313
18.
Zurück zum Zitat Krause A (2002) Large scale clustering of protein sequences. Ph.D. Thesis, Berlin Krause A (2002) Large scale clustering of protein sequences. Ph.D. Thesis, Berlin
19.
Zurück zum Zitat Kriventseva EV, Fleischmann W, Zdobnov EM, Apweiler G (2001) CluSTr: a database of clusters of SWISS-PROT+TrEMBL proteins. Nucleic Acids Res 29(1):33–36CrossRef Kriventseva EV, Fleischmann W, Zdobnov EM, Apweiler G (2001) CluSTr: a database of clusters of SWISS-PROT+TrEMBL proteins. Nucleic Acids Res 29(1):33–36CrossRef
20.
Zurück zum Zitat Martinez CD, Juan A, Casacuberta F (2003) Median strings for k-nearest neighbour classification. Pattern Recogn Lett 24:173–181CrossRef Martinez CD, Juan A, Casacuberta F (2003) Median strings for k-nearest neighbour classification. Pattern Recogn Lett 24:173–181CrossRef
21.
Zurück zum Zitat MicA L, Oncina J, Vidal E (1994) A new version of the nearest-neighbor approximating and eliminating search algorithm (AESA) with linear preprocessing time and memory requirements. Pattern Recogn Lett 15:9–17CrossRef MicA L, Oncina J, Vidal E (1994) A new version of the nearest-neighbor approximating and eliminating search algorithm (AESA) with linear preprocessing time and memory requirements. Pattern Recogn Lett 15:9–17CrossRef
22.
Zurück zum Zitat MicA L, Oncina J, Carrasco R (1996) A fast branch and bound nearest neighbor classifier in metric spaces. Pattern Recogn Lett 17:731–739CrossRef MicA L, Oncina J, Carrasco R (1996) A fast branch and bound nearest neighbor classifier in metric spaces. Pattern Recogn Lett 17:731–739CrossRef
23.
Zurück zum Zitat Mitra S, Acharya T (2003) Data mining: multimedia, soft computing and bioinformatics. Wiley, New York Mitra S, Acharya T (2003) Data mining: multimedia, soft computing and bioinformatics. Wiley, New York
24.
Zurück zum Zitat Moreno F, MicA L, Oncina J (2003) A modification of the LAESA algorithm for approximated k-NN classification. Pattern Recogn Lett 22:1145–1151 Moreno F, MicA L, Oncina J (2003) A modification of the LAESA algorithm for approximated k-NN classification. Pattern Recogn Lett 22:1145–1151
25.
Zurück zum Zitat Mount DW (2002) Bioinformatics—sequence and genome analysis. Cold Spring Harbor Lab Press, New York Mount DW (2002) Bioinformatics—sequence and genome analysis. Cold Spring Harbor Lab Press, New York
26.
Zurück zum Zitat Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of the proteins. J Mol Biol 48:443–453CrossRef Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of the proteins. J Mol Biol 48:443–453CrossRef
27.
Zurück zum Zitat Ng RT, Han J (2002) CLARANS: a method for clustering objects for spatial data mining. IEEE Trans Knowl Data Eng 14(5):1003–1016CrossRef Ng RT, Han J (2002) CLARANS: a method for clustering objects for spatial data mining. IEEE Trans Knowl Data Eng 14(5):1003–1016CrossRef
28.
Zurück zum Zitat Pal SK, Mitra P (2004) Pattern recognition algorithms for data mining: scalability, knowledge discovery and soft granular computing. CHAPMAN & HALL/CRC Pal SK, Mitra P (2004) Pattern recognition algorithms for data mining: scalability, knowledge discovery and soft granular computing. CHAPMAN & HALL/CRC
30.
Zurück zum Zitat Peter C, Rolf B (2000) Computational molecular biology—an introduction. Wiley, New York Peter C, Rolf B (2000) Computational molecular biology—an introduction. Wiley, New York
31.
Zurück zum Zitat Pujari AK (2000) Data mining techniques. Universities Press (India) Private Limited Pujari AK (2000) Data mining techniques. Universities Press (India) Private Limited
32.
Zurück zum Zitat Ramasubramanian V, Paliwal KK (2000) Fast nearest neighbor search algorithms based on approximation-elimination search. Pattern Recogn 33:1497–1510CrossRef Ramasubramanian V, Paliwal KK (2000) Fast nearest neighbor search algorithms based on approximation-elimination search. Pattern Recogn 33:1497–1510CrossRef
33.
Zurück zum Zitat Sahni S (1998) Data Structures, Algorithms and applications in C++. WCB McGraw Hill Sahni S (1998) Data Structures, Algorithms and applications in C++. WCB McGraw Hill
34.
Zurück zum Zitat Salzberg S, Cost S (1992) Predicting protein secondary structure with a nearest neighbour algorithm. J Mol Biol 227:371–374CrossRef Salzberg S, Cost S (1992) Predicting protein secondary structure with a nearest neighbour algorithm. J Mol Biol 227:371–374CrossRef
36.
Zurück zum Zitat Sharan R, Shamir R (2000) CLICK: a clustering algorithm with applications to gene expression analysis. In: Proceedings of 8th ISMB, pp 307–316 Sharan R, Shamir R (2000) CLICK: a clustering algorithm with applications to gene expression analysis. In: Proceedings of 8th ISMB, pp 307–316
37.
Zurück zum Zitat Smith TF, Waterman MS (1981) Identification of common molecular subsequences. J Mol Biol 147:195–197CrossRef Smith TF, Waterman MS (1981) Identification of common molecular subsequences. J Mol Biol 147:195–197CrossRef
38.
Zurück zum Zitat Spath H (1980) Cluster analysis algorithms for data reduction and classification. Ellis Horwood, Chichester Spath H (1980) Cluster analysis algorithms for data reduction and classification. Ellis Horwood, Chichester
39.
Zurück zum Zitat Somervuo P, Kohonen T (2000) Clustering and visualization of large protein sequence databases by means of an extension of the self-organizing map. In: Proceedings of 3rd international conference on discovery science, pp 76–85 Somervuo P, Kohonen T (2000) Clustering and visualization of large protein sequence databases by means of an extension of the self-organizing map. In: Proceedings of 3rd international conference on discovery science, pp 76–85
40.
Zurück zum Zitat Vidal E (1986) An algorithm for finding nearest neighbors in (approximately) constant average time. Pattern Recogn Lett 4:145–157CrossRef Vidal E (1986) An algorithm for finding nearest neighbors in (approximately) constant average time. Pattern Recogn Lett 4:145–157CrossRef
41.
Zurück zum Zitat Vijaya PA, Murty MN, Subramanian DK (2003) An efficient incremental protein sequence clustering algorithm. In: Proceedings of IEEE TENCON, Asia Pacific, pp 409–413 Vijaya PA, Murty MN, Subramanian DK (2003) An efficient incremental protein sequence clustering algorithm. In: Proceedings of IEEE TENCON, Asia Pacific, pp 409–413
42.
Zurück zum Zitat Vijaya PA, Murty MN, Subramanian DK (2004) An efficient hierarchical clustering algorithm for protein sequences. Int J Comput Sci Appl 1(2):61–75 Vijaya PA, Murty MN, Subramanian DK (2004) An efficient hierarchical clustering algorithm for protein sequences. Int J Comput Sci Appl 1(2):61–75
43.
Zurück zum Zitat Vijaya PA, Murty MN, Subramanian DK (2003) Supervised K-medians algorithm for protein sequence classification. In; Proceedings of 5th international conference on advanced pattern recognition, pp 129–132 Vijaya PA, Murty MN, Subramanian DK (2003) Supervised K-medians algorithm for protein sequence classification. In; Proceedings of 5th international conference on advanced pattern recognition, pp 129–132
44.
Zurück zum Zitat Vijaya PA, Murty MN, Subramanian DK (2004) An efficient technique for protein sequence clustering and classification. In: Proceedings of 17th international conference on pattern recognition, Cambridge, UK, Vol II, pp 447–450 Vijaya PA, Murty MN, Subramanian DK (2004) An efficient technique for protein sequence clustering and classification. In: Proceedings of 17th international conference on pattern recognition, Cambridge, UK, Vol II, pp 447–450
45.
Zurück zum Zitat Wang JTL, Thomas GM, Dennis S, Bruce S (1994) Chern, discovering active motifs in sets of related protein sequences and using them for classification. Nucleic Acids Res 6(4):559–571 Wang JTL, Thomas GM, Dennis S, Bruce S (1994) Chern, discovering active motifs in sets of related protein sequences and using them for classification. Nucleic Acids Res 6(4):559–571
46.
Zurück zum Zitat Yi TM, Eric S (1993) Protein secondary structure prediction using nearest neighbour methods. J Mol Biol 232:1117–1129CrossRef Yi TM, Eric S (1993) Protein secondary structure prediction using nearest neighbour methods. J Mol Biol 232:1117–1129CrossRef
47.
Zurück zum Zitat Yona G, Linial N, Linial M (2000) ProtoMap: automatic classification of protein sequences and hierarchy of protein families. Nucleic Acids Res 28(1):49–55CrossRef Yona G, Linial N, Linial M (2000) ProtoMap: automatic classification of protein sequences and hierarchy of protein families. Nucleic Acids Res 28(1):49–55CrossRef
Metadaten
Titel
Efficient median based clustering and classification techniques for protein sequences
verfasst von
P. A. Vijaya
M. Narasimha Murty
D. K. Subramanian
Publikationsdatum
01.10.2006
Verlag
Springer-Verlag
Erschienen in
Pattern Analysis and Applications / Ausgabe 2-3/2006
Print ISSN: 1433-7541
Elektronische ISSN: 1433-755X
DOI
https://doi.org/10.1007/s10044-006-0040-z

Weitere Artikel der Ausgabe 2-3/2006

Pattern Analysis and Applications 2-3/2006 Zur Ausgabe

Premium Partner