nach oben

Pattern Analysis and Applications

Erschienen in:

01.10.2006 | Theoretical Advances

Efficient median based clustering and classification techniques for protein sequences

verfasst von: P. A. Vijaya, M. Narasimha Murty, D. K. Subramanian

Erschienen in: Pattern Analysis and Applications | Ausgabe 2-3/2006

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config

KI-gestützte Suche

Aus

Abstract

In this paper, an efficient K-medians clustering (unsupervised) algorithm for prototype selection and Supervised K-medians (SKM) classification technique for protein sequences are presented. For sequence data sets, a median string/sequence can be used as the cluster/group representative. In K-medians clustering technique, a desired number of clusters, K, each represented by a median string/sequence, is generated and these median sequences are used as prototypes for classifying the new/test sequence whereas in SKM classification technique, median sequence in each group/class of labelled protein sequences is determined and the set of median sequences is used as prototypes for classification purpose. It is found that the K-medians clustering technique outperforms the leader based technique and also SKM classification technique performs better than that of motifs based approach for the data sets used. We further use a simple technique to reduce time and space requirements during protein sequence clustering and classification. During training and testing phase, the similarity score value between a pair of sequences is determined by selecting a portion of the sequence instead of the entire sequence. It is like selecting a subset of features for sequence data sets. The experimental results of the proposed method on K-medians, SKM and Nearest Neighbour Classifier (NNC) techniques show that the Classification Accuracy (CA) using the prototypes generated/used does not degrade much but the training and testing time are reduced significantly. Thus the experimental results indicate that the similarity score does not need to be calculated by considering the entire length of the sequence for achieving a good CA. Even space requirement is reduced during both training and classification.

Vorheriger Artikel A DCT-based ownership identification method with gray-level and colorful signatures

Nächster Artikel Decomposition methodology for classification tasks: a meta decomposer framework

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Bandyopadhyay S (2005) An efficient technique for superfamily classification of amino acid sequences: feature extraction, fuzzy clustering and prototype selection. Fuzzy Sets Syst 152(1):5–16MATHMathSciNetCrossRef

Bolten E, Schliep A, Schneckener S, Schomburg D, Schrader R (2001) Clustering protein sequences-structure prediction by transitive homology. Bioinformatics 17(10):935–941CrossRef

Conte LL, Ailey B, Hubbard TJP, Brenner SE, Murzin AG, Chotia C (2000) SCOP: a structural classification of protein database. Nucleic Acids Res 28(1):257–259CrossRef

Cover T, Hart P (1967) Nearest neighbour pattern classification. IEEE Trans Inform Theory 13(1):21–27MATHCrossRef

Duda RO, Hart PE, Stork DG (2000) Pattern classification, 2nd edn. Wiley, New York

Durbin R, Eddy S, Krogh A, Mitchison G (1998) Biological sequence analysis. Cambridge University Press, Cambridge

Guha S, Meyerson A, Mishra N, Motwani R, O’Callaghan L (2003) Clustering data streams: theory and practice. IEEE Trans Knowl Data Eng 153:515–528CrossRef

Guralnik V, Karypis G (2001) A scalable algorithm for clustering sequential data. In: Proceedings of I IEEE conference on data mining, pp 179–186

Hamamoto Y, Uchimura S, Tomita S (1996) On the behavior of artificial neural network classifiers in high-dimensional spaces. IEEE Trans Pattern Anal Mach Intell 18(5):571–574CrossRef

10.

Han E, Karypis G, Kumar V, Mobasher B (1997) Clustering in a high dimensional space using hypergraph models. In: Proceedings of data mining and knowledge discovery

11.

Henikoff S, Henikoff JG (1994) Protein family classification based on searching a database of blocks. Genomics 19:97–107CrossRef

12.

Huang X, Webb M (1991) A time-efficient, linear-space local similarity algorithm. Adv Appl Math 12:337–357MATHCrossRef

13.

Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice-Hall, Upper Saddle RiverMATH

14.

Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323CrossRef

15.

Kaufman L, Rousseeuw P (1990) Finding groups in data: an introduction to cluster analysis. Wiley, New York

16.

Knuth DE (1998) Art of computer programming, 2nd edn, vol 3. Addison- Wesley, Reading

17.

Kohonen T (1985) Median strings. Pattern Recogn Lett (3):309–313

18.

Krause A (2002) Large scale clustering of protein sequences. Ph.D. Thesis, Berlin

19.

Kriventseva EV, Fleischmann W, Zdobnov EM, Apweiler G (2001) CluSTr: a database of clusters of SWISS-PROT+TrEMBL proteins. Nucleic Acids Res 29(1):33–36CrossRef

20.

Martinez CD, Juan A, Casacuberta F (2003) Median strings for k-nearest neighbour classification. Pattern Recogn Lett 24:173–181CrossRef

21.

MicA L, Oncina J, Vidal E (1994) A new version of the nearest-neighbor approximating and eliminating search algorithm (AESA) with linear preprocessing time and memory requirements. Pattern Recogn Lett 15:9–17CrossRef

22.

MicA L, Oncina J, Carrasco R (1996) A fast branch and bound nearest neighbor classifier in metric spaces. Pattern Recogn Lett 17:731–739CrossRef

23.

Mitra S, Acharya T (2003) Data mining: multimedia, soft computing and bioinformatics. Wiley, New York

24.

Moreno F, MicA L, Oncina J (2003) A modification of the LAESA algorithm for approximated k-NN classification. Pattern Recogn Lett 22:1145–1151

25.

Mount DW (2002) Bioinformatics—sequence and genome analysis. Cold Spring Harbor Lab Press, New York

26.

Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of the proteins. J Mol Biol 48:443–453CrossRef

27.

Ng RT, Han J (2002) CLARANS: a method for clustering objects for spatial data mining. IEEE Trans Knowl Data Eng 14(5):1003–1016CrossRef

28.

Pal SK, Mitra P (2004) Pattern recognition algorithms for data mining: scalability, knowledge discovery and soft granular computing. CHAPMAN & HALL/CRC

29.

Pearson W (1999) The FASTA program package http://ftp.virginia.edu/pub/fasta

30.

Peter C, Rolf B (2000) Computational molecular biology—an introduction. Wiley, New York

31.

Pujari AK (2000) Data mining techniques. Universities Press (India) Private Limited

32.

Ramasubramanian V, Paliwal KK (2000) Fast nearest neighbor search algorithms based on approximation-elimination search. Pattern Recogn 33:1497–1510CrossRef

33.

Sahni S (1998) Data Structures, Algorithms and applications in C++. WCB McGraw Hill

34.

Salzberg S, Cost S (1992) Predicting protein secondary structure with a nearest neighbour algorithm. J Mol Biol 227:371–374CrossRef

35.

Schutze H (2004) Single-link, complete-link and average-link clustering. “http://www.csli.stanford.edu/~schuetze/completelink.html

36.

Sharan R, Shamir R (2000) CLICK: a clustering algorithm with applications to gene expression analysis. In: Proceedings of 8th ISMB, pp 307–316

37.

Smith TF, Waterman MS (1981) Identification of common molecular subsequences. J Mol Biol 147:195–197CrossRef

38.

Spath H (1980) Cluster analysis algorithms for data reduction and classification. Ellis Horwood, Chichester

39.

Somervuo P, Kohonen T (2000) Clustering and visualization of large protein sequence databases by means of an extension of the self-organizing map. In: Proceedings of 3rd international conference on discovery science, pp 76–85

40.

Vidal E (1986) An algorithm for finding nearest neighbors in (approximately) constant average time. Pattern Recogn Lett 4:145–157CrossRef

41.

Vijaya PA, Murty MN, Subramanian DK (2003) An efficient incremental protein sequence clustering algorithm. In: Proceedings of IEEE TENCON, Asia Pacific, pp 409–413

42.

Vijaya PA, Murty MN, Subramanian DK (2004) An efficient hierarchical clustering algorithm for protein sequences. Int J Comput Sci Appl 1(2):61–75

43.

Vijaya PA, Murty MN, Subramanian DK (2003) Supervised K-medians algorithm for protein sequence classification. In; Proceedings of 5th international conference on advanced pattern recognition, pp 129–132

44.

Vijaya PA, Murty MN, Subramanian DK (2004) An efficient technique for protein sequence clustering and classification. In: Proceedings of 17th international conference on pattern recognition, Cambridge, UK, Vol II, pp 447–450

45.

Wang JTL, Thomas GM, Dennis S, Bruce S (1994) Chern, discovering active motifs in sets of related protein sequences and using them for classification. Nucleic Acids Res 6(4):559–571

46.

Yi TM, Eric S (1993) Protein secondary structure prediction using nearest neighbour methods. J Mol Biol 232:1117–1129CrossRef

47.

Yona G, Linial N, Linial M (2000) ProtoMap: automatic classification of protein sequences and hierarchy of protein families. Nucleic Acids Res 28(1):49–55CrossRef

48.

http://www.pfam.cgb.ki.se/

49.

http://www.systers.molgen.mpg.de/

50.

http://www.protonet.cs.huji.ac.il/

51.

http://www.pir.georgetown.edu/pirsf/

Titel: Efficient median based clustering and classification techniques for protein sequences
verfasst von: P. A. Vijaya
M. Narasimha Murty
D. K. Subramanian
Publikationsdatum: 01.10.2006
Verlag: Springer-Verlag
Erschienen in: Pattern Analysis and Applications / Ausgabe 2-3/2006
Print ISSN: 1433-7541
Elektronische ISSN: 1433-755X
DOI: https://doi.org/10.1007/s10044-006-0040-z

Springer Professional

Abstract

Bitte loggen Sie sich ein, um Zugang zu Ihrer Lizenz zu erhalten.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Springer Professional "Technik"

Springer Professional "Wirtschaft"

Weitere Artikel der Ausgabe 2-3/2006

Moving object detection by mobile Stereo Omni-directional System (SOS) using spherical depth image

A review on Gabor wavelets for face recognition

Fusion of phase and orientation information for palmprint authentication

Classifier-independent feature selection on the basis of divergence criterion

A novel look-ahead optimization strategy for trie-based approximate string matching

A novel feature-based tracking approach to the detection, localization, and 3-D reconstruction of internal defects in hardwood logs using computer tomography

Premium Partner