
2016 | OriginalPaper | Book chapter

Evaluation of Descriptor Algorithms of Biological Sequences and Distance Measures for the Intelligent Cluster Index (ICIx)

Written by: Stefan Schildbach, Florian Heinke, Wolfgang Benn, Dirk Labudde

Published in: Beyond Databases, Architectures and Structures. Advanced Technologies for Data Mining and Knowledge Discovery

Publisher: Springer International Publishing


Abstract

Over the past decades, data in all fields of the life sciences has grown rapidly. Most notable is the general tendency to retain well-established techniques tailored to specific biological requirements and common taxonomies for data classification. A change in perspective towards advanced technological concepts for persisting, organizing and analyzing these huge amounts of data is therefore essential. The Intelligent Cluster Index (ICIx) is a modern technology that indexes multidimensional data by semantic criteria and is thus qualified for this challenge. In this paper, methodical approaches for indexing biological sequences with the ICIx are discussed and evaluated. This includes examining established methods that concentrate on vector transformation and outlining the efficiency of different distance measures applied to the resulting vectors. Our results show that position-conserving methods are superior to other approaches and that the applied distance measures heavily influence performance and quality.
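
To make the idea of vector transformation and distance measures concrete, the following is a minimal Python sketch and not the method evaluated in the chapter. It assumes a simple n-gram frequency descriptor over a DNA alphabet and three common distance measures; the function names (`ngram_vector`, `euclidean`, `manhattan`, `cosine_distance`) and the default alphabet are hypothetical choices made for illustration. Note that plain n-gram counts discard positional information, so this descriptor is not one of the position-conserving methods the abstract favors.

```python
from collections import Counter
from itertools import product
import math


def ngram_vector(sequence, n=2, alphabet="ACGT"):
    """Map a sequence to a fixed-length vector of relative n-gram frequencies.

    Illustrative assumption: the vector has one entry per possible n-gram
    over `alphabet`, in lexicographic order of itertools.product.
    """
    grams = ["".join(p) for p in product(alphabet, repeat=n)]
    # Count every overlapping window of length n in the sequence.
    counts = Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))
    # Normalize only over n-grams that belong to the alphabet.
    total = sum(counts[g] for g in grams)
    return [counts[g] / total if total else 0.0 for g in grams]


def euclidean(u, v):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def cosine_distance(u, v):
    """One minus the cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


if __name__ == "__main__":
    x = ngram_vector("ACGTACGTAC")
    y = ngram_vector("ACGTTTGTAC")
    for name, dist in [("euclidean", euclidean),
                       ("manhattan", manhattan),
                       ("cosine", cosine_distance)]:
        print(name, dist(x, y))
```

Because every sequence is mapped to a vector of the same length, any vector distance can be swapped in without touching the descriptor, which is what makes a comparison of distance measures on the same descriptors possible.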


Footnotes
1
Other commonly used terms for n-grams are k-, t- or n-tuples and k-, t- or n-mers.
 
Metadata
Title
Evaluation of Descriptor Algorithms of Biological Sequences and Distance Measures for the Intelligent Cluster Index (ICIx)
Written by
Stefan Schildbach
Florian Heinke
Wolfgang Benn
Dirk Labudde
Copyright year
2016
DOI
https://doi.org/10.1007/978-3-319-34099-9_33
