Skip to main content

2013 | OriginalPaper | Buchkapitel

Proportion of Gaps and Fluctuations of the Optimal Score in Random Sequence Comparison

verfasst von : Jüri Lember, Heinrich Matzinger, Felipe Torres

Erschienen in: Limit Theorems in Probability, Statistics and Number Theory

Verlag: Springer Berlin Heidelberg

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

We study the asymptotic properties of optimal alignments when aligning two independent i.i.d. sequences over finite alphabet. Such kind of alignment is an important tool in many fields of applications including computational molecular biology. We are particularly interested in the (asymptotic) proportion of gaps of the optimal alignment. We show that when the limit of the average optimal score per letter (rescaled score) is considered as a function of the gap penalty, then given a gap penalty, the proportion of the gaps converges to the derivative of the limit score at that particular penalty. Such an approach, where the gap penalty is allowed to vary, has not been explored before. As an application, we solve the long open problem of the fluctuation of the optimal alignment in the case when the gap penalty is sufficiently large. In particular, we prove that for all scoring functions without a certain symmetry, as long as the gap penalty is large enough, the fluctuations of the optimal alignment score are of order square root of the length of the strings. This order was conjectured by Waterman [Phil. Trans. R. Soc. Lond. B 344(1):383–390, 1994] but disproves the conjecture of Chvatal and Sankoff in [J. Appl. Probab. 12:306–315, 1975].

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literatur
1.
Zurück zum Zitat K.S. Alexander, The rate of convergence of the mean length of the longest common subsequence. Ann. Appl. Probab. 4(4), 1074–1082 (1994)MathSciNetMATHCrossRef K.S. Alexander, The rate of convergence of the mean length of the longest common subsequence. Ann. Appl. Probab. 4(4), 1074–1082 (1994)MathSciNetMATHCrossRef
2.
Zurück zum Zitat F. Bonetto, H. Matzinger, Fluctuations of the longest common subsequence in the case of2- and 3-letter alphabets. Latin Am. J. Probab. Math. 2, 195–216 (2006)MathSciNetMATH F. Bonetto, H. Matzinger, Fluctuations of the longest common subsequence in the case of2- and 3-letter alphabets. Latin Am. J. Probab. Math. 2, 195–216 (2006)MathSciNetMATH
3.
Zurück zum Zitat J. Boutet de Monvel, Extensive simulations for longest common subsequences. Eur. Phys. J. B 7, 293–308 (1999)CrossRef J. Boutet de Monvel, Extensive simulations for longest common subsequences. Eur. Phys. J. B 7, 293–308 (1999)CrossRef
4.
Zurück zum Zitat N. Christianini, M.W. Hahn, Introduction to Computational Genomics (Cambridge University Press, Cambridge, 2007) N. Christianini, M.W. Hahn, Introduction to Computational Genomics (Cambridge University Press, Cambridge, 2007)
5.
6.
Zurück zum Zitat L. Devroye, G. Lugosi, L. Gyorfi, A Probabilistic Theory of Pattern Recognition (Springer, New York, 1996)MATH L. Devroye, G. Lugosi, L. Gyorfi, A Probabilistic Theory of Pattern Recognition (Springer, New York, 1996)MATH
7.
Zurück zum Zitat R. Durbin, S. Eddy, A. Krogh, G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids (Cambridge University Press, Cambridge, 1998)MATHCrossRef R. Durbin, S. Eddy, A. Krogh, G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids (Cambridge University Press, Cambridge, 1998)MATHCrossRef
8.
Zurück zum Zitat C. Durringer, J. Lember, H. Matzinger, Deviation from the mean in sequence comparison with a periodic sequence. ALEA 3, 1–29 (2007)MathSciNetMATH C. Durringer, J. Lember, H. Matzinger, Deviation from the mean in sequence comparison with a periodic sequence. ALEA 3, 1–29 (2007)MathSciNetMATH
9.
Zurück zum Zitat C. Houdre, H. Matzinger, Fluctuations of the optimal alignment score with and asymmetric scoring function. [arXiv:math/0702036] C. Houdre, H. Matzinger, Fluctuations of the optimal alignment score with and asymmetric scoring function. [arXiv:math/0702036]
10.
11.
Zurück zum Zitat J. Lember, H. Matzinger, F. Torres, The rate of the convergence of the mean score in random sequence comparison. Ann. Appl. Probab. 22(3), 1046–1058 (2012)MathSciNetMATHCrossRef J. Lember, H. Matzinger, F. Torres, The rate of the convergence of the mean score in random sequence comparison. Ann. Appl. Probab. 22(3), 1046–1058 (2012)MathSciNetMATHCrossRef
12.
Zurück zum Zitat J. Lember, H. Matzinger, F. Torres, General approach to the fluctuations problem in random sequence comparison. arXiv:1211.5072 (Submitted, 2012). J. Lember, H. Matzinger, F. Torres, General approach to the fluctuations problem in random sequence comparison. arXiv:1211.5072 (Submitted, 2012).
13.
Zurück zum Zitat C.-Y. Lin, F.J. Och, Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics, in ACL ’04: Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, Barcelona, Spain (Association for Computational Linguistics, Stroudsburg 2004), p. 605 C.-Y. Lin, F.J. Och, Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics, in ACL ’04: Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, Barcelona, Spain (Association for Computational Linguistics, Stroudsburg 2004), p. 605
14.
Zurück zum Zitat H. Matzinger, F. Torres, Fluctuation of the longest common subsequence for sequences of independent blocks. [arvix math.PR/1001.1273v3] H. Matzinger, F. Torres, Fluctuation of the longest common subsequence for sequences of independent blocks. [arvix math.PR/1001.1273v3]
15.
Zurück zum Zitat H. Matzinger, F. Torres, Random modification effect in the size of the fluctuation of the LCS of two sequences of i.i.d. blocks. [arXiv math.PR/1011.2679v2] H. Matzinger, F. Torres, Random modification effect in the size of the fluctuation of the LCS of two sequences of i.i.d. blocks. [arXiv math.PR/1011.2679v2]
16.
17.
Zurück zum Zitat I.D. Melamed, Bitext maps and alignment via pattern recognition. Comput. Linguist. 25(1), 107–130 (1999) I.D. Melamed, Bitext maps and alignment via pattern recognition. Comput. Linguist. 25(1), 107–130 (1999)
18.
Zurück zum Zitat P. Pevzner, Computational Molecular Biology. An algorithmic approach, A Bradford Book (MIT, Cambridge, 2000) P. Pevzner, Computational Molecular Biology. An algorithmic approach, A Bradford Book (MIT, Cambridge, 2000)
19.
Zurück zum Zitat R.T. Rockafellar, Convex Analysis (Princeton University Press, Princeton, 1970)MATH R.T. Rockafellar, Convex Analysis (Princeton University Press, Princeton, 1970)MATH
20.
Zurück zum Zitat T.F. Smith, M.S. Waterman, Identification of common molecular subsequences. J. Mol. Bio. 147, 195–197 (1981)CrossRef T.F. Smith, M.S. Waterman, Identification of common molecular subsequences. J. Mol. Bio. 147, 195–197 (1981)CrossRef
23.
Zurück zum Zitat M.S. Waterman, Estimating statistical significance of sequence alignments. Phil. Trans. R. Soc. Lond. B 344(1), 383–390 (1994)MathSciNetCrossRef M.S. Waterman, Estimating statistical significance of sequence alignments. Phil. Trans. R. Soc. Lond. B 344(1), 383–390 (1994)MathSciNetCrossRef
24.
Zurück zum Zitat M.S. Waterman, Introduction to Computational Biology (Chapman & Hall, London, 1995)MATH M.S. Waterman, Introduction to Computational Biology (Chapman & Hall, London, 1995)MATH
25.
Zurück zum Zitat K. Wing Li, C.C. Yang, Automatic construction of english/chinese parallel corpora. J. Am. Soc. Inform. Sci. Tech. 54, 730–742 (2003)CrossRef K. Wing Li, C.C. Yang, Automatic construction of english/chinese parallel corpora. J. Am. Soc. Inform. Sci. Tech. 54, 730–742 (2003)CrossRef
Metadaten
Titel
Proportion of Gaps and Fluctuations of the Optimal Score in Random Sequence Comparison
verfasst von
Jüri Lember
Heinrich Matzinger
Felipe Torres
Copyright-Jahr
2013
Verlag
Springer Berlin Heidelberg
DOI
https://doi.org/10.1007/978-3-642-36068-8_10