skip to main content
10.1145/565196.565200acmconferencesArticle/Chapter ViewAbstractPublication PagesrecombConference Proceedingsconference-collections
Article

Monotony of surprise and large-scale quest for unusual words

Published:18 April 2002Publication History

ABSTRACT

The problem of characterizing and detecting recurrent sequence patterns such as substrings or motifs and related associations or rules is variously pursued in order to compress data, unveil structure, infer succinct descriptions, extract and classify features, etc. In Molecular Biology, exceptionally frequent or rare words in bio-sequences have been implicated in various facets of biological function and structure. The discovery, particularly on a massive scale, of such patterns poses interesting methodological and algorithmic problems, and often exposes scenarios in which tables and synopses grow faster and bigger than the raw sequences they are meant to encapsulate. In previous study, the ability to succinctly compute, store, and display unusual substrings has been linked to a subtle interplay between the combinatorics of the subwords of a word and local monotonicities of some scores used to measure the departure from expectation. In this paper, we carry out an extensive analysis of such monotonicities for a broader variety of scores. This supports the construction of data structures and algorithms capable of performing global detection of unusual substrings in time and space linear in the subject sequences, under various probabilistic models.

References

  1. Apostolico, A. Of maps bigger than the empire, Keynote, Proceedings of the 8th International Colloquium on String Processing and Information Retrieval, Laguna de San Rafael, Chile, November 2001, IEEE Computer Society Press, 2--10 (2001).Google ScholarGoogle Scholar
  2. Apostolico, A., Bock, M. E., Lonardi, S., and Xu, X. Efficient detection of unusual words. J. Comput. Bio. 7, 1/2 (Jan. 2000), 71--94.Google ScholarGoogle ScholarCross RefCross Ref
  3. Apostolico, A., Bock, M. E., and Xu, X. Annotated statistical indices for sequence analysis. Keynote, Proceedings of Complexity and Compression of SEQUENCES97, Positano, Italy, June 1997, IEEE Computer Society Press, 215--229 (1998). Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Apostolico, A., and Galil, Z., Eds. Pattern matching algorithms. Oxford University Press, New York, NY, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Blumer, A., Blumer, J., Ehrenfeucht, A., Haussler, D., and McConnel, R. Complete inverted files for efficient text retrieval and analysis. J. Assoc. Comput. Mach. 34, 3 (1987), 578--595. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Borges, J. L. A Universal History of Infamy. Penguin Books, London, 1975.Google ScholarGoogle Scholar
  7. Clift, B., Haussler, D., McConnell, R., Schneider, T. D., and Stormo, G. D. Sequence landscapes. Nucleic Acids Res. 14 (1986), 141--158.Google ScholarGoogle ScholarCross RefCross Ref
  8. Gentleman, J. The distribution of the frequency of subsequences in alphabetic sequences, as exemplified by deoxyribonucleic acid. Appl. Statist. 43 (1994), 404--414.Google ScholarGoogle ScholarCross RefCross Ref
  9. Kleffe, J., and Borodovsky, M. First and second moment of counts of words in random texts generated by Markov chains. Comput. Appl. Biosci. 8 (1992), 433--441.Google ScholarGoogle Scholar
  10. Leung, M. Y., Marsh, G. M., and Speed, T. P. Over and underrepresentation of short DNA words in herpesvirus genomes. J. Comput. Bio. 3 (1996), 345--360.Google ScholarGoogle ScholarCross RefCross Ref
  11. Lonardi, S. Global Detectors of Unusual Words: Design, Implementation, and Applications to Pattern Discovery in Biosequences. PhD thesis, Purdue University, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Lundstrom, R. Stochastic models and statistical methods for DNA sequence data. PhD thesis, University of Utah, 1990.Google ScholarGoogle Scholar
  13. Pevzner, P. A., Borodovsky, M. Y., and Mironov, A. A. Linguistics of nucleotides sequences I: The significance of deviations from mean statistical characteristics and prediction of the frequencies of occurrence of words. J. Biomol. Struct. Dynamics 6 (1989), 1013--1026.Google ScholarGoogle ScholarCross RefCross Ref
  14. Régnier, M., and Szpankowski, W. On pattern frequency occurrences in a Markovian sequence. Algorithmica 22 (1998), 631--649.Google ScholarGoogle ScholarCross RefCross Ref
  15. Reinert, G., Schbath, S., and Waterman, M. S. Probabilistic and statistical properties of words: An overview. J. Comput. Bio. 7 (2000), 1--46.Google ScholarGoogle ScholarCross RefCross Ref
  16. Stückle, E., Emmrich, C., Grob, U., and Nielsen, P. Statistical analysis of nucleotide sequences. Nucleic Acids Res. 18, 22 (1990), 6641--6647.Google ScholarGoogle ScholarCross RefCross Ref
  17. Waterman, M. S. Introduction to Computational Biology. Chapman & Hall, 1995.Google ScholarGoogle ScholarCross RefCross Ref

Index Terms

  1. Monotony of surprise and large-scale quest for unusual words

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in
          • Published in

            cover image ACM Conferences
            RECOMB '02: Proceedings of the sixth annual international conference on Computational biology
            April 2002
            341 pages
            ISBN:1581134983
            DOI:10.1145/565196

            Copyright © 2002 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 18 April 2002

            Permissions

            Request permissions about this article.

            Request Permissions

            Check for updates

            Qualifiers

            • Article

            Acceptance Rates

            RECOMB '02 Paper Acceptance Rate35of118submissions,30%Overall Acceptance Rate148of538submissions,28%

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader