skip to main content
10.1145/565196.565208acmconferencesArticle/Chapter ViewAbstractPublication PagesrecombConference Proceedingsconference-collections
Article

Provably sensitive Indexing strategies for biosequence similarity search

Published:18 April 2002Publication History

ABSTRACT

The field of algorithms for pairwise biosequence similarity search is dominated by heuristic methods of high efficiency but uncertain sensitivity. One reason that more formal string matching algorithms with sensitivity guarantees have not been applied to biosequences is that they cannot directly find similarities that score highly under substitution score functions such as the DNAPAM-TT [20], PAM [9], or BLOSUM [12] families of matrices. We describe a general technique, score simulation, to map ungapped similarity search problems using these score functions into the problem of finding pairs of strings that are close in Hamming space. Score simulation leads to indexing schemes for biosequences that permit efficient ungapped similarity searches with formal guarantees of sensitivity using arbitrary score functions. In particular, we introduce the lsh-all-pairs-sim algorithm for finding local similarities in large biosequence collections and show that it is both computationally feasible and sensitive in practice.

References

  1. S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25(17):3389--402, 1997.Google ScholarGoogle ScholarCross RefCross Ref
  2. J. Amilhastre. Complexity of minimum biclique decomposition of bipartite graphs. http://citeseer.nj.nec.com/148735.html.Google ScholarGoogle Scholar
  3. F. R. Blattner, G. Plunkett III, C. A. Bloch, N. T. Perna, V. Burland, et al. The complete genome sequence of Escherichia coli K-12. Science, 277:1453--74, 1997.Google ScholarGoogle ScholarCross RefCross Ref
  4. J. Buhler. Efficient large-scale sequence comparison by locality-sensitive hashing. Bioinformatics, 17:419--28, 2001.Google ScholarGoogle ScholarCross RefCross Ref
  5. J. Buhler. Search Algorithms for Biosequences Using Random Projection. PhD thesis, University of Washington, Seattle, WA, 2001. Available online at http://www.cs.wustl.edu/~jbuhler/thesis.html. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. J. Buhler and M. Tompa. Finding motifs using random projections. In RECOMB01: Proceedings of the Fifth Annual International Conference on Computational Molecular Biology, Montreal, Canada, Apr. 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. S. Center. Bacteroides fragilis genome, 2001. ftp://ftp.sanger.ac.uk/pub/pathogens/bf/.Google ScholarGoogle Scholar
  8. W. I. Chang and E. L. Lawler. Sublinear expected time approximate string matching and biological applications. Algorithmica, 12:327--44, 1994.Google ScholarGoogle ScholarCross RefCross Ref
  9. M. O. Dayhoff, R. Schwartz, and B. C. Orcutt. A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure, 5:345--52, 1978.Google ScholarGoogle Scholar
  10. A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proceedings of the 25th International Conference on Very Large Databases, Edinburgh, Scotland, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. J. F. Heidelberg, J. A. Eisen, W. C. Nelson, R. A. Clayton, M. L. Gwinn, et al. DNA sequence of both chromosomes of the cholera pathogen Vibrio cholerae. Nature, 406:477--83, 2000.Google ScholarGoogle ScholarCross RefCross Ref
  12. S. Henikoff and J. G. Henikoff. Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Science USA, 89(22):10915--10919, Nov. 1992.Google ScholarGoogle ScholarCross RefCross Ref
  13. P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, Dallas, TX, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. S. Karlin and S. F. Altschul. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proceedings of the National Academy of Science USA, 87:2264--8, 1990.Google ScholarGoogle ScholarCross RefCross Ref
  15. P. Pevzner and M. S. Waterman. Multiple filtration and approximate pattern matching. Algorithmica, 13:135--54, 1995.Google ScholarGoogle ScholarCross RefCross Ref
  16. I. Rigoutsos and A. Califano. FLASH: a fast look-up algorithm for string homology. In Proceedings of the First International Conference on Intelligent Systems for Molecular Biology, pages 56--64, 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. S. Schwartz, Z. Zhang, K. A. Frazer, A. F. Smit, C. Riemer, J. Bouck, R. A. Gibbs, R. Hardison, and W. Miller. PipMaker -- a web server for aligning two genomic DNA sequences. Genome Research, 10:577--86, 2000.Google ScholarGoogle ScholarCross RefCross Ref
  18. A. F. Smit and P. Green. Repeatmasker, 1999. http://ftp.genome.washington.edu/RM/RepeatMasker.html.Google ScholarGoogle Scholar
  19. T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195--97, Mar. 1981.Google ScholarGoogle ScholarCross RefCross Ref
  20. D. J. States, W. Gish, and S. F. Altschul. Improved sensitivity of nucleic acid database searches using application-specific scoring matrices. Methods: a Companion to Methods in Enzymology, 3(1):66--70, 1991.Google ScholarGoogle Scholar
  21. R. L. Tatusov, D. A. Natale, I. V. Garkavtsev, T. A. Tatusova, U. T. Shankavaram, B. S. Rao, B. Kiryutin, M. Y. Galperin, N. D. Fedorova, and E. V. Koonin. The COG database: new developments in phylogenetic classifications of proteins from complete genomes. Nucleic Acids Research, 1:22--8, 2001.Google ScholarGoogle ScholarCross RefCross Ref
  22. R. K. Wilson, E. Lai, P. Concannon, R. K. Barth, and L. E. Hood. Structure, organization, and polymorphism of murine and human T-cell receptor α and β chain gene families. Immunological Reviews, 101:149--72, 1988.Google ScholarGoogle ScholarCross RefCross Ref
  23. J. C. Wooton and S. Federhen. Statistics of local complexity in amino acid sequences and sequence databases. Computers and Chemistry, 17:149--63, 1993.Google ScholarGoogle ScholarCross RefCross Ref

Index Terms

  1. Provably sensitive Indexing strategies for biosequence similarity search

              Recommendations

              Comments

              Login options

              Check if you have access through your login credentials or your institution to get full access on this article.

              Sign in
              • Published in

                cover image ACM Conferences
                RECOMB '02: Proceedings of the sixth annual international conference on Computational biology
                April 2002
                341 pages
                ISBN:1581134983
                DOI:10.1145/565196

                Copyright © 2002 ACM

                Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

                Publisher

                Association for Computing Machinery

                New York, NY, United States

                Publication History

                • Published: 18 April 2002

                Permissions

                Request permissions about this article.

                Request Permissions

                Check for updates

                Qualifiers

                • Article

                Acceptance Rates

                RECOMB '02 Paper Acceptance Rate35of118submissions,30%Overall Acceptance Rate148of538submissions,28%

              PDF Format

              View or Download as a PDF file.

              PDF

              eReader

              View online with eReader.

              eReader