Article

Provably sensitive Indexing strategies for biosequence similarity search

Author:
Jeremy Buhler

Washington University, St. Louis, MO

Washington University, St. Louis, MO
View Profile

RECOMB '02: Proceedings of the sixth annual international conference on Computational biologyApril 2002Pages 90–99https://doi.org/10.1145/565196.565208

Published:18 April 2002Publication History

RECOMB '02: Proceedings of the sixth annual international conference on Computational biology

Pages 90–99

ABSTRACT

The field of algorithms for pairwise biosequence similarity search is dominated by heuristic methods of high efficiency but uncertain sensitivity. One reason that more formal string matching algorithms with sensitivity guarantees have not been applied to biosequences is that they cannot directly find similarities that score highly under substitution score functions such as the DNAPAM-TT [20], PAM [9], or BLOSUM [12] families of matrices. We describe a general technique, score simulation, to map ungapped similarity search problems using these score functions into the problem of finding pairs of strings that are close in Hamming space. Score simulation leads to indexing schemes for biosequences that permit efficient ungapped similarity searches with formal guarantees of sensitivity using arbitrary score functions. In particular, we introduce the lsh-all-pairs-sim algorithm for finding local similarities in large biosequence collections and show that it is both computationally feasible and sensitive in practice.

References

S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25(17):3389--402, 1997.Google ScholarCross Ref
J. Amilhastre. Complexity of minimum biclique decomposition of bipartite graphs. http://citeseer.nj.nec.com/148735.html.Google Scholar
F. R. Blattner, G. Plunkett III, C. A. Bloch, N. T. Perna, V. Burland, et al. The complete genome sequence of Escherichia coli K-12. Science, 277:1453--74, 1997.Google ScholarCross Ref
J. Buhler. Efficient large-scale sequence comparison by locality-sensitive hashing. Bioinformatics, 17:419--28, 2001.Google ScholarCross Ref
J. Buhler. Search Algorithms for Biosequences Using Random Projection. PhD thesis, University of Washington, Seattle, WA, 2001. Available online at http://www.cs.wustl.edu/~jbuhler/thesis.html. Google ScholarDigital Library
J. Buhler and M. Tompa. Finding motifs using random projections. In RECOMB01: Proceedings of the Fifth Annual International Conference on Computational Molecular Biology, Montreal, Canada, Apr. 2001. Google ScholarDigital Library
S. Center. Bacteroides fragilis genome, 2001. ftp://ftp.sanger.ac.uk/pub/pathogens/bf/.Google Scholar
W. I. Chang and E. L. Lawler. Sublinear expected time approximate string matching and biological applications. Algorithmica, 12:327--44, 1994.Google ScholarCross Ref
M. O. Dayhoff, R. Schwartz, and B. C. Orcutt. A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure, 5:345--52, 1978.Google Scholar
A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proceedings of the 25th International Conference on Very Large Databases, Edinburgh, Scotland, 1999. Google ScholarDigital Library
J. F. Heidelberg, J. A. Eisen, W. C. Nelson, R. A. Clayton, M. L. Gwinn, et al. DNA sequence of both chromosomes of the cholera pathogen Vibrio cholerae. Nature, 406:477--83, 2000.Google ScholarCross Ref
S. Henikoff and J. G. Henikoff. Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Science USA, 89(22):10915--10919, Nov. 1992.Google ScholarCross Ref
P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, Dallas, TX, 1998. Google ScholarDigital Library
S. Karlin and S. F. Altschul. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proceedings of the National Academy of Science USA, 87:2264--8, 1990.Google ScholarCross Ref
P. Pevzner and M. S. Waterman. Multiple filtration and approximate pattern matching. Algorithmica, 13:135--54, 1995.Google ScholarCross Ref
I. Rigoutsos and A. Califano. FLASH: a fast look-up algorithm for string homology. In Proceedings of the First International Conference on Intelligent Systems for Molecular Biology, pages 56--64, 1993. Google ScholarDigital Library
S. Schwartz, Z. Zhang, K. A. Frazer, A. F. Smit, C. Riemer, J. Bouck, R. A. Gibbs, R. Hardison, and W. Miller. PipMaker -- a web server for aligning two genomic DNA sequences. Genome Research, 10:577--86, 2000.Google ScholarCross Ref
A. F. Smit and P. Green. Repeatmasker, 1999. http://ftp.genome.washington.edu/RM/RepeatMasker.html.Google Scholar
T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195--97, Mar. 1981.Google ScholarCross Ref
D. J. States, W. Gish, and S. F. Altschul. Improved sensitivity of nucleic acid database searches using application-specific scoring matrices. Methods: a Companion to Methods in Enzymology, 3(1):66--70, 1991.Google Scholar
R. L. Tatusov, D. A. Natale, I. V. Garkavtsev, T. A. Tatusova, U. T. Shankavaram, B. S. Rao, B. Kiryutin, M. Y. Galperin, N. D. Fedorova, and E. V. Koonin. The COG database: new developments in phylogenetic classifications of proteins from complete genomes. Nucleic Acids Research, 1:22--8, 2001.Google ScholarCross Ref
R. K. Wilson, E. Lai, P. Concannon, R. K. Barth, and L. E. Hood. Structure, organization, and polymorphism of murine and human T-cell receptor α and β chain gene families. Immunological Reviews, 101:149--72, 1988.Google ScholarCross Ref
J. C. Wooton and S. Federhen. Statistics of local complexity in amino acid sequences and sequence databases. Computers and Chemistry, 17:149--63, 1993.Google ScholarCross Ref

Index Terms

Provably sensitive Indexing strategies for biosequence similarity search

Recommendations

Biosequence Similarity Search on the Mercury System

Biosequence similarity search is an important application in modern molecular biology. Search algorithms aim to identify sets of sequences whose extensional similarity suggests a common evolutionary origin or function. The most widely used similarity ...
Read More
Biosequence Similarity Search on the Mercury System
ASAP '04: Proceedings of the Application-Specific Systems, Architectures and Processors, 15th IEEE International Conference

Biosequence similarity search is an important application in modern molecular biology. Search algorithms aim to identify sets of sequences whose extensional similarity suggests a common evolutionary origin or function. The most widely used similarity ...
Read More
Application of Comparative Biosequence Analysis to Understand Antibiotic Resistance in Superbug
BCB '19: Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics

The advent of synthetic antibiotics revolutionized the treatment of bacterial infections in humans. Antibiotics are classified into several different families, depending on their mode of action, including beta-lactams, macrolides, and quinolones. Beta-...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
RECOMB '02: Proceedings of the sixth annual international conference on Computational biology
April 2002
341 pages
ISBN:1581134983
DOI:10.1145/565196
Editors:
Gene Myers
Celera, USA
,
Sridhar Hannenhalli
Celera, USA
,
David Sankoff
University of Montréal, Canada
,
Sorin Istrail
Celera, USA
,
Pavel Pevzner
University of California at San Diego, USA
,
Michael Waterman
University of California, USA
Copyright © 2002 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 18 April 2002
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Qualifiers
- Article
Conference

Acceptance Rates
RECOMB '02 Paper Acceptance Rate35of118submissions,30%Overall Acceptance Rate148of538submissions,28%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 10
  Total Citations
  View Citations
- 590
  Total Downloads
- Downloads (Last 12 months)1
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.