ABSTRACT
The field of algorithms for pairwise biosequence similarity search is dominated by heuristic methods of high efficiency but uncertain sensitivity. One reason that more formal string matching algorithms with sensitivity guarantees have not been applied to biosequences is that they cannot directly find similarities that score highly under substitution score functions such as the DNAPAM-TT [20], PAM [9], or BLOSUM [12] families of matrices. We describe a general technique, score simulation, to map ungapped similarity search problems using these score functions into the problem of finding pairs of strings that are close in Hamming space. Score simulation leads to indexing schemes for biosequences that permit efficient ungapped similarity searches with formal guarantees of sensitivity using arbitrary score functions. In particular, we introduce the lsh-all-pairs-sim algorithm for finding local similarities in large biosequence collections and show that it is both computationally feasible and sensitive in practice.
- S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25(17):3389--402, 1997.Google ScholarCross Ref
- J. Amilhastre. Complexity of minimum biclique decomposition of bipartite graphs. http://citeseer.nj.nec.com/148735.html.Google Scholar
- F. R. Blattner, G. Plunkett III, C. A. Bloch, N. T. Perna, V. Burland, et al. The complete genome sequence of Escherichia coli K-12. Science, 277:1453--74, 1997.Google ScholarCross Ref
- J. Buhler. Efficient large-scale sequence comparison by locality-sensitive hashing. Bioinformatics, 17:419--28, 2001.Google ScholarCross Ref
- J. Buhler. Search Algorithms for Biosequences Using Random Projection. PhD thesis, University of Washington, Seattle, WA, 2001. Available online at http://www.cs.wustl.edu/~jbuhler/thesis.html. Google ScholarDigital Library
- J. Buhler and M. Tompa. Finding motifs using random projections. In RECOMB01: Proceedings of the Fifth Annual International Conference on Computational Molecular Biology, Montreal, Canada, Apr. 2001. Google ScholarDigital Library
- S. Center. Bacteroides fragilis genome, 2001. ftp://ftp.sanger.ac.uk/pub/pathogens/bf/.Google Scholar
- W. I. Chang and E. L. Lawler. Sublinear expected time approximate string matching and biological applications. Algorithmica, 12:327--44, 1994.Google ScholarCross Ref
- M. O. Dayhoff, R. Schwartz, and B. C. Orcutt. A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure, 5:345--52, 1978.Google Scholar
- A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proceedings of the 25th International Conference on Very Large Databases, Edinburgh, Scotland, 1999. Google ScholarDigital Library
- J. F. Heidelberg, J. A. Eisen, W. C. Nelson, R. A. Clayton, M. L. Gwinn, et al. DNA sequence of both chromosomes of the cholera pathogen Vibrio cholerae. Nature, 406:477--83, 2000.Google ScholarCross Ref
- S. Henikoff and J. G. Henikoff. Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Science USA, 89(22):10915--10919, Nov. 1992.Google ScholarCross Ref
- P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, Dallas, TX, 1998. Google ScholarDigital Library
- S. Karlin and S. F. Altschul. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proceedings of the National Academy of Science USA, 87:2264--8, 1990.Google ScholarCross Ref
- P. Pevzner and M. S. Waterman. Multiple filtration and approximate pattern matching. Algorithmica, 13:135--54, 1995.Google ScholarCross Ref
- I. Rigoutsos and A. Califano. FLASH: a fast look-up algorithm for string homology. In Proceedings of the First International Conference on Intelligent Systems for Molecular Biology, pages 56--64, 1993. Google ScholarDigital Library
- S. Schwartz, Z. Zhang, K. A. Frazer, A. F. Smit, C. Riemer, J. Bouck, R. A. Gibbs, R. Hardison, and W. Miller. PipMaker -- a web server for aligning two genomic DNA sequences. Genome Research, 10:577--86, 2000.Google ScholarCross Ref
- A. F. Smit and P. Green. Repeatmasker, 1999. http://ftp.genome.washington.edu/RM/RepeatMasker.html.Google Scholar
- T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195--97, Mar. 1981.Google ScholarCross Ref
- D. J. States, W. Gish, and S. F. Altschul. Improved sensitivity of nucleic acid database searches using application-specific scoring matrices. Methods: a Companion to Methods in Enzymology, 3(1):66--70, 1991.Google Scholar
- R. L. Tatusov, D. A. Natale, I. V. Garkavtsev, T. A. Tatusova, U. T. Shankavaram, B. S. Rao, B. Kiryutin, M. Y. Galperin, N. D. Fedorova, and E. V. Koonin. The COG database: new developments in phylogenetic classifications of proteins from complete genomes. Nucleic Acids Research, 1:22--8, 2001.Google ScholarCross Ref
- R. K. Wilson, E. Lai, P. Concannon, R. K. Barth, and L. E. Hood. Structure, organization, and polymorphism of murine and human T-cell receptor α and β chain gene families. Immunological Reviews, 101:149--72, 1988.Google ScholarCross Ref
- J. C. Wooton and S. Federhen. Statistics of local complexity in amino acid sequences and sequence databases. Computers and Chemistry, 17:149--63, 1993.Google ScholarCross Ref
Index Terms
- Provably sensitive Indexing strategies for biosequence similarity search
Recommendations
Biosequence Similarity Search on the Mercury System
Biosequence similarity search is an important application in modern molecular biology. Search algorithms aim to identify sets of sequences whose extensional similarity suggests a common evolutionary origin or function. The most widely used similarity ...
Biosequence Similarity Search on the Mercury System
ASAP '04: Proceedings of the Application-Specific Systems, Architectures and Processors, 15th IEEE International ConferenceBiosequence similarity search is an important application in modern molecular biology. Search algorithms aim to identify sets of sequences whose extensional similarity suggests a common evolutionary origin or function. The most widely used similarity ...
Application of Comparative Biosequence Analysis to Understand Antibiotic Resistance in Superbug
BCB '19: Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health InformaticsThe advent of synthetic antibiotics revolutionized the treatment of bacterial infections in humans. Antibiotics are classified into several different families, depending on their mode of action, including beta-lactams, macrolides, and quinolones. Beta-...
Comments