skip to main content
10.1145/640075.640091acmconferencesArticle/Chapter ViewAbstractPublication PagesrecombConference Proceedingsconference-collections
Article

Finding recurrent sources in sequences

Published:10 April 2003Publication History

ABSTRACT

Many genomic sequences and, more generally, (multivariate) time series display tremendous variability. However, often it is reasonable to assume that the sequence is actually generated by or assembled from a small number of sources, each of which might contribute several segments to the sequence. That is, there are h hidden sources such that the sequence can be written as a concatenation of k > h pieces, each of which stems from one of the h sources. We define this (k,h)-segmentation problem and show that it is NP-hard in the general case. We give approximation algorithms achieving approximation ratios of 3 for the L1 error measure and √5 for the L2 error measure, and generalize the results to higher dimensions. We give empirical results on real (chromosome 22) and artificial data showing that the methods work well in practice.

References

  1. S. Arora, P. Raghavan, and S. Rao. Approximation schemes for euclidean k -medians and related problems. In ACM Symposium on Theory of Computing, pages 106--113, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. R. K. Azad, J. S. Rao, W. Li, and R. Ramaswamy. Simplifying the mosaic description of DNA sequences. Physical Review E, 66, article 031913, 2002.Google ScholarGoogle Scholar
  3. R. Bellman. On the approximation of curves by line segments using dynamic programming. Commun. ACM, 4(6), 1961. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. K. Bennett. Determination of the number of zones in a biostratigraphical sequence. New Phytol., 132:155--170, 1996.Google ScholarGoogle ScholarCross RefCross Ref
  5. A. Cantoni. Optimal curve fitting with piecewise linear functions. IEEE Transactions on Computers, C-20(1):59--67, 1971.Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. G. A. Churchill. Stochastic models for heterogeneous DNA sequences. Bulletin of Mathematical Biology, 51:79--94, 1989.Google ScholarGoogle ScholarCross RefCross Ref
  7. J. Himberg, K. Korpiaho, H. Mannila, J. Tikanmaki, and H. T. Toivonen. Time series segmentation for context recognition in mobile devices. In The 2001 IEEE International Conference on Data Mining (ICDM'01), pages 203--210, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. T. Kanungo, D. Mount, N. Netanyahu, C. Piatko, R.Silverman, and A. Wu. A local search approximation algorithm for k-means clustering. In Proceedings of the 18th Annual Symposium on Computational Geometry, pages 10--18, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. E. Keogh, S. Chu, D. Hart, and M. Pazzani. An online algorithm for segmenting time series. In IEEE International Conference on Data Mining, pages 289--296, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. W. Li. DNA segmentation as a model selection process. In RECOMB 2001, pages 204--210, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. J. Liu and C. Lawrence. Bayesian inference on biopolymer models. Bioinformatics, 15(1):38--52, 1999.Google ScholarGoogle ScholarCross RefCross Ref
  12. N. Megiddo and K. J. Supowit. On the complexity of some common geometric location problems. SIAM Journal on Computing, 13(1):182--196, 1984.Google ScholarGoogle ScholarCross RefCross Ref
  13. N. Megiddo, E. Zemel, and S. L. Hakimi. The maximum coverage location problem. SIAM Journal on Algebraic and Discrete Methods, 4:253--261, 1983.Google ScholarGoogle ScholarCross RefCross Ref
  14. A. Pavlicek, J. Paces, O. Clay, and G. Bernardi. A compact view of isochores in the draft human genome sequence. FEBS Letters, 511:165--169, 2002.Google ScholarGoogle ScholarCross RefCross Ref
  15. V. Ramensky, V. Makeev, M. Roytberg, and V. Tumanyan. DNA segmentation through the bayesian approach. Journal of Computational Biology, 7(1/2):215--231, 2000.Google ScholarGoogle ScholarCross RefCross Ref
  16. M. Salmenkivi, J. Kere, and H. Mannila. Genome segmentation using piecewise constant intensity models and reversible jump MCMC. In European Conference on Computational Biology (ECCB2003), 2003. To appear.Google ScholarGoogle Scholar
  17. G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 7(2):461--464, 1978.Google ScholarGoogle ScholarCross RefCross Ref

Index Terms

  1. Finding recurrent sources in sequences

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in
          • Published in

            cover image ACM Conferences
            RECOMB '03: Proceedings of the seventh annual international conference on Research in computational molecular biology
            April 2003
            352 pages
            ISBN:1581136358
            DOI:10.1145/640075

            Copyright © 2003 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 10 April 2003

            Permissions

            Request permissions about this article.

            Request Permissions

            Check for updates

            Qualifiers

            • Article

            Acceptance Rates

            RECOMB '03 Paper Acceptance Rate35of175submissions,20%Overall Acceptance Rate148of538submissions,28%

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader