ABSTRACT
Many genomic sequences and, more generally, (multivariate) time series display tremendous variability. However, often it is reasonable to assume that the sequence is actually generated by or assembled from a small number of sources, each of which might contribute several segments to the sequence. That is, there are h hidden sources such that the sequence can be written as a concatenation of k > h pieces, each of which stems from one of the h sources. We define this (k,h)-segmentation problem and show that it is NP-hard in the general case. We give approximation algorithms achieving approximation ratios of 3 for the L1 error measure and √5 for the L2 error measure, and generalize the results to higher dimensions. We give empirical results on real (chromosome 22) and artificial data showing that the methods work well in practice.
- S. Arora, P. Raghavan, and S. Rao. Approximation schemes for euclidean k -medians and related problems. In ACM Symposium on Theory of Computing, pages 106--113, 1998. Google ScholarDigital Library
- R. K. Azad, J. S. Rao, W. Li, and R. Ramaswamy. Simplifying the mosaic description of DNA sequences. Physical Review E, 66, article 031913, 2002.Google Scholar
- R. Bellman. On the approximation of curves by line segments using dynamic programming. Commun. ACM, 4(6), 1961. Google ScholarDigital Library
- K. Bennett. Determination of the number of zones in a biostratigraphical sequence. New Phytol., 132:155--170, 1996.Google ScholarCross Ref
- A. Cantoni. Optimal curve fitting with piecewise linear functions. IEEE Transactions on Computers, C-20(1):59--67, 1971.Google ScholarDigital Library
- G. A. Churchill. Stochastic models for heterogeneous DNA sequences. Bulletin of Mathematical Biology, 51:79--94, 1989.Google ScholarCross Ref
- J. Himberg, K. Korpiaho, H. Mannila, J. Tikanmaki, and H. T. Toivonen. Time series segmentation for context recognition in mobile devices. In The 2001 IEEE International Conference on Data Mining (ICDM'01), pages 203--210, 2001. Google ScholarDigital Library
- T. Kanungo, D. Mount, N. Netanyahu, C. Piatko, R.Silverman, and A. Wu. A local search approximation algorithm for k-means clustering. In Proceedings of the 18th Annual Symposium on Computational Geometry, pages 10--18, 2002. Google ScholarDigital Library
- E. Keogh, S. Chu, D. Hart, and M. Pazzani. An online algorithm for segmenting time series. In IEEE International Conference on Data Mining, pages 289--296, 2001. Google ScholarDigital Library
- W. Li. DNA segmentation as a model selection process. In RECOMB 2001, pages 204--210, 2001. Google ScholarDigital Library
- J. Liu and C. Lawrence. Bayesian inference on biopolymer models. Bioinformatics, 15(1):38--52, 1999.Google ScholarCross Ref
- N. Megiddo and K. J. Supowit. On the complexity of some common geometric location problems. SIAM Journal on Computing, 13(1):182--196, 1984.Google ScholarCross Ref
- N. Megiddo, E. Zemel, and S. L. Hakimi. The maximum coverage location problem. SIAM Journal on Algebraic and Discrete Methods, 4:253--261, 1983.Google ScholarCross Ref
- A. Pavlicek, J. Paces, O. Clay, and G. Bernardi. A compact view of isochores in the draft human genome sequence. FEBS Letters, 511:165--169, 2002.Google ScholarCross Ref
- V. Ramensky, V. Makeev, M. Roytberg, and V. Tumanyan. DNA segmentation through the bayesian approach. Journal of Computational Biology, 7(1/2):215--231, 2000.Google ScholarCross Ref
- M. Salmenkivi, J. Kere, and H. Mannila. Genome segmentation using piecewise constant intensity models and reversible jump MCMC. In European Conference on Computational Biology (ECCB2003), 2003. To appear.Google Scholar
- G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 7(2):461--464, 1978.Google ScholarCross Ref
Index Terms
- Finding recurrent sources in sequences
Recommendations
Efficient algorithms for finding interleaving relationship between sequences
The longest common subsequence and sequence alignment problems have been studied extensively and they can be regarded as the relationship measurement between sequences. However, most of them treat sequences evenly or consider only two sequences. ...
The relation between preset distinguishing sequences and synchronizing sequences
AbstractWe study the relation between synchronizing sequences and preset distinguishing sequences which are some special sequences used in finite state machine based testing. We show that the problems related to preset distinguishing sequences can be ...
Finding effective compilation sequences
LCTES '04Most modern compilers operate by applying a fixed, program-independent sequence of optimizations to all programs. Compiler writers choose a single "compilation sequence", or perhaps a couple of compilation sequences. In choosing a sequence, they may ...
Comments