Article

Finding recurrent sources in sequences

Authors:
Aristides Gionis

Stanford University, Stanford, CA

Stanford University, Stanford, CA
View Profile

,
Heikki Mannila

University of Helsinki, Helsinki, Finland

University of Helsinki, Helsinki, Finland
View Profile

RECOMB '03: Proceedings of the seventh annual international conference on Research in computational molecular biologyApril 2003Pages 123–130https://doi.org/10.1145/640075.640091

Published:10 April 2003Publication History

RECOMB '03: Proceedings of the seventh annual international conference on Research in computational molecular biology

Pages 123–130

ABSTRACT

Many genomic sequences and, more generally, (multivariate) time series display tremendous variability. However, often it is reasonable to assume that the sequence is actually generated by or assembled from a small number of sources, each of which might contribute several segments to the sequence. That is, there are h hidden sources such that the sequence can be written as a concatenation of k > h pieces, each of which stems from one of the h sources. We define this (k,h)-segmentation problem and show that it is NP-hard in the general case. We give approximation algorithms achieving approximation ratios of 3 for the L₁ error measure and √5 for the L₂ error measure, and generalize the results to higher dimensions. We give empirical results on real (chromosome 22) and artificial data showing that the methods work well in practice.

References

S. Arora, P. Raghavan, and S. Rao. Approximation schemes for euclidean k -medians and related problems. In ACM Symposium on Theory of Computing, pages 106--113, 1998. Google ScholarDigital Library
R. K. Azad, J. S. Rao, W. Li, and R. Ramaswamy. Simplifying the mosaic description of DNA sequences. Physical Review E, 66, article 031913, 2002.Google Scholar
R. Bellman. On the approximation of curves by line segments using dynamic programming. Commun. ACM, 4(6), 1961. Google ScholarDigital Library
K. Bennett. Determination of the number of zones in a biostratigraphical sequence. New Phytol., 132:155--170, 1996.Google ScholarCross Ref
A. Cantoni. Optimal curve fitting with piecewise linear functions. IEEE Transactions on Computers, C-20(1):59--67, 1971.Google ScholarDigital Library
G. A. Churchill. Stochastic models for heterogeneous DNA sequences. Bulletin of Mathematical Biology, 51:79--94, 1989.Google ScholarCross Ref
J. Himberg, K. Korpiaho, H. Mannila, J. Tikanmaki, and H. T. Toivonen. Time series segmentation for context recognition in mobile devices. In The 2001 IEEE International Conference on Data Mining (ICDM'01), pages 203--210, 2001. Google ScholarDigital Library
T. Kanungo, D. Mount, N. Netanyahu, C. Piatko, R.Silverman, and A. Wu. A local search approximation algorithm for k-means clustering. In Proceedings of the 18th Annual Symposium on Computational Geometry, pages 10--18, 2002. Google ScholarDigital Library
E. Keogh, S. Chu, D. Hart, and M. Pazzani. An online algorithm for segmenting time series. In IEEE International Conference on Data Mining, pages 289--296, 2001. Google ScholarDigital Library
W. Li. DNA segmentation as a model selection process. In RECOMB 2001, pages 204--210, 2001. Google ScholarDigital Library
J. Liu and C. Lawrence. Bayesian inference on biopolymer models. Bioinformatics, 15(1):38--52, 1999.Google ScholarCross Ref
N. Megiddo and K. J. Supowit. On the complexity of some common geometric location problems. SIAM Journal on Computing, 13(1):182--196, 1984.Google ScholarCross Ref
N. Megiddo, E. Zemel, and S. L. Hakimi. The maximum coverage location problem. SIAM Journal on Algebraic and Discrete Methods, 4:253--261, 1983.Google ScholarCross Ref
A. Pavlicek, J. Paces, O. Clay, and G. Bernardi. A compact view of isochores in the draft human genome sequence. FEBS Letters, 511:165--169, 2002.Google ScholarCross Ref
V. Ramensky, V. Makeev, M. Roytberg, and V. Tumanyan. DNA segmentation through the bayesian approach. Journal of Computational Biology, 7(1/2):215--231, 2000.Google ScholarCross Ref
M. Salmenkivi, J. Kere, and H. Mannila. Genome segmentation using piecewise constant intensity models and reversible jump MCMC. In European Conference on Computational Biology (ECCB2003), 2003. To appear.Google Scholar
G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 7(2):461--464, 1978.Google ScholarCross Ref

Index Terms

Finding recurrent sources in sequences
1. Applied computing
  1. Life and medical sciences
2. Theory of computation
  1. Randomness, geometry and discrete structures
    1. Computational geometry

Recommendations

Efficient algorithms for finding interleaving relationship between sequences

The longest common subsequence and sequence alignment problems have been studied extensively and they can be regarded as the relationship measurement between sequences. However, most of them treat sequences evenly or consider only two sequences. ...
Read More
The relation between preset distinguishing sequences and synchronizing sequences
Abstract
We study the relation between synchronizing sequences and preset distinguishing sequences which are some special sequences used in finite state machine based testing. We show that the problems related to preset distinguishing sequences can be ...
Read More
Finding effective compilation sequences
LCTES '04

Most modern compilers operate by applying a fixed, program-independent sequence of optimizations to all programs. Compiler writers choose a single "compilation sequence", or perhaps a couple of compilation sequences. In choosing a sequence, they may ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
RECOMB '03: Proceedings of the seventh annual international conference on Research in computational molecular biology
April 2003
352 pages
ISBN:1581136358
DOI:10.1145/640075
Editors:
Martin Vingron
Max-Planck-Institute for Molecular Genetics, Germany
,
Sorin Istrail
Celera Genomics/Applied Biosystems
,
Pavel Pevzner
University of California at San Diego, CA
,
Michael Waterman
University of Southern California, CA
,
Program Chair:
Webb Miller
The Pennsylvania State University
Copyright © 2003 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 10 April 2003
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Qualifiers
- Article
Conference

Acceptance Rates
RECOMB '03 Paper Acceptance Rate35of175submissions,20%Overall Acceptance Rate148of538submissions,28%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 39
  Total Citations
  View Citations
- 758
  Total Downloads
- Downloads (Last 12 months)28
- Downloads (Last 6 weeks)9
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Finding recurrent sources in sequences

RECOMB '03: Proceedings of the seventh annual international conference on Research in computational molecular biology

ABSTRACT

References

Cited By

Index Terms

Recommendations

Efficient algorithms for finding interleaving relationship between sequences

The relation between preset distinguishing sequences and synchronizing sequences

Finding effective compilation sequences