2009 | OriginalPaper | Chapter
Discovering Almost Any Hidden Motif from Multiple Sequences in Polynomial Time with Low Sample Complexity and High Success Probability
Authors : Bin Fu, Ming-Yang Kao, Lusheng Wang
Published in: Theory and Applications of Models of Computation
Publisher: Springer Berlin Heidelberg
Activate our intelligent search to find suitable subject content or patents.
Select sections of text to find matching patents with Artificial Intelligence. powered by
Select sections of text to find additional relevant content using AI-assisted search. powered by
We study a natural probabilistic model for motif discovery that has been used to experimentally test the effectiveness of motif discovery programs. In this model, there are
k
background sequences, and each character in a background sequence is a random character from an alphabet
Σ
. A motif
G
=
g
1
g
2
...
g
m
is a string of
m
characters. Each background sequence is implanted a probabilistically generated approximate copy of
G
. For a probabilistically generated approximate copy
b
1
b
2
...
b
m
of
G
, every character is probabilistically generated such that the probability for
b
i
≠
g
i
is at most
α
. It has been conjectured that multiple background sequences can help with finding faint motifs
G
.
In this paper, we develop an efficient algorithm that can discover a hidden motif from a set of sequences for any alphabet
Σ
with |
Σ
| ≥ 2 and is applicable to DNA motif discovery. We prove that for
$\alpha<{1\over 4}(1-{1\over |\Sigma|})$
and any constant
x
≥ 8, there exist positive constants
c
0
,
ε
,
δ
1
and
δ
2
such that if the length
ρ
of motif
G
is at least
δ
1
log
n
, and there are
k
≥
c
0
log
n
input sequences, then in
O
(
n
2
+
kn
) time this algorithm finds the motif with probability at least
$1-{1\over 2^x}$
for every
$G\in \Sigma^{\rho}-\Psi_{\rho, h,\epsilon}(\Sigma)$
, where
ρ
is the length of the motif,
h
is a parameter with
ρ
≥ 4
h
≥
δ
2
log
n
, and
Ψ
ρ
,
h
,
ε
(
Σ
) is a small subset of at most
$2^{-\Theta(\epsilon^2 h)}$
fraction of the sequences in
Σ
ρ
. The constants
c
0
,
ε
,
δ
1
and
δ
2
do not depend on
x
when
x
is a parameter of order
O
(log
n
). Our algorithm can take any number
k
sequences as input.