2007 | OriginalPaper | Buchkapitel
Indexing a Dictionary for Subset Matching Queries
verfasst von : Gad M. Landau, Dekel Tsur, Oren Weimann
Erschienen in: String Processing and Information Retrieval
Verlag: Springer Berlin Heidelberg
Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.
Wählen Sie Textabschnitte aus um mit Künstlicher Intelligenz passenden Patente zu finden. powered by
Markieren Sie Textabschnitte, um KI-gestützt weitere passende Inhalte zu finden. powered by
We consider a subset matching variant of the
Dictionary Query
problem. Consider a dictionary
D
of
n
strings, where each string location contains a set of characters drawn from some alphabet
Σ
. Our goal is to preprocess
D
so when given a query pattern
p
, where each location in
p
contains a single character from
Σ
, we answer if
p
matches to
D
.
p
is said to match to
D
if there is some
s
∈
D
where |
p
| = |
s
| and
p
[
i
] ∈
s
[
i
] for every 1 ≤
i
≤ |
p
|.
To achieve a query time of
O
(|
p
|), we construct a compressed trie of all possible patterns that appear in
D
. Assuming that for every
s
∈
D
there are at most
k
locations where |
s
[
i
]| > 1, we present two constructions of the trie that yield a preprocessing time of
$O(nm +|\Sigma|^k n \lg (\min\{n,m\}))$
, where
n
is the number of strings in
D
and
m
is the maximum length of a string in
D
. The first construction is based on divide and conquer and the second construction uses ideas introduced in [2] for text fingerprinting. Furthermore, we show how to obtain
$O(nm+|\Sigma|^k n +|\Sigma|^{k/2} n\lg (\min\{n,m\}))$
preprocessing time and
$O(|p|\lg\lg|\Sigma|+\min\{|p|,\lg(|\Sigma|^kn)\}\lg \lg(|\Sigma|^kn))$
query time by cutting the dictionary strings and constructing two compressed tries.
Our problem is motivated by haplotype inference from a library of genotypes [14,7]. There,
D
is a known library of genotypes (|
Σ
| = 2), and
p
is a haplotype. Indexing all possible haplotypes that can be inferred from
D
as well as gathering statistical information about them can be used to accelerate various haplotype inference algorithms. In particular, algorithms based on the “pure parsimony criteria” [13,16], greedy heuristics such as “Clarks rule” [6,18], EM based algorithms [1,11,12,20,26,30], and algorithms for inferring haplotypes from a set of Trios [4,27].