2008 | OriginalPaper | Book chapter
A Wikipedia-Based Multilingual Retrieval Model
Authors: Martin Potthast, Benno Stein, Maik Anderka
Published in: Advances in Information Retrieval
Publisher: Springer Berlin Heidelberg
This paper introduces CL-ESA, a new multilingual retrieval model for the analysis of cross-language similarity. The retrieval model exploits the multilingual alignment of Wikipedia: given a document $d$ written in language $L$, we construct a concept vector $\mathbf{d}$ for $d$, where each dimension $i$ in $\mathbf{d}$ quantifies the similarity of $d$ with respect to a document $d^*_i$ chosen from the “$L$-subset” of Wikipedia. Likewise, for a second document $d'$ written in language $L'$, $L \neq L'$, we construct a concept vector $\mathbf{d}'$, using the topic-aligned counterparts $d'^*_i$ of our previously chosen documents, taken from the $L'$-subset of Wikipedia.
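The construction can be written compactly. The following is a sketch in notation introduced here for illustration only: $\varphi$ stands for an unspecified monolingual similarity function and $n$ for the number of aligned Wikipedia article pairs used as index documents; neither symbol appears in the abstract itself.

$$
\mathbf{d} = \bigl(\varphi(d, d^*_1), \ldots, \varphi(d, d^*_n)\bigr), \qquad
\mathbf{d}' = \bigl(\varphi(d', d'^*_1), \ldots, \varphi(d', d'^*_n)\bigr)
$$

Dimension $i$ refers to the same topic in both vectors because $d^*_i$ and $d'^*_i$ are topic-aligned counterparts.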
Since the two concept vectors $\mathbf{d}$ and $\mathbf{d}'$ are collection-relative representations of $d$ and $d'$, they are language-independent; i.e., their similarity can be computed directly, for instance with the cosine similarity measure.
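To make the idea concrete, here is a minimal Python sketch of the comparison described above. It is not the authors' implementation: the three-concept aligned index, the example documents, the whitespace tokenizer, the use of plain term frequencies, and all function names are assumptions chosen to keep the example short.

```python
import math
from collections import Counter


def tf_vector(text):
    """Toy term-frequency vector: lowercase, split on whitespace (assumption)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def concept_vector(doc, index_docs):
    """Dimension i holds the similarity of doc to the i-th index document."""
    doc_tf = tf_vector(doc)
    return {i: cosine(doc_tf, tf_vector(idx)) for i, idx in enumerate(index_docs)}


# Hypothetical stand-in for topic-aligned Wikipedia articles:
# index_en[i] and index_de[i] describe the same concept.
index_en = [
    "the cat is a small domestic animal",
    "a car is a road vehicle with an engine",
    "bread is a food baked from flour and water",
]
index_de = [
    "die katze ist ein kleines haustier",
    "ein auto ist ein fahrzeug mit motor",
    "brot ist ein lebensmittel aus mehl und wasser",
]

query_en = "my cat is a playful animal"
candidate_de_cat = "die katze ist ein verspieltes haustier"
candidate_de_car = "das auto hat einen starken motor"

# Each document is mapped to a concept vector using the index of its own language.
vec_query = concept_vector(query_en, index_en)
vec_cat = concept_vector(candidate_de_cat, index_de)
vec_car = concept_vector(candidate_de_car, index_de)

# The concept vectors share their dimensions, so they can be compared directly.
sim_cat = cosine(vec_query, vec_cat)
sim_car = cosine(vec_query, vec_car)
print(f"query vs. German cat document: {sim_cat:.3f}")
print(f"query vs. German car document: {sim_car:.3f}")
```

The English query and the German documents share no vocabulary, yet the comparison is possible because both sides are expressed over the same aligned concepts. With only three index documents and no stop-word handling, the scores are coarse; the sketch illustrates the mechanism, not retrieval quality.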
We present results of an extensive analysis that demonstrates the power of this new retrieval model: for a query document $d$, the topically most similar documents from a corpus in another language are properly ranked. A salient property of the new retrieval model is its robustness with respect to both the size and the quality of the index document collection.