2009 | OriginalPaper | Book chapter
Exploiting Sentence-Level Features for Near-Duplicate Document Detection
Authors: Jenq-Haur Wang, Hung-Chi Chang
Published in: Information Retrieval Technology
Publisher: Springer Berlin Heidelberg
Digital documents are easy to copy, so detecting possible near-duplicate copies effectively is critical in Web search. Conventional copy detection approaches, such as document fingerprinting and bag-of-words similarity, target different levels of granularity in document features, from word n-grams to whole documents. In this paper, we focus on the mutual-inclusive type of near-duplicates, where only partial overlap among documents makes them similar. We propose using a simple and compact sentence-level feature, the sequence of sentence lengths, for near-duplicate copy detection. Various configurations of sentence-level and word-level algorithms are evaluated. The experimental results show that sentence-level algorithms achieved higher efficiency with comparable precision and recall rates.
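
To make the sentence-length feature concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how sentences are split, how length is measured, or how sequences are compared. The sketch assumes sentences end at `.`, `!` or `?`, that length is counted in words, and that partial overlap is scored by the longest contiguous run of matching lengths. The function names `sentence_lengths` and `longest_shared_run` are hypothetical.

```python
import re
from difflib import SequenceMatcher


def sentence_lengths(text):
    """Return the word count of each sentence in `text`, in order."""
    # Assumption: a sentence ends at '.', '!' or '?'. The paper may split differently.
    sentences = re.split(r"[.!?]+", text)
    # Assumption: length is measured in whitespace-separated words.
    return [len(s.split()) for s in sentences if s.strip()]


def longest_shared_run(a, b):
    """Return the size of the longest contiguous run of lengths common to a and b."""
    # Assumption: contiguous matching stands in for the paper's comparison step.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


doc_a = ("Digital documents are easy to copy. Detection matters in Web search. "
         "We study partial overlap. Results follow.")
doc_b = ("This is an unrelated opening sentence here. Detection matters in Web search. "
         "We study partial overlap. That is all for now.")

lengths_a = sentence_lengths(doc_a)
lengths_b = sentence_lengths(doc_b)
shared = longest_shared_run(lengths_a, lengths_b)
print(lengths_a, lengths_b, shared)
```

The two sample documents share only their two middle sentences, which is the partial-overlap case the abstract calls mutual-inclusive. Each document reduces to a short list of integers, which suggests why such a feature is compact. Because different sentences can have the same length, a practical detector would presumably require a longer shared run before reporting a match.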