2009 | OriginalPaper | Chapter
Exploiting Sentence-Level Features for Near-Duplicate Document Detection
Authors: Jenq-Haur Wang, Hung-Chi Chang
Published in: Information Retrieval Technology
Publisher: Springer Berlin Heidelberg
Digital documents are easy to copy, so effectively detecting possible near-duplicate copies is critical in Web search. Conventional copy detection approaches, such as document fingerprinting and bag-of-words similarity, target different levels of granularity in document features, from word n-grams to whole documents. In this paper, we focus on the mutual-inclusive type of near-duplicates, where only partial overlap among documents makes them similar. We propose using a simple and compact sentence-level feature, the sequence of sentence lengths, for near-duplicate copy detection. Various configurations of sentence-level and word-level algorithms are evaluated. The experimental results show that sentence-level algorithms achieved higher efficiency with comparable precision and recall rates.
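To make the sentence-length feature concrete, the following is a minimal Python sketch, not the authors' implementation. The abstract does not say how sentences are segmented or how two length sequences are compared, so everything here is an assumption made for illustration: the regex-based splitter, the use of word counts as the sentence length, the longest shared contiguous run as the matching criterion, and the function names `sentence_lengths`, `longest_common_run`, and `overlap_score`.

```python
import re


def sentence_lengths(text):
    """Return the word count of each sentence in `text`.

    Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    This naive splitter is for illustration only.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [len(s.split()) for s in sentences if s]


def longest_common_run(a, b):
    """Length of the longest contiguous run of values shared by a and b.

    Standard dynamic programming for the longest common substring,
    keeping only the previous row of the table.
    """
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def overlap_score(doc_a, doc_b):
    """Hypothetical score in [0, 1]: shared run length over the shorter sequence."""
    a, b = sentence_lengths(doc_a), sentence_lengths(doc_b)
    if not a or not b:
        return 0.0
    return longest_common_run(a, b) / min(len(a), len(b))
```

Normalizing by the shorter sequence is one way to let a document that is partly contained in another still score highly, which matches the partial-overlap setting described above. Because the feature is only a list of small integers per document, unrelated sentences of equal length can match by chance, so a practical detector would presumably require a minimum run length before reporting a candidate pair.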