2005 | OriginalPaper | Book chapter
A Sentence-Based Copy Detection Approach for Web Documents
Authors: Rajiv Yerra, Yiu-Kai Ng
Published in: Fuzzy Systems and Knowledge Discovery
Publisher: Springer Berlin Heidelberg
Web documents that partially or completely duplicate one another's content are easy to find on the Internet these days. Not only do these documents create redundant information on the Web, which makes filtering out unique information take longer and consumes additional storage space, but they also degrade the efficiency of Web information retrieval. In this paper, we present a sentence-based copy detection approach for Web documents, which determines whether any two given Web documents contain overlapping portions and graphically displays the locations of the (semantically) same sentences detected in the documents. Two sentences are treated as either the same or different according to their degree of similarity, computed using either the three least-frequent 4-gram approach or the fuzzy-set information retrieval (IR) approach. Experimental results show that the fuzzy-set IR approach outperforms the three least-frequent 4-gram approach in our copy detection approach, which handles a wide range of documents in different subject areas and does not require static word lists.
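
The abstract names the three least-frequent 4-gram approach without defining it, so the following is only a rough sketch of one plausible reading, not the authors' method. It assumes that 4-grams are character-level, that frequency is counted over the two documents being compared, and that two sentences count as "the same" when their three rarest 4-grams coincide. All function names (`four_grams`, `signature`, `matching_sentences`) are hypothetical.

```python
from collections import Counter


def four_grams(sentence):
    """Return the character 4-grams of a sentence (lowercased, whitespace removed)."""
    text = "".join(sentence.lower().split())
    return [text[i:i + 4] for i in range(len(text) - 3)]


def signature(sentence, gram_counts, k=3):
    """Pick the k 4-grams of the sentence that are rarest in the collection.

    Assumption: rarity is measured by gram_counts, built from both documents.
    """
    grams = set(four_grams(sentence))
    # Sort by collection frequency, then alphabetically for a deterministic tie-break.
    ranked = sorted(grams, key=lambda g: (gram_counts[g], g))
    return tuple(ranked[:k])


def matching_sentences(doc_a, doc_b):
    """Return index pairs (i, j) of sentences whose signatures are identical.

    doc_a and doc_b are lists of sentences (strings).
    """
    gram_counts = Counter(g for s in doc_a + doc_b for g in four_grams(s))

    # Index the sentences of doc_b by signature for constant-time lookup.
    sigs_b = {}
    for j, s in enumerate(doc_b):
        sigs_b.setdefault(signature(s, gram_counts), []).append(j)

    pairs = []
    for i, s in enumerate(doc_a):
        sig = signature(s, gram_counts)
        if sig:  # skip sentences too short to yield any 4-gram
            pairs.extend((i, j) for j in sigs_b.get(sig, []))
    return pairs
```

Calling `matching_sentences(doc_a, doc_b)` on two lists of sentences yields the positions of sentences that this heuristic treats as copies. Those positions are what a graphical display of overlapping portions could be driven by. The fuzzy-set IR approach, which the abstract reports as performing better, relies on word-level similarity and is not covered by this sketch.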