2004 | OriginalPaper | Book chapter

A Scalable System for Identifying Co-derivative Documents

Written by: Yaniv Bernstein, Justin Zobel

Published in: String Processing and Information Retrieval

Publisher: Springer Berlin Heidelberg

Documents are co-derivative if they share content: for two documents to be co-derived, some portion of one must be derived from the other or some portion of both must be derived from a third document. The current technique for concurrently detecting all co-derivatives in a collection is document fingerprinting, which matches documents based on the hash values of selected document subsequences, or chunks. Fingerprinting is currently hampered by an inability to accurately isolate information that is useful in identifying co-derivatives. In this paper we present spex, a novel hash-based algorithm for extracting duplicated chunks from a document collection. We discuss how information about shared chunks can be used for efficiently and reliably identifying co-derivative clusters, and describe deco, a prototype system that makes use of spex. Our experiments with several document collections demonstrate the effectiveness of the approach.
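
The general idea of matching documents by hashed chunks can be illustrated with a minimal sketch. This is not spex or deco, whose details are not given in the abstract; it is a plain chunk-hashing baseline under stated assumptions: chunks are overlapping runs of three words, hashes are truncated SHA-256 digests, and the function names (`chunks`, `chunk_hash`, `shared_chunk_counts`) are hypothetical. Chunks that occur in only one document are skipped, since only shared chunks carry evidence of co-derivation.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def chunks(text, size=3):
    """Yield every overlapping run of `size` words (a chunk) from the text."""
    words = text.lower().split()
    for i in range(len(words) - size + 1):
        yield " ".join(words[i:i + size])


def chunk_hash(chunk):
    """Map a chunk to a short, stable hash value (assumed: truncated SHA-256)."""
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()[:16]


def shared_chunk_counts(docs, size=3):
    """Count, for each document pair, the distinct chunk hashes they share."""
    postings = defaultdict(set)  # chunk hash -> ids of documents containing it
    for doc_id, text in docs.items():
        for chunk in chunks(text, size):
            postings[chunk_hash(chunk)].add(doc_id)

    pair_counts = defaultdict(int)
    for doc_ids in postings.values():
        if len(doc_ids) < 2:
            continue  # chunk appears in a single document: no evidence
        for a, b in combinations(sorted(doc_ids), 2):
            pair_counts[(a, b)] += 1
    return dict(pair_counts)


# Toy collection (invented for illustration only).
docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "yesterday the quick brown fox jumps over a fence",
    "c": "an entirely unrelated sentence about string processing",
}
result = shared_chunk_counts(docs)
print(result)
```

In this toy collection, documents "a" and "b" share four three-word chunks, while "c" shares none with either. A real system would additionally need a threshold or normalisation step to turn such counts into co-derivative clusters.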

Metadata
Title
A Scalable System for Identifying Co-derivative Documents
Written by
Yaniv Bernstein
Justin Zobel
Copyright year
2004
Publisher
Springer Berlin Heidelberg
DOI
https://doi.org/10.1007/978-3-540-30213-1_6
