Near-duplicate detection by instance-level constrained clustering

Authors:
Hui Yang, Carnegie Mellon University, Pittsburgh, PA
Jamie Callan, Carnegie Mellon University, Pittsburgh, PA

SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
August 2006, Pages 421–428
https://doi.org/10.1145/1148170.1148243

Published: 06 August 2006

ABSTRACT

For the task of near-duplicate document detection, neither the traditional fingerprinting techniques used in the database community nor the bag-of-words comparison approaches used in the information retrieval community are sufficiently accurate. This is because the characteristics of near-duplicate documents differ from those of both "almost-identical" documents in the data cleaning task and "relevant" documents in the search task. This paper presents an instance-level constrained clustering approach for near-duplicate detection. The framework incorporates information such as document attributes and content structure into the clustering process to form near-duplicate clusters. Experimental results on several collections of public comments sent to U.S. government agencies on proposed new regulations demonstrate that our approach outperforms other near-duplicate detection algorithms and is about as effective as human assessors.
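
To make the idea of instance-level constraints concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a plain bag-of-words cosine similarity, a fixed merge threshold, and hypothetical `must_link` / `cannot_link` index pairs standing in for whatever document attributes and content structure would supply in practice; the function names, the threshold value, and the sample texts are all illustrative assumptions.

```python
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def constrained_clusters(docs, must_link=(), cannot_link=(), threshold=0.8):
    """Group documents into clusters under pairwise (instance-level) constraints.

    docs        -- list of document strings
    must_link   -- (i, j) index pairs that must share a cluster (hypothetical input)
    cannot_link -- (i, j) index pairs that must not share a cluster (hypothetical input)
    threshold   -- minimum cosine similarity for a merge (illustrative value)
    """
    bags = [Counter(d.lower().split()) for d in docs]
    parent = list(range(len(docs)))  # union-find forest over document indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Must-link pairs are merged unconditionally, before any similarity test.
    for i, j in must_link:
        parent[find(i)] = find(j)

    def violates(i, j):
        """True if merging the clusters of i and j would join a cannot-link pair."""
        roots = {find(i), find(j)}
        return any({find(a), find(b)} == roots for a, b in cannot_link)

    # Merge similar pairs, skipping any merge that a cannot-link pair forbids.
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if (find(i) != find(j)
                    and cosine(bags[i], bags[j]) >= threshold
                    and not violates(i, j)):
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(docs)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Made-up comments, for illustration only.
docs = [
    "please protect the wetlands from pollution",
    "please protect the wetlands from pollution now",
    "i oppose the new mercury rule",
    "please protect the wetlands from pollution today",
]
print(constrained_clusters(docs))
print(constrained_clusters(docs, cannot_link=[(0, 3)]))
```

In this sketch the first call groups the three similar comments together, while the second keeps documents 0 and 3 apart even though their text similarity clears the threshold. That is the sense in which side information can override a purely content-based comparison.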

Index Terms

Near-duplicate detection by instance-level constrained clustering
1. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering

Published in
SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
August 2006, 768 pages
ISBN: 1595933697
DOI: 10.1145/1148170
General Chair: Efthimis N. Efthimiadis, University of Washington
Program Chairs: Susan Dumais, Microsoft Research, Redmond; David Hawking, CSIRO ICT Centre, Canberra, Australia; Kalervo Järvelin, University of Tampere, Finland
Copyright © 2006 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
clustering
duplicate detection
public comments
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 792 of 3,983 submissions, 20%