ABSTRACT
For the task of near-duplicate document detection, neither the traditional fingerprinting techniques used in the database community nor the bag-of-words comparison approaches used in the information retrieval community are sufficiently accurate. This is because the characteristics of near-duplicate documents differ from those of both the "almost-identical" documents in the data cleaning task and the "relevant" documents in the search task. This paper presents an instance-level constrained clustering approach for near-duplicate detection. The framework incorporates information such as document attributes and content structure into the clustering process to form near-duplicate clusters. Experimental results on several collections of public comments sent to U.S. government agencies on proposed new regulations demonstrate that our approach outperforms other near-duplicate detection algorithms and is about as effective as human assessors.
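To illustrate the general idea of instance-level constraints, here is a minimal Python sketch. It is not the method evaluated in the paper: the Jaccard word-overlap similarity, the greedy single-link merging, the 0.6 threshold, and the names `constrained_cluster`, `must_link`, and `cannot_link` are all assumptions chosen for illustration. In practice, constraints of this kind might be derived from document attributes or content structure.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two token sets (assumed measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def constrained_cluster(docs, must_link, cannot_link, threshold=0.6):
    """Greedy single-link clustering that honors pairwise constraints.

    must_link / cannot_link are lists of (i, j) document index pairs.
    Returns a list holding one cluster id per document.
    """
    tokens = [set(d.lower().split()) for d in docs]
    cluster_of = list(range(len(docs)))  # each document starts alone

    def merge(i, j):
        # Relabel every member of j's cluster with i's cluster id.
        old, new = cluster_of[j], cluster_of[i]
        for k, c in enumerate(cluster_of):
            if c == old:
                cluster_of[k] = new

    def violates(i, j):
        # True if merging would put a cannot-link pair in one cluster.
        ci, cj = cluster_of[i], cluster_of[j]
        return any({cluster_of[a], cluster_of[b]} == {ci, cj}
                   for a, b in cannot_link)

    # Must-link pairs are merged unconditionally before anything else.
    for i, j in must_link:
        merge(i, j)

    # Visit the remaining pairs from most to least similar.
    pairs = sorted(combinations(range(len(docs)), 2),
                   key=lambda p: jaccard(tokens[p[0]], tokens[p[1]]),
                   reverse=True)
    for i, j in pairs:
        if cluster_of[i] == cluster_of[j]:
            continue
        if jaccard(tokens[i], tokens[j]) < threshold:
            break  # all later pairs are even less similar
        if not violates(i, j):
            merge(i, j)
    return cluster_of


# Hypothetical public comments; documents 0 and 2 are assumed to be
# known non-duplicates (for example, because of differing attributes).
docs = [
    "please protect our wetlands from new development",
    "please protect our wetlands from new development now",
    "please protect our wetlands from any development",
    "i oppose the proposed mercury rule",
]
print(constrained_cluster(docs, must_link=[], cannot_link=[(0, 2)]))
```

With the cannot-link constraint, documents 0 and 1 share a cluster while document 2 stays separate. Without it, all three wetlands comments fall into one cluster, which shows how a single constraint can change the grouping that similarity alone would produce.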