DOI:10.1145/1148170.1148243
Article

Near-duplicate detection by instance-level constrained clustering

Published: 6 August 2006

ABSTRACT

For the task of near-duplicate document detection, both the traditional fingerprinting techniques used in the database community and the bag-of-words comparison approaches used in the information retrieval community are insufficiently accurate. This is because the characteristics of near-duplicate documents differ from those of both the "almost-identical" documents in the data cleaning task and the "relevant" documents in the search task. This paper presents an instance-level constrained clustering approach for near-duplicate detection. The framework incorporates information such as document attributes and content structure into the clustering process to form near-duplicate clusters. Experimental results on several collections of public comments sent to U.S. government agencies on proposed new regulations demonstrate that our approach outperforms other near-duplicate detection algorithms and is about as effective as human assessors.
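
The abstract does not spell out the clustering procedure, so the following is only a minimal sketch of the general idea of instance-level constraints, not the paper's algorithm. It assumes constraints are given as pairwise must-link and cannot-link index pairs (a common formulation), uses Jaccard similarity over word sets as a stand-in similarity measure, and merges greedily with a hypothetical `threshold` parameter. The function names and sample comments are illustrative only.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def constrained_clusters(docs, must_link=(), cannot_link=(), threshold=0.6):
    """Greedy similarity-based merging that respects pairwise constraints.

    docs: list of strings.
    must_link / cannot_link: iterables of (i, j) index pairs (assumed input form).
    Conflicting constraints are not detected in this sketch.
    """
    must_link = list(must_link)
    cannot_link = list(cannot_link)
    tokens = [set(d.lower().split()) for d in docs]
    parent = list(range(len(docs)))  # union-find forest over document indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        parent[find(j)] = find(i)

    # Must-link pairs are merged up front, regardless of similarity.
    for i, j in must_link:
        union(i, j)

    def violates(i, j):
        # Merging the clusters of i and j is forbidden if any cannot-link
        # pair has one member in each of the two clusters.
        roots = {find(i), find(j)}
        return any({find(a), find(b)} == roots for a, b in cannot_link)

    # Consider the most similar pairs first; stop below the threshold.
    scored = sorted(
        ((jaccard(tokens[i], tokens[j]), i, j)
         for i, j in combinations(range(len(docs)), 2)),
        key=lambda t: t[0],
        reverse=True,
    )
    for sim, i, j in scored:
        if sim < threshold:
            break
        if find(i) != find(j) and not violates(i, j):
            union(i, j)

    clusters = {}
    for i in range(len(docs)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


comments = [
    "please protect the wetlands from new development",
    "please protect the wetlands from new development now",
    "please protect the wetlands from any development",
    "i support the proposed rule on mercury emissions",
]
unconstrained = constrained_clusters(comments)
constrained = constrained_clusters(comments, cannot_link=[(0, 2)])
```

In this toy run, the unconstrained call groups the first three comments together, while the cannot-link pair `(0, 2)` keeps the third comment out of that group even though its word overlap exceeds the threshold. In a setting like the one the abstract describes, such constraints could plausibly be derived from document attributes or content structure, but how the paper derives them is not stated here.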


      • Published in

        SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
        August 2006
        768 pages
        ISBN:1595933697
        DOI:10.1145/1148170

        Copyright © 2006 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States




        Acceptance Rates

        Overall Acceptance Rate: 792 of 3,983 submissions, 20%
