DOI: 10.1145/511446.511522
Article

Template detection via data mining and its applications

Published: 07 May 2002

ABSTRACT

We formulate and propose the template detection problem, and suggest a practical solution for it based on counting frequent item sets. We show that the use of templates is pervasive on the web. We describe three principles, which characterize the assumptions made by hypertext information retrieval (IR) and data mining (DM) systems, and show that templates are a major source of violation of these principles. As a consequence, basic "pure" implementations of simple search algorithms coupled with template detection and elimination show surprising increases in precision at all levels of recall.
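To make the idea of "counting frequent item sets" concrete, here is a minimal sketch. It is not the authors' algorithm: it assumes each page can be represented as a set of outbound links, treats each page as one transaction, and flags links that belong to a link pair shared by at least `min_support` pages as template material. The function names, the pair-based item sets, the threshold, and the sample data are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_link_pairs(pages, min_support):
    """Return link pairs (2-item sets) that co-occur on >= min_support pages."""
    counts = Counter()
    for links in pages.values():
        # Each page is one "transaction"; its distinct links are the items.
        for pair in combinations(sorted(set(links)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


def strip_template_links(pages, min_support):
    """Drop links that appear in any frequent pair (assumed template links)."""
    frequent = frequent_link_pairs(pages, min_support)
    template_links = {link for pair in frequent for link in pair}
    return {
        url: [link for link in links if link not in template_links]
        for url, links in pages.items()
    }


# Hypothetical toy data: three pages sharing the same navigation links.
pages = {
    "site/a": ["/home", "/about", "/contact", "paper1"],
    "site/b": ["/home", "/about", "/contact", "paper2"],
    "site/c": ["/home", "/about", "/contact", "paper3"],
}
cleaned = strip_template_links(pages, min_support=3)
```

In this toy input the three navigation links co-occur on every page and are removed, leaving only the page-specific link on each page. A real system would need a more careful page representation and support threshold than this sketch assumes.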


      Published in

      WWW '02: Proceedings of the 11th international conference on World Wide Web
      May 2002
      754 pages
      ISBN: 1581134495
      DOI: 10.1145/511446

      Copyright © 2002 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • Article

      Acceptance Rates

      Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%