DOI: 10.1145/584792.584802
Article

Topic-oriented collaborative crawling

Published: 04 November 2002

ABSTRACT

A major concern in the implementation of a distributed Web crawler is the choice of a strategy for partitioning the Web among the nodes in the system. Our goal in selecting this strategy is to minimize the overlap between the activities of individual nodes. We propose a topic-oriented approach, in which the Web is partitioned into general subject areas with a crawler assigned to each. We examine design alternatives for a topic-oriented distributed crawler, including the creation of a Web page classifier for use in this context. The approach is compared experimentally with a hash-based partitioning, in which crawler assignments are determined by hash functions computed over URLs and page contents. The experimental evaluation demonstrates the feasibility of the approach, addressing issues of communication overhead, duplicate content detection, and page quality assessment.
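
As a rough illustration of the hash-based baseline mentioned in the abstract, the sketch below assigns work to crawler nodes by hashing either a URL or a page's text. It is not taken from the paper: the function names (`stable_hash`, `node_for_url`, `node_for_content`), the choice of SHA-1, and the whitespace and case normalization are all assumptions made for the example.

```python
import hashlib


def stable_hash(text: str) -> int:
    """Return a hash that is identical on every node and every run.

    Python's built-in hash() is salted per process, so a cryptographic
    digest is used instead (SHA-1 here is an arbitrary, assumed choice).
    """
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def node_for_url(url: str, num_nodes: int) -> int:
    """Pick the node responsible for a URL (hypothetical helper)."""
    return stable_hash(url) % num_nodes


def node_for_content(page_text: str, num_nodes: int) -> int:
    """Pick the node responsible for a page body (hypothetical helper).

    Whitespace and case are normalized first (an assumption), so
    trivially different copies of the same text reach the same node.
    """
    normalized = " ".join(page_text.split()).lower()
    return stable_hash(normalized) % num_nodes


if __name__ == "__main__":
    nodes = 4
    print(node_for_url("http://example.com/index.html", nodes))
    print(node_for_content("Hello   world\n", nodes))
```

Because every node computes the same assignment independently, a given URL is fetched by exactly one node, and copies of the same text hashed by content meet at a single node where duplicates can be detected. A topic-oriented scheme would replace the hash with the output of a page classifier.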

      • Published in

        CIKM '02: Proceedings of the eleventh international conference on Information and knowledge management
        November 2002
        704 pages
        ISBN:1581134924
        DOI:10.1145/584792

        Copyright © 2002 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%
