ABSTRACT
A major concern in the implementation of a distributed Web crawler is the choice of a strategy for partitioning the Web among the nodes in the system. Our goal in selecting this strategy is to minimize the overlap between the activities of individual nodes. We propose a topic-oriented approach, in which the Web is partitioned into general subject areas with a crawler assigned to each. We examine design alternatives for a topic-oriented distributed crawler, including the creation of a Web page classifier for use in this context. The approach is compared experimentally with a hash-based partitioning, in which crawler assignments are determined by hash functions computed over URLs and page contents. The experimental evaluation demonstrates the feasibility of the approach, addressing issues of communication overhead, duplicate content detection, and page quality assessment.
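The contrast between the two partitioning strategies can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the keyword table, and the "general" fallback topic are hypothetical, and a toy keyword-overlap rule stands in for the Web page classifier the abstract mentions. The hash-based variant here hashes the URL string only, although the abstract also mentions hashing page contents.

```python
import hashlib

def hash_assign(url: str, num_nodes: int) -> int:
    """Hash-based partitioning: pick a crawler node from a hash of the URL.

    Assumption: SHA-1 over the raw URL string, reduced modulo the node count.
    """
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes

# Hypothetical topic vocabulary; a real system would use a trained classifier.
TOPIC_KEYWORDS = {
    "sports": {"football", "league", "score"},
    "science": {"physics", "genome", "experiment"},
}

def topic_assign(page_text: str, default: str = "general") -> str:
    """Topic-oriented partitioning: route a page to the crawler for its topic.

    Assumption: the topic with the most keyword matches wins; pages with no
    matches go to a catch-all crawler named by `default`.
    """
    words = set(page_text.lower().split())
    best_topic, best_hits = default, 0
    for topic, keywords in TOPIC_KEYWORDS.items():
        hits = len(words & keywords)
        if hits > best_hits:
            best_topic, best_hits = topic, hits
    return best_topic

if __name__ == "__main__":
    print(hash_assign("http://example.com/index.html", 4))
    print(topic_assign("The league score was close"))
```

In this sketch the hash-based assignment needs only the URL, so a link can be routed before its page is fetched, whereas the topic-based assignment needs page text (or some other evidence of topic) before it can decide.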