Topic-oriented collaborative crawling

Authors:
Chiasen Chung, University of Waterloo, Canada
Charles L. A. Clarke, University of Waterloo, Canada

CIKM '02: Proceedings of the eleventh international conference on Information and knowledge management, November 2002, Pages 34–42
https://doi.org/10.1145/584792.584802

Published: 4 November 2002

ABSTRACT

A major concern in the implementation of a distributed Web crawler is the choice of a strategy for partitioning the Web among the nodes in the system. Our goal in selecting this strategy is to minimize the overlap between the activities of individual nodes. We propose a topic-oriented approach, in which the Web is partitioned into general subject areas with a crawler assigned to each. We examine design alternatives for a topic-oriented distributed crawler, including the creation of a Web page classifier for use in this context. The approach is compared experimentally with a hash-based partitioning, in which crawler assignments are determined by hash functions computed over URLs and page contents. The experimental evaluation demonstrates the feasibility of the approach, addressing issues of communication overhead, duplicate content detection, and page quality assessment.
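
The two partitioning strategies contrasted in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the choice of SHA-1 over the host name, the whitespace normalisation, and the keyword-overlap "classifier" with its tiny vocabulary are all assumptions made purely for illustration. The hash-based functions send a URL or page to a numbered node, while the topic-oriented function sends a page to the crawler responsible for its best-matching subject area.

```python
import hashlib
from urllib.parse import urlsplit


def _bucket(data: str, num_nodes: int) -> int:
    """Map a string to a node index in [0, num_nodes) using SHA-1."""
    digest = hashlib.sha1(data.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def hash_partition_url(url: str, num_nodes: int) -> int:
    """Hash-based assignment over the URL (here: its host name, an assumption)."""
    host = urlsplit(url).hostname or ""
    return _bucket(host, num_nodes)


def hash_partition_content(page_text: str, num_nodes: int) -> int:
    """Hash-based assignment over page content, so identical pages reach the
    same node. Lowercasing and whitespace collapsing are illustrative choices."""
    normalised = " ".join(page_text.split()).lower()
    return _bucket(normalised, num_nodes)


# Hypothetical subject areas and keywords; a real system would use a trained
# Web page classifier rather than keyword overlap.
TOPIC_KEYWORDS = {
    "science": {"research", "experiment", "theory"},
    "sports": {"game", "team", "score"},
}


def topic_partition(page_text: str, topics=TOPIC_KEYWORDS) -> str:
    """Topic-oriented assignment: pick the subject area whose keywords overlap
    most with the page. Ties go to the alphabetically first topic."""
    words = set(page_text.lower().split())
    return max(sorted(topics), key=lambda t: len(words & topics[t]))
```

With hashing over host names, every URL on one host lands on the same node regardless of subject; with the topic-oriented function, a page is routed by what it is about, which is the property the abstract's approach relies on to limit overlap between crawlers.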

Index Terms

Topic-oriented collaborative crawling
1. Information systems
  1. Information retrieval
  2. Information storage systems

Published in
CIKM '02: Proceedings of the eleventh international conference on Information and knowledge management
November 2002
704 pages
ISBN: 1581134924
DOI: 10.1145/584792
General Chair:
Charles Nicholas, University of Maryland Baltimore County
Program Chairs:
David Grossman, Illinois Institute of Technology
Konstantinos Kalpakis, University of Maryland Baltimore County
Sajda Qureshi, Erasmus University, Rotterdam
Han van Dissel, Erasmus University, Rotterdam
Len Seligman, The MITRE Corporation
Copyright © 2002 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
distributed systems
text categorization
web crawling
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%