2006 | OriginalPaper | Chapter
Adaptive Topical Web Crawling for Domain-Specific Resource Discovery Guided by Link-Context
Authors : Tao Peng, Fengling He, Wanli Zuo, Changli Zhang
Published in: MICAI 2006: Advances in Artificial Intelligence
Publisher: Springer Berlin Heidelberg
Activate our intelligent search to find suitable subject content or patents.
Select sections of text to find matching patents with Artificial Intelligence. powered by
Select sections of text to find additional relevant content using AI-assisted search. powered by
Topical web crawling technology is important for domain-specific resource discovery. Topical crawlers yield good recall as well as good precision by restricting themselves to a specific domain from web pages. There is an intuition that the text surrounding a link or the link-context on the HMTL page is a good summary of the target page. Motivated by that, This paper investigates some alternative methods and advocates that the link-context derived from reference page’s HTML tag tree can provide a wealth of illumination for steering crawler to stay on domain-specific topic. In order that crawler can acquire enough illumination from link-context, we initially look for some referring pages by traversing backward from seed URLs, and then build initial term-based feature set by parsing the link-contexts extracted from those reference web pages. Used to measure the similarity between the crawled pages’ link-context, the feature set can be adaptively trained by some link-contexts to relevant pages during crawling. This paper also presents some important metrics and an evaluation function for ranking URLs about pages relevance. A comprehensive experiment has been conducted, the result shows obviously that this approach outperforms Best-First and Breath-First algorithm both in harvest rate and efficiency.