Topical web crawling technology is important for domain-specific resource discovery. Topical crawlers yield good recall as well as good precision by restricting themselves to a specific domain from web pages. There is an intuition that the text surrounding a link or the link-context on the HMTL page is a good summary of the target page. Motivated by that, This paper investigates some alternative methods and advocates that the link-context derived from reference page’s HTML tag tree can provide a wealth of illumination for steering crawler to stay on domain-specific topic. In order that crawler can acquire enough illumination from link-context, we initially look for some referring pages by traversing backward from seed URLs, and then build initial term-based feature set by parsing the link-contexts extracted from those reference web pages. Used to measure the similarity between the crawled pages’ link-context, the feature set can be adaptively trained by some link-contexts to relevant pages during crawling. This paper also presents some important metrics and an evaluation function for ranking URLs about pages relevance. A comprehensive experiment has been conducted, the result shows obviously that this approach outperforms Best-First and Breath-First algorithm both in harvest rate and efficiency.
Weitere Kapitel dieses Buchs durch Wischen aufrufen
Bitte loggen Sie sich ein, um Zugang zu diesem Inhalt zu erhalten
Sie möchten Zugang zu diesem Inhalt erhalten? Dann informieren Sie sich jetzt über unsere Produkte:
- Adaptive Topical Web Crawling for Domain-Specific Resource Discovery Guided by Link-Context
- Springer Berlin Heidelberg
Neuer Inhalt/© ITandMEDIA