2015 | OriginalPaper | Book chapter
Adaptive Clustering-Based Change Prediction for Refreshing Web Repository
Written by: Bundit Manaskasemsak, Petchpoom Pumjang, Arnon Rungsawang
Published in: Computational Science and Its Applications -- ICCSA 2015
Resource constraints, such as time and network bandwidth, prevent modern search engine providers from keeping their local databases fully synchronized with the Web. In this paper, we propose an adaptive clustering-based change prediction approach for refreshing the local web repository. Specifically, we first group the web pages in the current repository into clusters based on their similar change characteristics. We then sample and examine some pages in each cluster to estimate the cluster's change pattern. Clusters of web pages with a higher change probability are then selected and downloaded to update the current repository. Finally, the effectiveness of the current download cycle is examined: each web page is assigned an auxiliary score (page not downloaded), a reward score (correct change prediction), or a penalty score (wrong change prediction). This score is later used to reinforce both the web clustering and the change prediction in subsequent cycles. To evaluate the performance of the proposed approach, we run extensive experiments on snapshots of a real Web dataset of about 282,000 distinct URLs belonging to more than 12,500 websites. The results clearly show that the proposed approach outperforms the existing state-of-the-art clustering-based web crawling policy in that it provides a fresher local web repository with limited resources.
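The download cycle described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `refresh_cycle`, the callable `has_changed`, the page budget, and the numeric score values are all assumptions made for illustration, and the clusters are taken as given rather than computed from change characteristics. For simplicity, sampled pages are not counted against the budget.

```python
import random

# Hypothetical score values: the abstract names the three score types
# but does not state their magnitudes.
REWARD, PENALTY, AUXILIARY = 1.0, -1.0, 0.0


def refresh_cycle(clusters, has_changed, sample_size, budget, rng):
    """Run one download cycle over pre-computed clusters (illustrative only).

    clusters:    dict mapping cluster id -> list of URLs
    has_changed: callable(url) -> bool, standing in for "download and compare"
    sample_size: number of pages sampled per cluster
    budget:      maximum number of pages downloaded in this cycle
    rng:         random.Random instance used for sampling
    Returns (downloaded_urls, scores).
    """
    # Step 1: sample each cluster and estimate its change probability.
    estimates = {}
    for cid, urls in clusters.items():
        if not urls:
            continue  # nothing to sample in an empty cluster
        sample = rng.sample(urls, min(sample_size, len(urls)))
        estimates[cid] = sum(has_changed(u) for u in sample) / len(sample)

    # Step 2: download whole clusters, highest estimate first,
    # skipping any cluster that no longer fits in the budget.
    downloaded, scores = [], {}
    for cid in sorted(estimates, key=estimates.get, reverse=True):
        urls = clusters[cid]
        if len(downloaded) + len(urls) > budget:
            continue
        for u in urls:
            downloaded.append(u)
            # Step 3a: reward a correct prediction, penalize a wrong one.
            scores[u] = REWARD if has_changed(u) else PENALTY

    # Step 3b: pages that were not downloaded get the auxiliary score.
    for urls in clusters.values():
        for u in urls:
            scores.setdefault(u, AUXILIARY)
    return downloaded, scores


if __name__ == "__main__":
    changed = {"a1", "a2"}
    clusters = {"a": ["a1", "a2"], "b": ["b1", "b2"]}
    pages, scores = refresh_cycle(
        clusters, changed.__contains__, sample_size=2, budget=2,
        rng=random.Random(0),
    )
    print(pages, scores)
```

In the approach the abstract describes, these per-page scores would feed back into the next round of clustering and change prediction; that feedback step is not detailed in the abstract and is left out of the sketch.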