2011 | OriginalPaper | Book chapter
Coherence-Oriented Crawling and Navigation Using Patterns for Web Archives
Authors: Myriam Ben Saad, Zeynep Pehlivan, Stéphane Gançarski
Published in: Research and Advanced Technology for Digital Libraries
Publisher: Springer Berlin Heidelberg
In this paper, we address the problem of improving the coherence of web archives under limited resources (e.g. bandwidth, storage space). Coherence measures how well a collection of archived page versions reflects the real state (or snapshot) of a set of related web pages at different points in time. An ideal way to preserve the coherence of an archive is to prevent page content from changing while a complete collection is being crawled. However, this is infeasible in practice because web sites are autonomous and dynamic. We propose two solutions: a priori and a posteriori. As an a priori solution, our idea is to crawl sites during off-peak hours (i.e. the periods of time in which very few changes are expected on the pages), based on patterns. A pattern models how the importance of page changes behaves over a period of time. As an a posteriori solution, based on the same patterns, we introduce a novel navigation approach that enables users to browse the most coherent page versions at a given query time.
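The a priori idea of crawling during off-peak hours can be illustrated with a minimal sketch. It assumes, purely for illustration, that a pattern is stored as a list of per-slot change-importance values (for example one value per hour of the day) and that the crawl needs a fixed number of consecutive slots. The function name `off_peak_window` and this representation are hypothetical and are not taken from the paper, which may define and use patterns differently.

```python
from typing import Sequence, Tuple


def off_peak_window(pattern: Sequence[float], width: int) -> Tuple[int, float]:
    """Return (start_slot, total) for the contiguous window of `width` slots
    with the lowest summed change importance.

    The pattern is treated as periodic, so a window may wrap around the
    end of the period (e.g. from 23:00 to 02:00 for an hourly pattern).
    Ties are resolved in favour of the earliest start slot.
    """
    n = len(pattern)
    if not 0 < width <= n:
        raise ValueError("width must be between 1 and len(pattern)")

    # Sum of the first window, starting at slot 0.
    total = sum(pattern[:width])
    best_start, best_total = 0, total

    for start in range(1, n):
        # Slide the window by one slot: add the slot that enters,
        # subtract the slot that leaves.
        total += pattern[(start + width - 1) % n] - pattern[start - 1]
        if total < best_total:
            best_start, best_total = start, total

    return best_start, best_total


# Hypothetical hourly pattern for one site (assumed values, not measured data).
hourly_importance = [2, 1, 0, 0, 0, 1, 3, 6, 8, 9, 9, 8,
                     7, 8, 9, 9, 8, 7, 6, 5, 5, 4, 3, 2]
start_hour, expected_change = off_peak_window(hourly_importance, width=4)
```

With these assumed values, the quietest four-hour window starts at hour 1, so a crawler following this sketch would schedule the site between 01:00 and 05:00.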