
2004 | OriginalPaper | Book chapter

An Automated Algorithm for Extracting Website Skeleton

Written by: Zehua Liu, Wee Keong Ng, Ee-Peng Lim

Published in: Database Systems for Advanced Applications

Publisher: Springer Berlin Heidelberg


The huge amount of information available on the Web has attracted many research efforts into developing wrappers that extract data from webpages. However, because most wrapper-generation systems focus on extracting data at the page level, data extraction at the site level remains a manual or semi-automatic process. In this paper, we study the problem of extracting a website's skeleton, i.e., the underlying hyperlink structure that organizes the content pages of a given website. We propose an automated algorithm, called the Sew algorithm, to discover the skeleton of a website. Given a page, the algorithm examines hyperlinks in groups and identifies the navigation links that point to pages at the next level of the website structure. The entire skeleton is then constructed by recursively fetching the pages pointed to by the discovered links and analyzing them with the same process. Our experiments on real-life websites show that the algorithm achieves high recall with moderate precision.
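The recursive process described in the abstract can be illustrated with a minimal Python sketch. It is not the Sew algorithm itself: the abstract does not say how hyperlinks are grouped or how navigation links are recognized, so the grouping by enclosing container tag and the "largest group of unvisited links" rule below are placeholder assumptions. The names `LinkGroupParser`, `pick_navigation_group`, `extract_skeleton`, and the in-memory `SITE` are all hypothetical.

```python
from html.parser import HTMLParser

# Assumption: links that share the same enclosing container form one group.
CONTAINERS = {"ul", "ol", "nav", "table", "div"}


class LinkGroupParser(HTMLParser):
    """Collects hrefs, grouped by the innermost open container tag."""

    def __init__(self):
        super().__init__()
        self.stack = [0]   # ids of open containers; 0 stands for the page root
        self.next_id = 1
        self.groups = {}   # container id -> hrefs in document order

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINERS:
            self.stack.append(self.next_id)
            self.next_id += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.groups.setdefault(self.stack[-1], []).append(href)

    def handle_endtag(self, tag):
        if tag in CONTAINERS and len(self.stack) > 1:
            self.stack.pop()


def link_groups(html):
    parser = LinkGroupParser()
    parser.feed(html)
    return list(parser.groups.values())


def pick_navigation_group(groups, visited):
    # Placeholder heuristic (an assumption, not the paper's criterion):
    # choose the largest group that has at least two unvisited links.
    best = []
    for group in groups:
        fresh = [h for h in dict.fromkeys(group) if h not in visited]
        if len(fresh) >= 2 and len(fresh) > len(best):
            best = fresh
    return best


def extract_skeleton(url, fetch, visited=None, depth=0, max_depth=3):
    """Builds a tree of pages by following navigation links recursively."""
    visited = {url} if visited is None else visited
    node = {"url": url, "children": []}
    if depth >= max_depth:
        return node
    nav = pick_navigation_group(link_groups(fetch(url)), visited)
    visited.update(nav)
    for child in nav:
        node["children"].append(
            extract_skeleton(child, fetch, visited, depth + 1, max_depth)
        )
    return node


# Hypothetical three-level site held in memory, so no network access is needed.
SITE = {
    "/": '<ul><li><a href="/news">News</a></li>'
         '<li><a href="/sports">Sports</a></li></ul>'
         '<p><a href="/about">About</a></p>',
    "/news": '<ul><a href="/news/world">World</a>'
             '<a href="/news/tech">Tech</a></ul>'
             '<div><a href="/">Home</a></div>',
    "/sports": '<a href="/">Home</a>',
}

skeleton = extract_skeleton("/", lambda u: SITE.get(u, ""))
```

Passing `fetch` as a parameter keeps the traversal separate from page retrieval, so the same sketch could be pointed at an HTTP client instead of the in-memory dictionary. The `visited` set and `max_depth` limit are added here only to guarantee termination on sites whose links form cycles.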

Metadata
Title
An Automated Algorithm for Extracting Website Skeleton
Written by
Zehua Liu
Wee Keong Ng
Ee-Peng Lim
Copyright year
2004
Publisher
Springer Berlin Heidelberg
DOI
https://doi.org/10.1007/978-3-540-24571-1_70
