2011 | OriginalPaper | Buchkapitel
An Indent Shape Based Approach for Web Lists Mining
verfasst von : Yanxu Zhu, Gang Yin, Huaimin Wang, Dianxi Shi, Xiang Li, Lin Yuan
Erschienen in: Web Information Systems and Mining
Verlag: Springer Berlin Heidelberg
Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.
Wählen Sie Textabschnitte aus um mit Künstlicher Intelligenz passenden Patente zu finden. powered by
Markieren Sie Textabschnitte, um KI-gestützt weitere passende Inhalte zu finden. powered by
Mining repeated patterns from HTML documents is a key step for typical applications of Web information extraction, which require efficient techniques of patterns mining to generate wrappers automatically. Existing approaches such as tree matching and string matching can detect repeated patterns with a high precision, but their efficiency is still a challenge. In this paper, we present a novel approach for Web lists mining based on the indent shape of HTML documents. Indent shape is a simplified abstraction of HTML documents in which tandem repeated waves indicate the potential repeated patterns to be detected. By identifying the tandem repeated waves efficiently with a horizontal line scanning along an indent shape, the repeated patterns in the documents can be recognized, from which the lists of the target Web page can be extracted. Extensive experiments show that our approach achieves better performance and efficiency compared with existing approaches.