Recently it was shown that Inductive Logic Programming can be successfully applied to data extraction from HTML. However, the approach suffers from two problems: its computational complexity is high with respect to both the number of nodes in the target document and the arity of the extracted tuples. In this note we address the first problem by proposing an efficient path generalization algorithm for learning rules that extract single information items. The presentation is supplemented with a description of a sample experiment.
- A New Path Generalization Algorithm for HTML Wrapper Induction
- Springer Berlin Heidelberg