Semi-Automatic extraction of product attributes from URLs is an important issue for comparison-shopping agents. In this paper we examine a novel decision tree framework for extracting product attributes. The core induction algorithmic framework consists of three main stages. In the first stage, a large set of regular expression-based patterns are induced by employing a longest common subsequence algorithm. In the second stage we filter the initial set and leave only the most useful patterns. In the last stage we represent the extraction problem (in which the domain values are not known in advance) as a classification problem and employ an ensemble of decision trees. An empirical study performed on a real-world extraction tasks illustrates the capability of the proposed framework.
Swipe to navigate through the chapters of this book
Please log in to get access to this content
To get access to this content you need the following product:
- A Decision Tree Framework for Semi-Automatic Extraction of Product Attributes from the Web
- Springer Berlin Heidelberg
- Sequence number