Abstract
With the current explosion of information on the World Wide Web (WWW), a wealth of information on many different subjects has become available online. Numerous sources contain information that can be classified as semi-structured. At present, however, the only way to access this information is by browsing individual pages; web documents cannot be queried in a database-like fashion based on their underlying structure. Database-like querying of semi-structured WWW sources becomes possible once wrappers are built around these sources. We present an approach for semi-automatically generating such wrappers. The key idea is to exploit the formatting information in pages from the source to hypothesize the underlying structure of a page. From this structure the system generates a wrapper that facilitates querying of the source and possibly integrating it with other sources. We demonstrate the ease with which wrappers can be built for a number of Internet sources in different domains using our implemented wrapper generation toolkit.
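As a rough illustration of the key idea, the sketch below guesses a page's structure from formatting cues alone. It is not the toolkit described in the abstract: the choice of heading and bold tags as label markers, the class `StructureGuesser`, and the function `hypothesize_structure` are all assumptions made for this example, using only Python's standard `html.parser` module.

```python
from html.parser import HTMLParser

# Assumption for this sketch: text inside these tags is a section or field label.
LABEL_TAGS = {"h1", "h2", "h3", "h4", "b", "strong"}


class StructureGuesser(HTMLParser):
    """Collects (label, value) pairs using formatting cues only."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # nesting depth inside label tags
        self.label_parts = []   # text fragments of the label being read
        self.value_parts = []   # text fragments following the current label
        self.current = None     # label whose value is being collected
        self.fields = []        # hypothesized (label, value) pairs

    def handle_starttag(self, tag, attrs):
        if tag in LABEL_TAGS:
            if self.depth == 0:
                # A new label starts: close the previous field first.
                self.flush()
                self.label_parts = []
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in LABEL_TAGS and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.current = " ".join(self.label_parts).strip(" :")

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.depth > 0:
            self.label_parts.append(text)
        elif self.current is not None:
            self.value_parts.append(text)

    def flush(self):
        """Store the field collected so far, if any."""
        if self.current:
            self.fields.append((self.current, " ".join(self.value_parts)))
        self.current = None
        self.value_parts = []


def hypothesize_structure(html):
    """Return a list of (label, value) pairs guessed from the markup."""
    guesser = StructureGuesser()
    guesser.feed(html)
    guesser.close()   # forces any buffered trailing text to be processed
    guesser.flush()
    return guesser.fields


page = "<h2>Geography</h2><b>Area:</b> 100 sq km<br><b>Climate:</b> temperate"
fields = hypothesize_structure(page)
```

A wrapper built on such a hypothesis could then answer a query like "return the value of the Area field" by looking up the label in `fields`. In a semi-automatic setting a person would presumably review and correct the guessed structure before the wrapper is used.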
Index Terms
- Wrapper generation for semi-structured Internet sources