Hierarchical Wrapper Induction for Semistructured Information Sources

Autonomous Agents and Multi-Agent Systems

Abstract

With the tremendous amount of information that becomes available on the Web on a daily basis, the ability to quickly develop information agents has become a crucial problem. A vital component of any Web-based information agent is a set of wrappers that can extract the relevant data from semistructured information sources. Our novel approach to wrapper induction is based on the idea of hierarchical information extraction, which turns the hard problem of extracting data from an arbitrarily complex document into a series of simpler extraction tasks. We introduce an inductive algorithm, STALKER, that generates high accuracy extraction rules based on user-labeled training examples. Labeling the training data represents the major bottleneck in using wrapper induction techniques, and our experimental results show that STALKER requires up to two orders of magnitude fewer examples than other algorithms. Furthermore, STALKER can wrap information sources that could not be wrapped by existing inductive techniques.

Cite this article

Muslea, I., Minton, S. & Knoblock, C.A. Hierarchical Wrapper Induction for Semistructured Information Sources. Autonomous Agents and Multi-Agent Systems 4, 93–114 (2001). https://doi.org/10.1023/A:1010022931168

  • DOI: https://doi.org/10.1023/A:1010022931168
