DOI: 10.1145/511446.511477
Article

A flexible learning system for wrapping tables and lists in HTML documents

Published: 07 May 2002

ABSTRACT

A program that makes an existing website look like a database is called a wrapper. Wrapper learning is the problem of learning website wrappers from examples. We present a wrapper-learning system called WL2 that can exploit several different representations of a document. Examples of such different representations include DOM-level and token-level representations, as well as two-dimensional geometric views of the rendered page (for tabular data) and representations of the visual appearance of text as it will be rendered. Additionally, the learning system is modular, and can be easily adapted to new domains and tasks. The learning system described is part of an "industrial-strength" wrapper management system that is in active use at WhizBang Labs. Controlled experiments show that the learner has broader coverage and a faster learning rate than earlier wrapper-learning systems.
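To make the notion of a wrapper concrete, here is a minimal hand-written sketch in Python that exposes the rows of an HTML table as database-like tuples. It is not the WL2 system and not the authors' method: a wrapper learner would induce extraction rules like these from labeled examples instead of having them coded by hand. The class `TableWrapper`, the function `wrap`, and the sample page are all hypothetical and made up for illustration.

```python
from html.parser import HTMLParser


class TableWrapper(HTMLParser):
    """Hand-written wrapper: turns each <tr> into a tuple of cell texts."""

    def __init__(self):
        super().__init__()
        self.rows = []     # extracted records, one tuple per table row
        self._row = None   # cells of the current row, or None outside <tr>
        self._cell = None  # text pieces of the current cell, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Collect text only while inside a cell; inline tags such as <b>
        # are ignored, so their text is merged into the enclosing cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(tuple(self._row))
            self._row = None


def wrap(html):
    """Return the table rows found in `html` as a list of tuples."""
    parser = TableWrapper()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample page; the item names and prices are illustrative only.
PAGE = """
<table>
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Widget</td><td>4.00</td></tr>
  <tr><td>Large <b>blue</b> widget</td><td>7.50</td></tr>
</table>
"""

records = wrap(PAGE)
for record in records:
    print(record)
```

This sketch uses only a token-level view of the page (a stream of tags and text). The abstract's point is that a learner can also draw on DOM-level, geometric, and visual representations, which a single hand-coded rule set like this one does not capture.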


        • Published in

          WWW '02: Proceedings of the 11th international conference on World Wide Web
          May 2002
          754 pages
          ISBN: 1581134495
          DOI: 10.1145/511446

          Copyright © 2002 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States



          Qualifiers

          • Article

          Acceptance Rates

          Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%
