Article

A flexible learning system for wrapping tables and lists in HTML documents

Authors:
William W. Cohen

WhizBang Labs, Pittsburgh, PA

Matthew Hurst

WhizBang Labs, Pittsburgh, PA

Lee S. Jensen

WhizBang Labs, Pittsburgh, PA


WWW '02: Proceedings of the 11th international conference on World Wide Web, May 2002, Pages 232–241, https://doi.org/10.1145/511446.511477

Published: 07 May 2002

ABSTRACT

A program that makes an existing website look like a database is called a wrapper. Wrapper learning is the problem of learning website wrappers from examples. We present a wrapper-learning system called WL² that can exploit several different representations of a document. Examples of such representations include DOM-level and token-level representations, as well as two-dimensional geometric views of the rendered page (for tabular data) and representations of the visual appearance of text as it will be rendered. Additionally, the learning system is modular and can be easily adapted to new domains and tasks. The learning system described is part of an "industrial-strength" wrapper management system that is in active use at WhizBang Labs. Controlled experiments show that the learner has broader coverage and a faster learning rate than earlier wrapper-learning systems.
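To make the notion of a wrapper concrete, the following is a minimal, hand-written sketch that exposes one HTML table as a list of database-like records. It is not WL² and involves no learning; it only illustrates the kind of program that a wrapper-learning system is meant to induce from examples. The names `TableWrapper`, `wrap`, and `PAGE` are hypothetical, and the sketch assumes a single, well-formed table whose first row is a header.

```python
from html.parser import HTMLParser


class TableWrapper(HTMLParser):
    """Hand-written wrapper: collects the cell text of every <tr> as one row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None outside <tr>
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only text inside a table cell is kept; everything else is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left unclosed in this row


def wrap(html):
    """Return the table's rows as dicts keyed by the header row (assumed first)."""
    parser = TableWrapper()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical page used only for illustration.
PAGE = """
<table>
  <tr><th>name</th><th>city</th></tr>
  <tr><td>Ann</td><td>Pittsburgh</td></tr>
  <tr><td>Bob</td><td>Edinburgh</td></tr>
</table>
"""

records = wrap(PAGE)
```

A hand-coded wrapper like this one is tied to a single page layout, which is the maintenance burden that motivates learning wrappers from examples instead.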

Index Terms

A flexible learning system for wrapping tables and lists in HTML documents
1. Computing methodologies
  1. Machine learning
2. Information systems
  1. Data management systems
    1. Database management system engines
  2. Information systems applications
    1. Data mining

Published in
WWW '02: Proceedings of the 11th international conference on World Wide Web
May 2002
754 pages
ISBN:1581134495
DOI:10.1145/511446
Conference Chairs:
David Lassner, University of Hawaii
Dave De Roure, University of Southampton
Arun Iyengar, IBM T.J. Watson Research Center
Copyright © 2002 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
canopy
learning
record linkage
reference matching
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%