ABSTRACT
A program that makes an existing website look like a database is called a wrapper. Wrapper learning is the problem of learning website wrappers from examples. We present a wrapper-learning system called WL2 that can exploit several different representations of a document. Examples of such different representations include DOM-level and token-level representations, as well as two-dimensional geometric views of the rendered page (for tabular data) and representations of the visual appearance of text as it will be rendered. Additionally, the learning system is modular, and can be easily adapted to new domains and tasks. The learning system described is part of an "industrial-strength" wrapper management system that is in active use at WhizBang Labs. Controlled experiments show that the learner has broader coverage and a faster learning rate than earlier wrapper-learning systems.
Index Terms
- A flexible learning system for wrapping tables and lists in HTML documents