research-article

Robust web extraction: an approach based on a probabilistic tree-edit model

Authors:
Nilesh Dalvi

Yahoo! Research, Santa Clara, CA, USA

Yahoo! Research, Santa Clara, CA, USA
View Profile

,
Philip Bohannon

Yahoo! Research, Santa Clara, CA, USA

Yahoo! Research, Santa Clara, CA, USA
View Profile

,
Fei Sha

University of Southern California, Los Angeles, CA, USA

University of Southern California, Los Angeles, CA, USA
View Profile

SIGMOD '09: Proceedings of the 2009 ACM SIGMOD International Conference on Management of dataJune 2009Pages 335–348https://doi.org/10.1145/1559845.1559882

Published:29 June 2009Publication History

SIGMOD '09: Proceedings of the 2009 ACM SIGMOD International Conference on Management of data

Pages 335–348

ABSTRACT

On script-generated web sites, many documents share common HTML tree structure, allowing wrappers to effectively extract information of interest. Of course, the scripts and thus the tree structure evolve over time, causing wrappers to break repeatedly, and resulting in a high cost of maintaining wrappers. In this paper, we explore a novel approach: we use temporal snapshots of web pages to develop a tree-edit model of HTML, and use this model to improve wrapper construction. We view the changes to the tree structure as suppositions of a series of edit operations: deleting nodes, inserting nodes and substituting labels of nodes. The tree structures evolve by choosing these edit operations stochastically.

Our model is attractive in that the probability that a source tree has evolved into a target tree can be estimated efficiently--in quadratic time in the size of the trees--making it a potentially useful tool for a variety of tree-evolution problems. We give an algorithm to learn the probabilistic model from training examples consisting of pairs of trees, and apply this algorithm to collections of web-page snapshots to derive HTML-specific tree edit models. Finally, we describe a novel wrapper-construction framework that takes the tree-edit model into account, and compare the quality of resulting wrappers to that of traditional wrappers on synthetic and real HTML document examples.

References

Mari Abe and Masahiro Hori. Robust pointing by xpath language: Authoring support and empirical evaluation. Applications and the Internet, IEEE/IPSJ International Symposium on, 0:156, 2003. Google ScholarDigital Library
Tobias Anton. Xpath-wrapper induction by generating tree traversal patterns. In LWA, pages 126--133, 2005.Google Scholar
Internet archive: http://www.archive.org/.Google Scholar
Robert Baumgartner, Sergio Flesca, and Georg Gottlob. Visual web information extraction with lixto. In VLDB, pages 119--128, 2001. Google ScholarDigital Library
Marc Bernard, Amaury Habrard, and Marc Sebban. Learning stochastic tree edit distance. In ECML, pages 42--53, 2006. Google ScholarDigital Library
Philip Bille. A survey on tree edit distance and related problems. Theor. Comput. Sci., 337(1--3):217--239, 2005. Google ScholarDigital Library
Laurent Boyer, Amaury Habrard, and Marc Sebban. Learning metrics between tree structured data: Application to image recognition. In ECML, pages 54--66, 2007. Google ScholarDigital Library
Boris Chidlovskii, Bruno Roustant, and Marc Brette. Documentum eci self-repairing wrappers: performance analysis. In SIGMOD '06: Proceedings of the 2006 ACM SIGMOD international conference on Management of data, pages 708--717, New York, NY, USA, 2006. ACM. Google ScholarDigital Library
William W. Cohen, Matthew Hurst, and Lee S. Jensen. A flexible learning system for wrapping tables and lists in html documents. In WWW '02: Proceedings of the 11th international conference on World Wide Web, pages 232--241, New York, NY, USA, 2002. ACM. Google ScholarDigital Library
Valter Crescenzi, Giansalvatore Mecca, and Paolo Merialdo. Roadrunner: Towards automatic data extraction from large web sites. In VLDB, pages 109--118, 2001. Google ScholarDigital Library
J. Graehl and K. Knight. Training tree transducers. In HLT-NAACL, pages 105--112, 2004. Google ScholarDigital Library
Wei Han, David Buttler, and Calton Pu. Wrapping web data into XML. SIGMOD Record, 30(3):33--38, 2001. Google ScholarDigital Library
Chun--Nan Hsu and Ming--Tzung Dung. Generating finite--state transducers for semi--structured data extraction from the web. Information Systems, 23(8):521---538, 1998. Google ScholarDigital Library
Internet movie database: http://www.imdb.com/.Google Scholar
Kevin Knight and JonathaGraehl. An overview of probabilistic tree transducers for natural language processing. In Proceedings of the CICLing, 2005. Google ScholarDigital Library
Marek Kowalkiewicz, Tomasz Kaczmarek, and Witold Abramowicz. Myportal: robust extraction and aggregation of web content. In VLDB '06: Proceedings of the 32nd international conference on Very large data bases, pages 1219--1222. VLDB Endowment, 2006. Google ScholarDigital Library
Marek Kowalkiewicz, Maria E. Orlowska, Tomasz Kaczmarek, and Witold Abramowicz. Robust web content extraction. In WWW '06: Proceedings of the 15th international conference on World Wide Web, pages 887---888, New York, NY, USA, 2006. ACM. Google ScholarDigital Library
Nickolas Kushmerick, Daniel S. Weld, and Robert B. Doorenbos. Wrapper induction for information extraction. In IJCAI, pages 729--737, 1997.Google Scholar
Kristina Lerman, Steven N. Minton, and Craig A. Knoblock. Wrapper maintenance: A machine learning approach. Journal of Artificial Intelligence Research, 18:2003, 2003. Google ScholarDigital Library
A. McCallum, K. Bellare, and P. Pereira. A conditional random field for discriminatively-trained finite-state string edit distance. In UAI, 2005.Google ScholarCross Ref
Robert McCann, Bedoor AlShebli, Quoc Le, Hoa Nguyen, Long Vu, and AnHai Doan. Mapping maintenance for data integration systems. In VLDB, pages 1018--1029, 2005. Google ScholarDigital Library
I. Muslea, S. Minton, and C. Knoblock. Stalker: Learning extraction rules for semistructured, 1998.Google Scholar
Jussi Myllymaki and Jared Jackson. Robust web data extraction with xml path expressions. Technical report, IBM Research Report RJ 10245, May 2002.Google Scholar
Andrew Ng and Michael Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes. In Advances in Neural Information Processing Systems 2001 (NIPS 14), 2002.Google Scholar
Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer, 1999.Google ScholarCross Ref
Jose Oncina and Marc Sebban. Learning stochastic edit distance: Application in handwritten character recognition. Pattern Recogn., 39(9):1575--1587, 2006. Google ScholarDigital Library
J. S. Provan and M. O. Ball. The complexity of counting cuts and of computing the probability that a graph is connected. SIAM J. Comput., 12(4):777--788, 1983.Google ScholarDigital Library
Eric Sven Ristad and Peter N. Yianilos. Learning string edit distance. In ICML, pages 287--295, 1997. Google ScholarDigital Library
Arnaud Sahuguet and Fabien Azavant. Building light-weight wrappers for legacy web data-sources using w4f. In VLDB, pages 738--741, 1999. Google ScholarDigital Library
http://sourceforge.net/projects/jtidy.Google Scholar
L. Valiant. The complexity of enumeration and reliability problems. SIAM J. Comput., 8:410--421, 1979.Google ScholarDigital Library
K. Zhang and D. Shasha. Simple fast algorithms for the editing distance between trees and related problems. SIAM J. Comput., 18(6):1245--1262, 1989. Google ScholarDigital Library

Index Terms

Robust web extraction: an approach based on a probabilistic tree-edit model
1. Information systems
  1. Information retrieval

Recommendations

Efficient algorithms for descendant-only tree pattern queries

Tree pattern matching is a fundamental problem that has a wide range of applications in Web data management, XML processing, and selective data dissemination. In this paper we develop efficient algorithms for the tree homeomorphism problem, i.e., the ...
Read More
Intelligent crawling of web applications for web archiving
WWW '12 Companion: Proceedings of the 21st International Conference on World Wide Web

The steady growth of the World Wide Web raises challenges regarding the preservation of meaningful Web data. Tools used currently by Web archivists blindly crawl and store Web pages found while crawling, disregarding the kind of Web site currently ...
Read More
Web Data Extraction Based on Simple Tree Matching
ICIE '10: Proceedings of the 2010 WASE International Conference on Information Engineering - Volume 02

The information on the Internet has been grown exponentially, the Internet users are overwhelmed by these information. How to automatically extract useful information from the relevant pages, so as to provide a convenient and rapid information query ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGMOD '09: Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
June 2009
1168 pages
ISBN:9781605585512
DOI:10.1145/1559845
Editors:
Carsten Binnig,
Benoit Dageville,
General Chairs:
Uğur Çetintemel
Brown University, USA
,
Stan Zdonik
Brown University, USA
,
Program Chair:
Donald Kossmann
ETH Zurich, Switzerland
Copyright © 2009 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 29 June 2009
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
probabilistic tree-edit model
wrappers
xpath
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate785of4,003submissions,20%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 49
  Total Citations
  View Citations
- 1,062
  Total Downloads
- Downloads (Last 12 months)9
- Downloads (Last 6 weeks)3
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Robust web extraction: an approach based on a probabilistic tree-edit model

SIGMOD '09: Proceedings of the 2009 ACM SIGMOD International Conference on Management of data

ABSTRACT

References

Cited By

Index Terms

Recommendations

Efficient algorithms for descendant-only tree pattern queries

Intelligent crawling of web applications for web archiving

Web Data Extraction Based on Simple Tree Matching

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Robust web extraction: an approach based on a probabilistic tree-edit model

SIGMOD '09: Proceedings of the 2009 ACM SIGMOD International Conference on Management of data

ABSTRACT

References

Cited By

Index Terms

Recommendations

Efficient algorithms for descendant-only tree pattern queries

Intelligent crawling of web applications for web archiving

Web Data Extraction Based on Simple Tree Matching

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media