ABSTRACT
On script-generated web sites, many documents share common HTML tree structure, allowing wrappers to effectively extract information of interest. Of course, the scripts and thus the tree structure evolve over time, causing wrappers to break repeatedly, and resulting in a high cost of maintaining wrappers. In this paper, we explore a novel approach: we use temporal snapshots of web pages to develop a tree-edit model of HTML, and use this model to improve wrapper construction. We view the changes to the tree structure as suppositions of a series of edit operations: deleting nodes, inserting nodes and substituting labels of nodes. The tree structures evolve by choosing these edit operations stochastically.
Our model is attractive in that the probability that a source tree has evolved into a target tree can be estimated efficiently--in quadratic time in the size of the trees--making it a potentially useful tool for a variety of tree-evolution problems. We give an algorithm to learn the probabilistic model from training examples consisting of pairs of trees, and apply this algorithm to collections of web-page snapshots to derive HTML-specific tree edit models. Finally, we describe a novel wrapper-construction framework that takes the tree-edit model into account, and compare the quality of resulting wrappers to that of traditional wrappers on synthetic and real HTML document examples.
- Mari Abe and Masahiro Hori. Robust pointing by xpath language: Authoring support and empirical evaluation. Applications and the Internet, IEEE/IPSJ International Symposium on, 0:156, 2003. Google ScholarDigital Library
- Tobias Anton. Xpath-wrapper induction by generating tree traversal patterns. In LWA, pages 126--133, 2005.Google Scholar
- Internet archive: http://www.archive.org/.Google Scholar
- Robert Baumgartner, Sergio Flesca, and Georg Gottlob. Visual web information extraction with lixto. In VLDB, pages 119--128, 2001. Google ScholarDigital Library
- Marc Bernard, Amaury Habrard, and Marc Sebban. Learning stochastic tree edit distance. In ECML, pages 42--53, 2006. Google ScholarDigital Library
- Philip Bille. A survey on tree edit distance and related problems. Theor. Comput. Sci., 337(1--3):217--239, 2005. Google ScholarDigital Library
- Laurent Boyer, Amaury Habrard, and Marc Sebban. Learning metrics between tree structured data: Application to image recognition. In ECML, pages 54--66, 2007. Google ScholarDigital Library
- Boris Chidlovskii, Bruno Roustant, and Marc Brette. Documentum eci self-repairing wrappers: performance analysis. In SIGMOD '06: Proceedings of the 2006 ACM SIGMOD international conference on Management of data, pages 708--717, New York, NY, USA, 2006. ACM. Google ScholarDigital Library
- William W. Cohen, Matthew Hurst, and Lee S. Jensen. A flexible learning system for wrapping tables and lists in html documents. In WWW '02: Proceedings of the 11th international conference on World Wide Web, pages 232--241, New York, NY, USA, 2002. ACM. Google ScholarDigital Library
- Valter Crescenzi, Giansalvatore Mecca, and Paolo Merialdo. Roadrunner: Towards automatic data extraction from large web sites. In VLDB, pages 109--118, 2001. Google ScholarDigital Library
- J. Graehl and K. Knight. Training tree transducers. In HLT-NAACL, pages 105--112, 2004. Google ScholarDigital Library
- Wei Han, David Buttler, and Calton Pu. Wrapping web data into XML. SIGMOD Record, 30(3):33--38, 2001. Google ScholarDigital Library
- Chun--Nan Hsu and Ming--Tzung Dung. Generating finite--state transducers for semi--structured data extraction from the web. Information Systems, 23(8):521---538, 1998. Google ScholarDigital Library
- Internet movie database: http://www.imdb.com/.Google Scholar
- Kevin Knight and JonathaGraehl. An overview of probabilistic tree transducers for natural language processing. In Proceedings of the CICLing, 2005. Google ScholarDigital Library
- Marek Kowalkiewicz, Tomasz Kaczmarek, and Witold Abramowicz. Myportal: robust extraction and aggregation of web content. In VLDB '06: Proceedings of the 32nd international conference on Very large data bases, pages 1219--1222. VLDB Endowment, 2006. Google ScholarDigital Library
- Marek Kowalkiewicz, Maria E. Orlowska, Tomasz Kaczmarek, and Witold Abramowicz. Robust web content extraction. In WWW '06: Proceedings of the 15th international conference on World Wide Web, pages 887---888, New York, NY, USA, 2006. ACM. Google ScholarDigital Library
- Nickolas Kushmerick, Daniel S. Weld, and Robert B. Doorenbos. Wrapper induction for information extraction. In IJCAI, pages 729--737, 1997.Google Scholar
- Kristina Lerman, Steven N. Minton, and Craig A. Knoblock. Wrapper maintenance: A machine learning approach. Journal of Artificial Intelligence Research, 18:2003, 2003. Google ScholarDigital Library
- A. McCallum, K. Bellare, and P. Pereira. A conditional random field for discriminatively-trained finite-state string edit distance. In UAI, 2005.Google ScholarCross Ref
- Robert McCann, Bedoor AlShebli, Quoc Le, Hoa Nguyen, Long Vu, and AnHai Doan. Mapping maintenance for data integration systems. In VLDB, pages 1018--1029, 2005. Google ScholarDigital Library
- I. Muslea, S. Minton, and C. Knoblock. Stalker: Learning extraction rules for semistructured, 1998.Google Scholar
- Jussi Myllymaki and Jared Jackson. Robust web data extraction with xml path expressions. Technical report, IBM Research Report RJ 10245, May 2002.Google Scholar
- Andrew Ng and Michael Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes. In Advances in Neural Information Processing Systems 2001 (NIPS 14), 2002.Google Scholar
- Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer, 1999.Google ScholarCross Ref
- Jose Oncina and Marc Sebban. Learning stochastic edit distance: Application in handwritten character recognition. Pattern Recogn., 39(9):1575--1587, 2006. Google ScholarDigital Library
- J. S. Provan and M. O. Ball. The complexity of counting cuts and of computing the probability that a graph is connected. SIAM J. Comput., 12(4):777--788, 1983.Google ScholarDigital Library
- Eric Sven Ristad and Peter N. Yianilos. Learning string edit distance. In ICML, pages 287--295, 1997. Google ScholarDigital Library
- Arnaud Sahuguet and Fabien Azavant. Building light-weight wrappers for legacy web data-sources using w4f. In VLDB, pages 738--741, 1999. Google ScholarDigital Library
- http://sourceforge.net/projects/jtidy.Google Scholar
- L. Valiant. The complexity of enumeration and reliability problems. SIAM J. Comput., 8:410--421, 1979.Google ScholarDigital Library
- K. Zhang and D. Shasha. Simple fast algorithms for the editing distance between trees and related problems. SIAM J. Comput., 18(6):1245--1262, 1989. Google ScholarDigital Library
Index Terms
- Robust web extraction: an approach based on a probabilistic tree-edit model
Recommendations
Efficient algorithms for descendant-only tree pattern queries
Tree pattern matching is a fundamental problem that has a wide range of applications in Web data management, XML processing, and selective data dissemination. In this paper we develop efficient algorithms for the tree homeomorphism problem, i.e., the ...
Intelligent crawling of web applications for web archiving
WWW '12 Companion: Proceedings of the 21st International Conference on World Wide WebThe steady growth of the World Wide Web raises challenges regarding the preservation of meaningful Web data. Tools used currently by Web archivists blindly crawl and store Web pages found while crawling, disregarding the kind of Web site currently ...
Web Data Extraction Based on Simple Tree Matching
ICIE '10: Proceedings of the 2010 WASE International Conference on Information Engineering - Volume 02The information on the Internet has been grown exponentially, the Internet users are overwhelmed by these information. How to automatically extract useful information from the relevant pages, so as to provide a convenient and rapid information query ...
Comments