skip to main content
10.1145/1559845.1559882acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

Robust web extraction: an approach based on a probabilistic tree-edit model

Published:29 June 2009Publication History

ABSTRACT

On script-generated web sites, many documents share common HTML tree structure, allowing wrappers to effectively extract information of interest. Of course, the scripts and thus the tree structure evolve over time, causing wrappers to break repeatedly, and resulting in a high cost of maintaining wrappers. In this paper, we explore a novel approach: we use temporal snapshots of web pages to develop a tree-edit model of HTML, and use this model to improve wrapper construction. We view the changes to the tree structure as suppositions of a series of edit operations: deleting nodes, inserting nodes and substituting labels of nodes. The tree structures evolve by choosing these edit operations stochastically.

Our model is attractive in that the probability that a source tree has evolved into a target tree can be estimated efficiently--in quadratic time in the size of the trees--making it a potentially useful tool for a variety of tree-evolution problems. We give an algorithm to learn the probabilistic model from training examples consisting of pairs of trees, and apply this algorithm to collections of web-page snapshots to derive HTML-specific tree edit models. Finally, we describe a novel wrapper-construction framework that takes the tree-edit model into account, and compare the quality of resulting wrappers to that of traditional wrappers on synthetic and real HTML document examples.

References

  1. Mari Abe and Masahiro Hori. Robust pointing by xpath language: Authoring support and empirical evaluation. Applications and the Internet, IEEE/IPSJ International Symposium on, 0:156, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Tobias Anton. Xpath-wrapper induction by generating tree traversal patterns. In LWA, pages 126--133, 2005.Google ScholarGoogle Scholar
  3. Internet archive: http://www.archive.org/.Google ScholarGoogle Scholar
  4. Robert Baumgartner, Sergio Flesca, and Georg Gottlob. Visual web information extraction with lixto. In VLDB, pages 119--128, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Marc Bernard, Amaury Habrard, and Marc Sebban. Learning stochastic tree edit distance. In ECML, pages 42--53, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Philip Bille. A survey on tree edit distance and related problems. Theor. Comput. Sci., 337(1--3):217--239, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Laurent Boyer, Amaury Habrard, and Marc Sebban. Learning metrics between tree structured data: Application to image recognition. In ECML, pages 54--66, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Boris Chidlovskii, Bruno Roustant, and Marc Brette. Documentum eci self-repairing wrappers: performance analysis. In SIGMOD '06: Proceedings of the 2006 ACM SIGMOD international conference on Management of data, pages 708--717, New York, NY, USA, 2006. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. William W. Cohen, Matthew Hurst, and Lee S. Jensen. A flexible learning system for wrapping tables and lists in html documents. In WWW '02: Proceedings of the 11th international conference on World Wide Web, pages 232--241, New York, NY, USA, 2002. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Valter Crescenzi, Giansalvatore Mecca, and Paolo Merialdo. Roadrunner: Towards automatic data extraction from large web sites. In VLDB, pages 109--118, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. J. Graehl and K. Knight. Training tree transducers. In HLT-NAACL, pages 105--112, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Wei Han, David Buttler, and Calton Pu. Wrapping web data into XML. SIGMOD Record, 30(3):33--38, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Chun--Nan Hsu and Ming--Tzung Dung. Generating finite--state transducers for semi--structured data extraction from the web. Information Systems, 23(8):521---538, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Internet movie database: http://www.imdb.com/.Google ScholarGoogle Scholar
  15. Kevin Knight and JonathaGraehl. An overview of probabilistic tree transducers for natural language processing. In Proceedings of the CICLing, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Marek Kowalkiewicz, Tomasz Kaczmarek, and Witold Abramowicz. Myportal: robust extraction and aggregation of web content. In VLDB '06: Proceedings of the 32nd international conference on Very large data bases, pages 1219--1222. VLDB Endowment, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Marek Kowalkiewicz, Maria E. Orlowska, Tomasz Kaczmarek, and Witold Abramowicz. Robust web content extraction. In WWW '06: Proceedings of the 15th international conference on World Wide Web, pages 887---888, New York, NY, USA, 2006. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Nickolas Kushmerick, Daniel S. Weld, and Robert B. Doorenbos. Wrapper induction for information extraction. In IJCAI, pages 729--737, 1997.Google ScholarGoogle Scholar
  19. Kristina Lerman, Steven N. Minton, and Craig A. Knoblock. Wrapper maintenance: A machine learning approach. Journal of Artificial Intelligence Research, 18:2003, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. A. McCallum, K. Bellare, and P. Pereira. A conditional random field for discriminatively-trained finite-state string edit distance. In UAI, 2005.Google ScholarGoogle ScholarCross RefCross Ref
  21. Robert McCann, Bedoor AlShebli, Quoc Le, Hoa Nguyen, Long Vu, and AnHai Doan. Mapping maintenance for data integration systems. In VLDB, pages 1018--1029, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. I. Muslea, S. Minton, and C. Knoblock. Stalker: Learning extraction rules for semistructured, 1998.Google ScholarGoogle Scholar
  23. Jussi Myllymaki and Jared Jackson. Robust web data extraction with xml path expressions. Technical report, IBM Research Report RJ 10245, May 2002.Google ScholarGoogle Scholar
  24. Andrew Ng and Michael Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes. In Advances in Neural Information Processing Systems 2001 (NIPS 14), 2002.Google ScholarGoogle Scholar
  25. Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer, 1999.Google ScholarGoogle ScholarCross RefCross Ref
  26. Jose Oncina and Marc Sebban. Learning stochastic edit distance: Application in handwritten character recognition. Pattern Recogn., 39(9):1575--1587, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. J. S. Provan and M. O. Ball. The complexity of counting cuts and of computing the probability that a graph is connected. SIAM J. Comput., 12(4):777--788, 1983.Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Eric Sven Ristad and Peter N. Yianilos. Learning string edit distance. In ICML, pages 287--295, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Arnaud Sahuguet and Fabien Azavant. Building light-weight wrappers for legacy web data-sources using w4f. In VLDB, pages 738--741, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. http://sourceforge.net/projects/jtidy.Google ScholarGoogle Scholar
  31. L. Valiant. The complexity of enumeration and reliability problems. SIAM J. Comput., 8:410--421, 1979.Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. K. Zhang and D. Shasha. Simple fast algorithms for the editing distance between trees and related problems. SIAM J. Comput., 18(6):1245--1262, 1989. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Robust web extraction: an approach based on a probabilistic tree-edit model

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      SIGMOD '09: Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
      June 2009
      1168 pages
      ISBN:9781605585512
      DOI:10.1145/1559845

      Copyright © 2009 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 29 June 2009

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article

      Acceptance Rates

      Overall Acceptance Rate785of4,003submissions,20%

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader