DOI: 10.1145/1007568.1007584
Article

Using the structure of Web sites for automatic segmentation of tables

Published: 13 June 2004

ABSTRACT

Many Web sites, especially those that dynamically generate HTML pages to display the results of a user's query, present information in the form of lists or tables. Current tools that allow applications to programmatically extract this information rely heavily on user input, often in the form of labeled extracted records. The sheer size and rate of growth of the Web make any solution that relies primarily on user input infeasible in the long term. Fortunately, many Web sites contain much explicit and implicit structure, both in layout and content, that we can exploit for the purpose of information extraction. This paper describes an approach to automatic extraction and segmentation of records from Web tables. Automatic methods do not require any user input, but rely solely on the layout and content of the Web source. Our approach relies on the common structure of many Web sites, which present information as a list or a table, with a link in each entry leading to a detail page containing additional information about that item. We describe two algorithms that use redundancies in the content of table and detail pages to aid in information extraction. The first algorithm encodes additional information provided by detail pages as constraints and finds the segmentation by solving a constraint satisfaction problem. The second algorithm uses probabilistic inference to find the record segmentation. We show how each approach can exploit the Web site structure in a general, domain-independent manner, and we demonstrate the effectiveness of each algorithm on a set of twelve Web sites.
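
To make the constraint-based idea concrete, here is a minimal sketch in Python. It is not the paper's formulation, which the abstract does not spell out. The function name `segment_table`, the toy data, and the two constraints (an extract may belong only to a record whose detail page contains it, and records appear in order as contiguous runs) are illustrative assumptions, and a plain backtracking search stands in for whatever solver the authors used.

```python
from typing import List, Optional, Set


def segment_table(extracts: List[str],
                  detail_pages: List[Set[str]]) -> Optional[List[int]]:
    """Assign each table extract to a record index, or return None.

    Illustrative constraints (assumptions, not the paper's encoding):
      1. An extract may be assigned only to a record whose detail
         page contains that extract.
      2. Records are contiguous and ordered: along the table the
         record index either stays the same or advances by one.
    """
    n_records = len(detail_pages)

    def search(pos: int, current: int) -> Optional[List[int]]:
        if pos == len(extracts):
            # A complete assignment must have reached the last record.
            return [] if current == n_records - 1 else None
        # Either stay in the current record or start the next one.
        for record in (current, current + 1):
            if record < 0 or record >= n_records:
                continue
            if extracts[pos] not in detail_pages[record]:
                continue
            rest = search(pos + 1, record)
            if rest is not None:
                return [record] + rest
        return None

    # Start "before" record 0 so the first extract must open record 0.
    return search(0, -1)


# Hypothetical toy data: two rows of a listing and their detail pages.
table_extracts = ["John Smith", "New York", "555-1234",
                  "Jane Smith", "New York", "555-9876"]
details = [{"John Smith", "New York", "555-1234"},
           {"Jane Smith", "New York", "555-9876"}]

print(segment_table(table_extracts, details))
```

In this toy input the string "New York" appears on both detail pages, so content alone cannot place it; the ordering constraint resolves the ambiguity. The search is exponential in the worst case, which is acceptable for a sketch but not for large tables.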

  • Published in

    SIGMOD '04: Proceedings of the 2004 ACM SIGMOD international conference on Management of data
    June 2004
    988 pages
    ISBN: 1581138598
    DOI: 10.1145/1007568

    Copyright © 2004 ACM

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Acceptance Rates

    Overall Acceptance Rate: 785 of 4,003 submissions, 20%
