DOI: 10.3115/1220355.1220497

Learning table extraction from examples

Published: 23 August 2004

ABSTRACT

Information extraction from tables in web pages is a challenging problem because table formats are diverse and attribute names vary in vocabulary. This paper presents a new approach to automated table extraction that exploits formatting cues in semi-structured HTML tables, learns lexical variants from training examples, and uses a vector space model to handle non-exact matches among labels. We evaluated the method on a set of tables collected from 157 university web sites and obtained an information extraction performance of 91.4% in F1-measure, which shows the effectiveness of combining structural table parsing with example-based label learning.
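
The abstract does not say how the vector space model is built, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes bag-of-words term-frequency vectors and cosine similarity, and it matches a table label against label variants seen in training examples. The names `label_vector`, `cosine`, `best_attribute`, the `threshold` value, and the sample labels are all hypothetical.

```python
import math
import re
from collections import Counter


def label_vector(label: str) -> Counter:
    """Bag-of-words term-frequency vector for a label (assumed representation)."""
    return Counter(re.findall(r"[a-z0-9]+", label.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def best_attribute(label, examples, threshold=0.3):
    """Return (attribute, score) for the closest training variant.

    `examples` maps an attribute name to the label variants observed for it
    in training tables. Returns (None, score) when no variant reaches the
    (hypothetical) similarity threshold.
    """
    query = label_vector(label)
    best, best_score = None, 0.0
    for attribute, variants in examples.items():
        for variant in variants:
            score = cosine(query, label_vector(variant))
            if score > best_score:
                best, best_score = attribute, score
    return (best, best_score) if best_score >= threshold else (None, best_score)


if __name__ == "__main__":
    # Invented training variants, for illustration only.
    training = {
        "tuition": ["Tuition and fees", "In-state tuition"],
        "enrollment": ["Total enrollment", "Number of students enrolled"],
    }
    attribute, score = best_attribute("Tuition fees (in-state)", training)
    print(attribute, round(score, 3))  # tuition 0.866
```

A label that is not an exact match for any training variant can still be assigned to an attribute when it shares enough terms with one; a label that shares no terms with any variant is left unassigned.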


  • Published in

    COLING '04: Proceedings of the 20th international conference on Computational Linguistics
    August 2004
    1411 pages

    Publisher

    Association for Computational Linguistics

    United States

    Qualifiers

    • Article

    Acceptance Rates

    COLING '04 Paper Acceptance Rate: 1,411 of 1,411 submissions, 100%
    Overall Acceptance Rate: 1,537 of 1,537 submissions, 100%
