ABSTRACT
Information extraction from tables in web pages is a challenging problem because table formats are diverse and attribute names vary in vocabulary. This paper presents a new approach to automated table extraction that exploits formatting cues in semi-structured HTML tables, learns lexical variants from training examples, and uses a vector space model to handle non-exact matches among labels. We conducted experiments with this method on a set of tables collected from 157 university web sites and obtained an information extraction performance of 91.4% in F1-measure, showing the effectiveness of combining structural table parsing with example-based label learning.
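As a rough illustration of how a vector space model can handle non-exact matches among labels, the sketch below compares attribute labels by the cosine similarity of their term-frequency vectors. It is not taken from the paper: the tokenizer, the raw term-frequency weighting, the 0.5 threshold, and the function names `tokenize`, `cosine_similarity`, and `best_match` are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(label):
    """Lowercase a label and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", label.lower())


def cosine_similarity(a, b):
    """Cosine similarity of two labels using raw term-frequency vectors."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    # An empty label has a zero vector; treat it as matching nothing.
    return dot / norm if norm else 0.0


def best_match(label, known_labels, threshold=0.5):
    """Return the most similar known label, or None if none reaches threshold.

    The threshold value is an illustrative assumption, not a tuned setting.
    """
    scored = [(cosine_similarity(label, k), k) for k in known_labels]
    score, match = max(scored, default=(0.0, None))
    return match if score >= threshold else None
```

With this sketch, a label such as "Number of full-time faculty" found in a table would be mapped to a known label "Full-time faculty" because they share three of their tokens, while a label with no shared tokens would be left unmatched.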