DOI: 10.3115/990820.990845

Mining tables from large scale HTML texts

Published: 31 July 2000

ABSTRACT

Tables are a very common presentation scheme, but few papers address table extraction in text data mining. This paper focuses on mining tables from large-scale HTML texts. Table filtering, recognition, interpretation, and presentation are discussed. Heuristic rules and cell similarities are employed to identify tables. The F-measure of table recognition is 86.50%. We also propose an algorithm to capture attribute-value relationships among table cells. Finally, more structured data is extracted and presented.

Published in

COLING '00: Proceedings of the 18th conference on Computational linguistics - Volume 1
July 2000
616 pages
ISBN: 155860717X

Publisher: Association for Computational Linguistics, United States
