DOI: 10.1145/2682571.2797069
research-article

TEXUS: A Task-based Approach for Table Extraction and Understanding

Published: 08 September 2015

ABSTRACT

In this paper, we propose a precise, comprehensive model of table processing that aims to remedy some of the problems in how table processing is discussed in the literature. The model targets application-independent, end-to-end table processing, and thus encompasses a large subset of the work in the area. The model can aid the design of table processing systems (we provide an example of such a system), serve as a reference framework for evaluating the performance of such systems, and help clarify terminological differences in the table processing literature.

    • Published in

      DocEng '15: Proceedings of the 2015 ACM Symposium on Document Engineering
      September 2015
      248 pages
      ISBN: 9781450333078
      DOI: 10.1145/2682571

      Copyright © 2015 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      DocEng '15 Paper Acceptance Rate: 11 of 31 submissions, 35%. Overall Acceptance Rate: 178 of 537 submissions, 33%.
