TEXUS: A Task-based Approach for Table Extraction and Understanding

Authors:
Roya Rastan, University of New South Wales, Sydney, Australia
Hye-Young Paik, University of New South Wales, Sydney, Australia
John Shepherd, University of New South Wales, Sydney, Australia

DocEng '15: Proceedings of the 2015 ACM Symposium on Document Engineering, September 2015, Pages 25–34. https://doi.org/10.1145/2682571.2797069

Published: 08 September 2015

ABSTRACT

In this paper, we propose a precise, comprehensive model of table processing that aims to remedy some of the problems in how table processing is discussed in the literature. The model targets application-independent, end-to-end table processing, and thus encompasses a large subset of the work in the area. The model can aid the design of table processing systems (we provide an example of such a system), serve as a reference framework for evaluating the performance of table processing systems, and help clarify terminological differences in the table processing literature.

Index Terms

TEXUS: A Task-based Approach for Table Extraction and Understanding
1. Social and professional topics
  1. Professional topics
    1. Management of computing and information systems
      1. Information system economics

Published in
DocEng '15: Proceedings of the 2015 ACM Symposium on Document Engineering
September 2015
248 pages
ISBN: 9781450333078
DOI: 10.1145/2682571
General Chair: Christine Vanoirbeek, EPFL, Switzerland
Program Chair: Pierre Genevès, CNRS, France
Copyright © 2015 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
end-to-end table processing
table extraction
table understanding
task-based approach
Qualifiers
- research-article
Acceptance Rates
DocEng '15 paper acceptance rate: 11 of 31 submissions, 35%. Overall acceptance rate: 178 of 537 submissions, 33%.