Mining tables from large scale HTML texts

Authors:
Hsin-Hsi Chen, National Taiwan University, Taipei, Taiwan, R.O.C.
Shih-Chung Tsai, National Taiwan University, Taipei, Taiwan, R.O.C.
Jin-He Tsai, National Taiwan University, Taipei, Taiwan, R.O.C.

COLING '00: Proceedings of the 18th conference on Computational linguistics - Volume 1, July 2000, Pages 166–172
https://doi.org/10.3115/990820.990845

Published: 31 July 2000

ABSTRACT

Tables are a very common presentation scheme, but few papers address table extraction in text data mining. This paper focuses on mining tables from large-scale HTML texts. Table filtering, recognition, interpretation, and presentation are discussed. Heuristic rules and cell similarities are employed to identify tables. The F-measure of table recognition is 86.50%. We also propose an algorithm to capture attribute-value relationships among table cells. Finally, more structured data is extracted and presented.
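
To make the notion of attribute-value relationships among table cells concrete, the following is a minimal Python sketch. It is not the algorithm proposed in the paper, which the abstract does not describe. It assumes the simplest possible layout: a flat HTML table with no nested tables or spanning cells, whose first row holds the attribute names. The names `TableGrid`, `attribute_value_pairs`, and the sample table are hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser


class TableGrid(HTMLParser):
    """Collect the text of every <td>/<th> cell, row by row.

    Assumption: flat tables only (no nesting, no rowspan/colspan).
    """

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being filled, or None
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # discard any cell left open by malformed markup


def attribute_value_pairs(html):
    """Pair each body cell with the header cell in the same column.

    Assumption (not from the paper): the first row is the attribute row.
    """
    parser = TableGrid()
    parser.feed(html)
    parser.close()
    if len(parser.rows) < 2:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical sample input, for illustration only.
sample = (
    "<table>"
    "<tr><th>City</th><th>Fare</th></tr>"
    "<tr><td>Taipei</td><td>120</td></tr>"
    "<tr><td>Tainan</td><td>95</td></tr>"
    "</table>"
)
records = attribute_value_pairs(sample)
```

Each entry of `records` maps an attribute name to the value found in that column. Real pages often use tables for layout or place attributes in the first column instead, which is presumably why a recognition step such as the one the abstract mentions would be needed before any pairing like this.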



Published in
COLING '00: Proceedings of the 18th conference on Computational linguistics - Volume 1
July 2000
616 pages
ISBN: 155860717X
Program Chair: Martin Kay
Publisher: Association for Computational Linguistics, United States

Acceptance Rates
Overall Acceptance Rate: 1,537 of 1,537 submissions, 100%