ABSTRACT
Tables are a very common presentation scheme, but few papers address table extraction in text data mining. This paper focuses on mining tables from large-scale HTML texts. It discusses table filtering, recognition, interpretation, and presentation. Heuristic rules and cell similarities are used to identify tables; the F-measure of table recognition is 86.50%. We also propose an algorithm to capture attribute-value relationships among table cells. Finally, more structured data is extracted and presented.
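As a rough illustration of the table-filtering step, the sketch below collects the cells of each HTML table and discards tables that look like layout scaffolding rather than data. It is not the paper's method: the class `TableCollector`, the function `looks_like_data_table`, and the thresholds (`min_rows`, `min_cols`, and the 50% non-empty-cell ratio) are hypothetical choices made for this example. It also assumes well-formed, non-nested tables with explicit closing tags.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect the cell text of each <table> as a list of rows (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tables = []   # finished tables: each is a list of rows of cell strings
        self._rows = None  # rows of the table currently being read
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._rows = []
        elif tag == "tr" and self._rows is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None and self._rows is not None:
            self._rows.append(self._row)
            self._row = None
        elif tag == "table" and self._rows is not None:
            self.tables.append(self._rows)
            self._rows = None


def looks_like_data_table(rows, min_rows=2, min_cols=2):
    """Assumed filter: enough rows and columns, and at least half the cells non-empty."""
    if len(rows) < min_rows:
        return False
    if max(len(row) for row in rows) < min_cols:
        return False
    cells = [cell for row in rows for cell in row]
    return sum(1 for cell in cells if cell) / len(cells) >= 0.5


html = (
    "<table><tr><th>Name</th><th>Price</th></tr>"
    "<tr><td>Tea</td><td>3</td></tr></table>"
    "<table><tr><td>layout only</td></tr></table>"
)

collector = TableCollector()
collector.feed(html)
data_tables = [t for t in collector.tables if looks_like_data_table(t)]
print(data_tables)
```

In this example the single-cell table is dropped and the two-by-two table is kept. A fuller system would add further rules and some measure of similarity between cells, as the abstract describes.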
Mining tables from large scale HTML texts