ABSTRACT
We report on a census of the types of HTML tables on the Web, organized by a fine-grained classification taxonomy that describes the semantics the tables express. For each relational table type, we describe open challenges in extracting semantic triples, i.e., knowledge, from such tables. We also present TabEx, a supervised framework for web-scale HTML table classification, and apply it to the task of classifying HTML tables into our taxonomy. A large-scale experimental analysis over a crawl of the Web provides empirical evidence that TabEx's classification accuracy significantly exceeds that of several baselines. We present a detailed feature analysis and outline the most salient features for each table type.
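To make the idea of table features concrete, the following is a minimal sketch of how simple structural signals could be computed from a table's HTML before being passed to a supervised classifier. The abstract does not list the features TabEx uses, so the class `TableFeatureParser`, the function `table_features`, and every feature name here are hypothetical illustrations rather than the authors' method.

```python
from html.parser import HTMLParser


class TableFeatureParser(HTMLParser):
    """Collects simple structural counts for the outermost <table> in a snippet.

    Hypothetical illustration: these counts are assumptions, not TabEx features.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0          # current <table> nesting depth
        self.rows = 0           # <tr> elements of the outermost table
        self.data_cells = 0     # <td> elements of the outermost table
        self.header_cells = 0   # <th> elements of the outermost table
        self.links = 0          # <a> elements anywhere inside the table
        self.nested_tables = 0  # <table> elements nested inside the outer one

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
            if self.depth > 1:
                self.nested_tables += 1
            return
        if self.depth == 1:
            if tag == "tr":
                self.rows += 1
            elif tag == "td":
                self.data_cells += 1
            elif tag == "th":
                self.header_cells += 1
        if self.depth >= 1 and tag == "a":
            self.links += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1


def table_features(html):
    """Return a dict of structural features for one HTML table snippet."""
    parser = TableFeatureParser()
    parser.feed(html)
    parser.close()
    cells = parser.data_cells + parser.header_cells
    return {
        "rows": parser.rows,
        "cells": cells,
        "header_cell_ratio": parser.header_cells / cells if cells else 0.0,
        "links": parser.links,
        "nested_tables": parser.nested_tables,
    }


relational = (
    "<table><tr><th>City</th><th>Population</th></tr>"
    "<tr><td>Oslo</td><td>700000</td></tr></table>"
)
layout = (
    "<table><tr><td><table><tr><td><a href='/'>Home</a></td></tr></table>"
    "</td></tr></table>"
)
relational_features = table_features(relational)
layout_features = table_features(layout)
```

In this sketch, a table with header cells and no nesting looks different from a table that merely wraps navigation links in a nested table. A trained classifier could exploit differences of that kind, although which signals actually matter for each table type is an empirical question.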
Index Terms
- Web-scale table census and classification