poster

Web-scale table census and classification

Authors:
Eric Crestan, Yahoo! Labs, Sunnyvale, CA, USA
Patrick Pantel, Microsoft Research, Redmond, WA, USA

WSDM '11: Proceedings of the fourth ACM international conference on Web search and data mining
February 2011
Pages 545–554
https://doi.org/10.1145/1935826.1935904

Published: 09 February 2011

ABSTRACT

We report on a census of the types of HTML tables on the Web according to a fine-grained classification taxonomy describing the semantics that they express. For each relational table type, we describe open challenges for extracting semantic triples (i.e., knowledge) from such tables. We also present TabEx, a supervised framework for web-scale HTML table classification, and apply it to the task of classifying HTML tables into our taxonomy. Through a large-scale experimental analysis over a crawl of the Web, we show empirically that TabEx's classification accuracy significantly exceeds that of several baselines. We present a detailed feature analysis and outline the most salient features for each table type.
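
The abstract does not list TabEx's features or its learner, so the following is only a rough sketch of what feature extraction for supervised table classification might look like. The feature names (`num_rows`, `max_cols`, `header_cell_ratio`, `numeric_cell_ratio`) and the helper `table_features` are hypothetical choices made for illustration and are not taken from the paper. The sketch uses only the Python standard library and ignores nested tables.

```python
from html.parser import HTMLParser


class TableFeatureParser(HTMLParser):
    """Collects (tag, text) pairs for every cell in every <tr> of a snippet.

    Nested tables are not handled; this is an illustrative simplification.
    """

    def __init__(self):
        super().__init__()
        self.rows = []          # one list of (tag, text) pairs per <tr>
        self._cell_tag = None   # "td" or "th" while inside a cell
        self._cell_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self._cell_tag = tag
            self._cell_text = []

    def handle_data(self, data):
        if self._cell_tag is not None:
            self._cell_text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell_tag == tag:
            text = "".join(self._cell_text).strip()
            self.rows[-1].append((tag, text))
            self._cell_tag = None


def is_number(text):
    """Return True if the cell text parses as a number (commas allowed)."""
    try:
        float(text.replace(",", ""))
        return True
    except ValueError:
        return False


def table_features(html):
    """Compute a few hypothetical structural features for an HTML table."""
    parser = TableFeatureParser()
    parser.feed(html)
    cells = [cell for row in parser.rows for cell in row]
    n_cells = len(cells)
    header_cells = sum(tag == "th" for tag, _ in cells)
    numeric_cells = sum(is_number(text) for _, text in cells)
    return {
        "num_rows": len(parser.rows),
        "max_cols": max((len(row) for row in parser.rows), default=0),
        "header_cell_ratio": header_cells / n_cells if n_cells else 0.0,
        "numeric_cell_ratio": numeric_cells / n_cells if n_cells else 0.0,
    }


# Made-up sample table, used only to exercise the function.
sample = """
<table>
  <tr><th>Item</th><th>Count</th></tr>
  <tr><td>apples</td><td>1,200</td></tr>
  <tr><td>pears</td><td>35</td></tr>
</table>
"""
features = table_features(sample)
```

A feature dictionary of this kind could be passed to any supervised learner together with a table-type label; which learner and which features work best is precisely what the paper's experiments and feature analysis address.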

Index Terms

Web-scale table census and classification
1. Computing methodologies
  1. Machine learning
    1. Learning settings
2. Information systems
  1. Information retrieval
    1. Document representation

Published in
WSDM '11: Proceedings of the fourth ACM international conference on Web search and data mining
February 2011
870 pages
ISBN:9781450304931
DOI:10.1145/1935826
General Chair: Irwin King, CUHK, Hong Kong
Program Chairs: Wolfgang Nejdl, L3S and University of Hannover, Germany; Hang Li, Microsoft Research Asia, China
Copyright © 2011 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
classification
information extraction
structured data
web tables
Qualifiers
- poster
Acceptance Rates
WSDM '11 Paper Acceptance Rate: 83 of 372 submissions, 22%
Overall Acceptance Rate: 498 of 2,863 submissions, 17%