DOI: 10.1145/1935826.1935904
poster

Web-scale table census and classification

Published: 09 February 2011

ABSTRACT

We report on a census of the types of HTML tables on the Web according to a fine-grained classification taxonomy describing the semantics that they express. For each relational table type, we describe open challenges in extracting semantic triples, i.e., knowledge, from such tables. We also present TabEx, a supervised framework for web-scale HTML table classification, and apply it to the task of classifying HTML tables into our taxonomy. We show empirical evidence, through a large-scale experimental analysis over a crawl of the Web, that its classification accuracy significantly exceeds that of several baselines. We present a detailed feature analysis and outline the most salient features for each table type.
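
As a rough illustration of what supervised HTML table classification involves, the sketch below extracts a few structural features from a table and assigns the label of the nearest class centroid. The abstract does not specify TabEx's features or learning algorithm, so everything here is an assumption made for illustration: the four features, the nearest-centroid learner (a simple stand-in, not the authors' method), the label name "layout", and all function and variable names.

```python
from html.parser import HTMLParser


class TableFeatureParser(HTMLParser):
    """Collects simple structural counts from one HTML table (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.rows = 0
        self.cells = 0
        self.header_cells = 0
        self.numeric_cells = 0
        self.links = 0
        self._in_cell = False
        self._cell_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows += 1
        elif tag in ("td", "th"):
            self.cells += 1
            if tag == "th":
                self.header_cells += 1
            self._in_cell = True
            self._cell_text = []
        elif tag == "a":
            self.links += 1

    def handle_data(self, data):
        if self._in_cell:
            self._cell_text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._in_cell:
            # Treat a cell as numeric if its text parses as a number
            # once thousands separators are removed.
            text = "".join(self._cell_text).strip().replace(",", "")
            try:
                float(text)
                self.numeric_cells += 1
            except ValueError:
                pass
            self._in_cell = False


def extract_features(table_html):
    """Return [row count, header-cell ratio, numeric-cell ratio, links per cell]."""
    parser = TableFeatureParser()
    parser.feed(table_html)
    parser.close()
    cells = max(parser.cells, 1)  # avoid division by zero on empty tables
    return [
        float(parser.rows),
        parser.header_cells / cells,
        parser.numeric_cells / cells,
        parser.links / cells,
    ]


def train_centroids(examples):
    """examples: list of (table_html, label). Returns {label: mean feature vector}."""
    sums, counts = {}, {}
    for table_html, label in examples:
        feats = extract_features(table_html)
        if label not in sums:
            sums[label] = [0.0] * len(feats)
            counts[label] = 0
        sums[label] = [s + f for s, f in zip(sums[label], feats)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def classify(table_html, centroids):
    """Assign the label whose centroid is closest in squared Euclidean distance."""
    feats = extract_features(table_html)

    def distance(label):
        return sum((a - b) ** 2 for a, b in zip(feats, centroids[label]))

    return min(centroids, key=distance)


# Hypothetical hand-labeled training tables; the label names are assumptions.
RELATIONAL = (
    "<table><tr><th>City</th><th>Population</th></tr>"
    "<tr><td>Oslo</td><td>700,000</td></tr></table>"
)
LAYOUT = (
    "<table><tr><td><a href='/'>Home</a></td>"
    "<td><a href='/about'>About</a></td></tr></table>"
)
CENTROIDS = train_centroids([(RELATIONAL, "relational"), (LAYOUT, "layout")])
```

Calling `classify(some_table_html, CENTROIDS)` returns one of the training labels. A web-scale system would need far more labeled data, a richer feature set, and feature scaling, since the raw row count used here can dominate the distance.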

      • Published in

        WSDM '11: Proceedings of the fourth ACM international conference on Web search and data mining
        February 2011
        870 pages
        ISBN: 9781450304931
        DOI: 10.1145/1935826

        Copyright © 2011 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        WSDM '11 Paper Acceptance Rate: 83 of 372 submissions, 22%
        Overall Acceptance Rate: 498 of 2,863 submissions, 17%
