research-article

Recovering semantics of tables on the web

Published: 01 June 2011

Abstract

The Web offers a corpus of over 100 million tables [6], but the meaning of each table is rarely explicit from the table itself. Header rows exist in few cases and even when they do, the attribute names are typically useless. We describe a system that attempts to recover the semantics of tables by enriching the table with additional annotations. Our annotations facilitate operations such as searching for tables and finding related tables.

To recover semantics of tables, we leverage a database of class labels and relationships automatically extracted from the Web. The database of classes and relationships has very wide coverage, but is also noisy. We attach a class label to a column if a sufficient number of the values in the column are identified with that label in the database of class labels, and analogously for binary relationships. We describe a formal model for reasoning about when we have seen sufficient evidence for a label, and show that it performs substantially better than a simple majority scheme. We describe a set of experiments that illustrate the utility of the recovered semantics for table search and show that it performs substantially better than previous approaches. In addition, we characterize what fraction of tables on the Web can be annotated using our approach.
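
As a rough illustration of the simple majority scheme that the abstract uses as a point of comparison, the sketch below attaches a class label to a column when more than half of its values carry that label in a class-label database. This is not the formal model described in the paper, whose details are not given here. The function name `majority_labels`, the `isa_db` dictionary, the `threshold` parameter, and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def majority_labels(column_values, isa_db, threshold=0.5):
    """Return class labels held by more than `threshold` of the column's values.

    `isa_db` is a hypothetical stand-in for a database of class labels:
    a dict mapping a normalized value to the set of labels seen for it.
    """
    if not column_values:
        return []
    counts = Counter()
    for value in column_values:
        # Each cell contributes at most one vote per label.
        for label in isa_db.get(value.strip().lower(), set()):
            counts[label] += 1
    n = len(column_values)
    # Rank by descending support, breaking ties alphabetically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [label for label, c in ranked if c / n > threshold]


# Toy, made-up data; a real class-label database would be large and noisy.
isa_db = {
    "paris": {"city", "capital"},
    "london": {"city", "capital"},
    "berlin": {"city", "capital"},
    "chicago": {"city"},
}
column = ["Paris", "London", "Berlin", "Chicago", "Atlantis"]
labels = majority_labels(column, isa_db)
print(labels)
```

A fixed cutoff like this treats every label and every database entry as equally reliable, which is a plausible reason a noisy database would call for a more careful model of when the evidence for a label is sufficient.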

References

  1. M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni. Open Information Extraction from the Web. In IJCAI, pp. 2670--2676, 2007.
  2. M. Banko and O. Etzioni. The Tradeoffs Between Open and Traditional Relation Extraction. In ACL, pp. 28--36, 2008.
  3. B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In COLT, pp. 144--152, 1992.
  4. T. Brants. TnT---A Statistical Part of Speech Tagger. In ANLP, pp. 224--231, 2000.
  5. M. Cafarella, A. Halevy, and N. Khoussainova. Data Integration for the Relational Web. PVLDB, 2(1):1090--1101, 2009.
  6. M. Cafarella, A. Halevy, D. Wang, E. Wu, and Y. Zhang. WebTables: Exploring the Power of Tables on the Web. PVLDB, 1(1):538--549, 2008.
  7. M. J. Cafarella, A. Halevy, D. Z. Wang, E. Wu, and Y. Zhang. Uncovering the Relational Web. In WebDB, 2008.
  8. D. Carmel, H. Roitman, and N. Zwerding. Enhancing Cluster Labeling Using Wikipedia. In SIGIR, pp. 139--146, 2009.
  9. D. Cutting, D. Karger, and J. Pedersen. Constant Interaction-Time Scatter/Gather Browsing of Very Large Document Collections. In SIGIR, pp. 126--134, 1993.
  10. D. Downey, O. Etzioni, and S. Soderland. A Probabilistic Model of Redundancy in Information Extraction. In IJCAI, pp. 1034--1041, 2005.
  11. H. Elmeleegy, J. Madhavan, and A. Halevy. Harvesting Relational Tables from Lists on the Web. PVLDB, 2:1078--1089, 2009.
  12. R. Gupta and S. Sarawagi. Answering Table Augmentation Queries from Unstructured Lists on the Web. PVLDB, 2(1):289--300, 2009.
  13. M. Hearst. Automatic Acquisition of Hyponyms from Large Text Corpora. In COLING, pp. 539--545, 1992.
  14. P. Ipeirotis and A. Marian, editors. DBRank, 2010.
  15. Z. G. Ives, C. A. Knoblock, S. Minton, M. Jacob, P. P. Talukdar, R. Tuchinda, J. L. Ambite, M. Muslea, and C. Gazen. Interactive Data Integration through Smart Copy & Paste. In CIDR, 2009.
  16. N. Kushmerick, D. Weld, and R. Doorenbos. Wrapper Induction for Information Extraction. In IJCAI, pp. 729--737, 1997.
  17. G. Limaye, S. Sarawagi, and S. Chakrabarti. Annotating and Searching Web Tables Using Entities, Types and Relationships. In VLDB, pp. 1338--1347, 2010.
  18. D. Lin and X. Wu. Phrase Clustering for Discriminative Learning. In ACL-IJCNLP, pp. 1030--1038, 2009.
  19. T. M. Mitchell. Machine Learning. McGraw-Hill, New York, 1997.
  20. M. Paşca. The Role of Queries in Ranking Labeled Instances Extracted from Text. In COLING, pp. 955--962, 2010.
  21. M. Paşca and B. Van Durme. Weakly-Supervised Acquisition of Open-Domain Classes and Class Attributes from Web Documents and Query Logs. In ACL, pp. 19--27, 2008.
  22. P. Pantel and M. Pennacchiotti. Espresso: Leveraging Generic Patterns for Automatically Harvesting Semantic Relations. In COLING-ACL, pp. 113--120, 2006.
  23. M. Pennacchiotti and P. Pantel. Entity Extraction via Ensemble Semantics. In EMNLP, pp. 238--247, 2009.
  24. S. Ponzetto and R. Navigli. Large-Scale Taxonomy Mapping for Restructuring and Integrating Wikipedia. In IJCAI, pp. 2083--2088, 2009.
  25. R. Snow, D. Jurafsky, and A. Ng. Semantic Taxonomy Induction from Heterogenous Evidence. In COLING-ACL, pp. 801--808, 2006.
  26. F. Suchanek, G. Kasneci, and G. Weikum. YAGO: a Core of Semantic Knowledge Unifying WordNet and Wikipedia. In WWW, pp. 697--706, 2007.
  27. P. Talukdar, J. Reisinger, M. Paşca, D. Ravichandran, R. Bhagat, and F. Pereira. Weakly-Supervised Acquisition of Labeled Class Instances using Graph Random Walks. In EMNLP, pp. 582--590, 2008.
  28. P. Treeratpituk and J. Callan. Automatically Labeling Hierarchical Clusters. In DGO, pp. 167--176, 2006.
  29. R. Wang and W. Cohen. Iterative Set Expansion of Named Entities Using the Web. In ICDM, pp. 1091--1096, 2008.
  30. R. Wang and W. Cohen. Automatic Set Instance Extraction using the Web. In ACL-IJCNLP, pp. 441--449, 2009.
  31. F. Wu and D. Weld. Automatically Refining the Wikipedia Infobox Ontology. In WWW, pp. 635--644, 2008.

Published in

Proceedings of the VLDB Endowment, Volume 4, Issue 9
June 2011
70 pages

Publisher: VLDB Endowment
