Recovering semantics of tables on the web

Authors:
Petros Venetis (Stanford University)
Alon Halevy (Google Inc.)
Jayant Madhavan (Google Inc.)
Marius Paşca (Google Inc.)
Warren Shen (Google Inc.)
Fei Wu (Google Inc.)
Gengxin Miao (UC Santa Barbara)
Chung Wu (Google Inc.)

Proceedings of the VLDB Endowment, Volume 4, Issue 9, pp. 528–538. https://doi.org/10.14778/2002938.2002939

Published: 01 June 2011

Abstract

The Web offers a corpus of over 100 million tables [6], but the meaning of each table is rarely explicit from the table itself. Header rows exist in few cases and even when they do, the attribute names are typically useless. We describe a system that attempts to recover the semantics of tables by enriching the table with additional annotations. Our annotations facilitate operations such as searching for tables and finding related tables.

To recover semantics of tables, we leverage a database of class labels and relationships automatically extracted from the Web. The database of classes and relationships has very wide coverage, but is also noisy. We attach a class label to a column if a sufficient number of the values in the column are identified with that label in the database of class labels, and analogously for binary relationships. We describe a formal model for reasoning about when we have seen sufficient evidence for a label, and show that it performs substantially better than a simple majority scheme. We describe a set of experiments that illustrate the utility of the recovered semantics for table search and show that it performs substantially better than previous approaches. In addition, we characterize what fraction of tables on the Web can be annotated using our approach.
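
As a rough illustration of the simple majority scheme that the abstract names as a baseline, the sketch below attaches a class label to a column when more than a chosen fraction of its cells carry that label in a class-label database. It is not the formal model described in the paper. The function name `majority_labels`, the dictionary `TOY_ISA_DB`, the lowercase normalization, and the 0.5 threshold are all assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical toy class-label database: instance -> class labels.
# The real database is extracted from the Web and is far larger and noisier.
TOY_ISA_DB = {
    "paris": ["city", "capital"],
    "london": ["city", "capital"],
    "chicago": ["city"],
    "hilton": ["hotel"],
}


def majority_labels(column_values, isa_db, threshold=0.5):
    """Return (label, fraction) pairs for labels covering more than
    `threshold` of the cells in the column, best first."""
    if not column_values:
        return []
    counts = Counter()
    for value in column_values:
        # Normalize the cell text (an assumption) and count each label
        # at most once per cell.
        for label in set(isa_db.get(value.strip().lower(), ())):
            counts[label] += 1
    total = len(column_values)
    winners = [
        (label, count / total)
        for label, count in counts.items()
        if count / total > threshold
    ]
    # Sort by descending fraction, then alphabetically for stable output.
    return sorted(winners, key=lambda pair: (-pair[1], pair[0]))


column = ["Paris", "London", "Chicago", "Acme"]
print(majority_labels(column, TOY_ISA_DB))
```

In this toy column, "city" covers three of four cells and passes the threshold, while "capital" covers exactly half and does not. A fixed cutoff like this is sensitive to noise and coverage gaps in the database, which is the kind of weakness the abstract says its formal model improves on.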

References

[1] M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni. Open Information Extraction from the Web. In IJCAI, pp. 2670–2676, 2007.
[2] M. Banko and O. Etzioni. The Tradeoffs Between Open and Traditional Relation Extraction. In ACL, pp. 28–36, 2008.
[3] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In COLT, pp. 144–152, 1992.
[4] T. Brants. TnT - A Statistical Part of Speech Tagger. In ANLP, pp. 224–231, 2000.
[5] M. Cafarella, A. Halevy, and N. Khoussainova. Data Integration for the Relational Web. PVLDB, 2(1):1090–1101, 2009.
[6] M. Cafarella, A. Halevy, D. Wang, E. Wu, and Y. Zhang. WebTables: Exploring the Power of Tables on the Web. PVLDB, 1(1):538–549, 2008.
[7] M. J. Cafarella, A. Halevy, D. Z. Wang, E. Wu, and Y. Zhang. Uncovering the Relational Web. In WebDB, 2008.
[8] D. Carmel, H. Roitman, and N. Zwerding. Enhancing Cluster Labeling Using Wikipedia. In SIGIR, pp. 139–146, 2009.
[9] D. Cutting, D. Karger, and J. Pedersen. Constant Interaction-Time Scatter/Gather Browsing of Very Large Document Collections. In SIGIR, pp. 126–134, 1993.
[10] D. Downey, O. Etzioni, and S. Soderland. A Probabilistic Model of Redundancy in Information Extraction. In IJCAI, pp. 1034–1041, 2005.
[11] H. Elmeleegy, J. Madhavan, and A. Halevy. Harvesting Relational Tables from Lists on the Web. PVLDB, 2:1078–1089, 2009.
[12] R. Gupta and S. Sarawagi. Answering Table Augmentation Queries from Unstructured Lists on the Web. PVLDB, 2(1):289–300, 2009.
[13] M. Hearst. Automatic Acquisition of Hyponyms from Large Text Corpora. In COLING, pp. 539–545, 1992.
[14] P. Ipeirotis and A. Marian, editors. DBRank, 2010.
[15] Z. G. Ives, C. A. Knoblock, S. Minton, M. Jacob, P. P. Talukdar, R. Tuchinda, J. L. Ambite, M. Muslea, and C. Gazen. Interactive Data Integration through Smart Copy & Paste. In CIDR, 2009.
[16] N. Kushmerick, D. Weld, and R. Doorenbos. Wrapper Induction for Information Extraction. In IJCAI, pp. 729–737, 1997.
[17] G. Limaye, S. Sarawagi, and S. Chakrabarti. Annotating and Searching Web Tables Using Entities, Types and Relationships. In VLDB, pp. 1338–1347, 2010.
[18] D. Lin and X. Wu. Phrase Clustering for Discriminative Learning. In ACL-IJCNLP, pp. 1030–1038, 2009.
[19] T. M. Mitchell. Machine Learning. McGraw-Hill, New York, 1997.
[20] M. Paşca. The Role of Queries in Ranking Labeled Instances Extracted from Text. In COLING, pp. 955–962, 2010.
[21] M. Paşca and B. Van Durme. Weakly-Supervised Acquisition of Open-Domain Classes and Class Attributes from Web Documents and Query Logs. In ACL, pp. 19–27, 2008.
[22] P. Pantel and M. Pennacchiotti. Espresso: Leveraging Generic Patterns for Automatically Harvesting Semantic Relations. In COLING-ACL, pp. 113–120, 2006.
[23] M. Pennacchiotti and P. Pantel. Entity Extraction via Ensemble Semantics. In EMNLP, pp. 238–247, 2009.
[24] S. Ponzetto and R. Navigli. Large-Scale Taxonomy Mapping for Restructuring and Integrating Wikipedia. In IJCAI, pp. 2083–2088, 2009.
[25] R. Snow, D. Jurafsky, and A. Ng. Semantic Taxonomy Induction from Heterogenous Evidence. In COLING-ACL, pp. 801–808, 2006.
[26] F. Suchanek, G. Kasneci, and G. Weikum. YAGO: a Core of Semantic Knowledge Unifying WordNet and Wikipedia. In WWW, pp. 697–706, 2007.
[27] P. Talukdar, J. Reisinger, M. Paşca, D. Ravichandran, R. Bhagat, and F. Pereira. Weakly-Supervised Acquisition of Labeled Class Instances using Graph Random Walks. In EMNLP, pp. 582–590, 2008.
[28] P. Treeratpituk and J. Callan. Automatically Labeling Hierarchical Clusters. In DGO, pp. 167–176, 2006.
[29] R. Wang and W. Cohen. Iterative Set Expansion of Named Entities Using the Web. In ICDM, pp. 1091–1096, 2008.
[30] R. Wang and W. Cohen. Automatic Set Instance Extraction using the Web. In ACL-IJCNLP, pp. 441–449, 2009.
[31] F. Wu and D. Weld. Automatically Refining the Wikipedia Infobox Ontology. In WWW, pp. 635–644, 2008.

Index Terms

Information systems

Published in

Proceedings of the VLDB Endowment, Volume 4, Issue 9
June 2011
70 pages
ISSN: 2150-8097
Publisher: VLDB Endowment
