Abstract
The Web offers a corpus of over 100 million tables [6], but the meaning of each table is rarely explicit from the table itself. Header rows exist in few cases, and even when they do, the attribute names are typically useless. We describe a system that attempts to recover the semantics of tables by enriching each table with additional annotations. These annotations facilitate operations such as searching for tables and finding related tables.
To recover the semantics of tables, we leverage a database of class labels and relationships automatically extracted from the Web. This database has very wide coverage, but is also noisy. We attach a class label to a column if a sufficient number of the values in the column are identified with that label in the database, and analogously for binary relationships. We describe a formal model for reasoning about when we have seen sufficient evidence for a label, and show that it performs substantially better than a simple majority scheme. We describe a set of experiments that illustrate the utility of the recovered semantics for table search, and show that search based on them performs substantially better than previous approaches. In addition, we characterize what fraction of tables on the Web can be annotated using our approach.
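The column-labeling idea can be illustrated with a minimal sketch. It implements only a simple threshold vote, the kind of baseline the abstract contrasts with the formal model, and not the formal model itself. The function name `label_column`, the `min_fraction` parameter, and the toy `ISA_DB` dictionary are assumptions made for illustration; none of them come from the paper.

```python
from collections import Counter


def label_column(values, isa_db, min_fraction=0.5):
    """Return class labels supported by at least min_fraction of the column's values.

    isa_db is a hypothetical stand-in for the extracted class-label database:
    it maps a normalized cell value to the set of labels seen for that value.
    """
    if not values:
        return []
    counts = Counter()
    for value in values:
        # Each cell votes at most once for each label it is associated with.
        for label in isa_db.get(value.strip().lower(), set()):
            counts[label] += 1
    needed = min_fraction * len(values)
    # Rank by vote count (descending), breaking ties alphabetically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [label for label, n in ranked if n >= needed]


# Toy, made-up database entries (illustrative only).
ISA_DB = {
    "paris": {"city", "capital"},
    "london": {"city", "capital"},
    "chicago": {"city"},
    "lima": {"city", "capital"},
}

column = ["Paris", "London", "Chicago", "Unknown Place"]
print(label_column(column, ISA_DB))        # ['city', 'capital']
print(label_column(column, ISA_DB, 0.75))  # ['city']
```

A fixed threshold like this treats every database entry as equally reliable, which is a poor fit for a noisy database; that limitation is presumably what motivates reasoning formally about when the evidence for a label is sufficient.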
References
- M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni. Open Information Extraction from the Web. In IJCAI, pp. 2670--2676, 2007.
- M. Banko and O. Etzioni. The Tradeoffs Between Open and Traditional Relation Extraction. In ACL, pp. 28--36, 2008.
- B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In COLT, pp. 144--152, 1992.
- T. Brants. TnT---A Statistical Part of Speech Tagger. In ANLP, pp. 224--231, 2000.
- M. Cafarella, A. Halevy, and N. Khoussainova. Data Integration for the Relational Web. PVLDB, 2(1):1090--1101, 2009.
- M. Cafarella, A. Halevy, D. Wang, E. Wu, and Y. Zhang. WebTables: Exploring the Power of Tables on the Web. PVLDB, 1(1):538--549, 2008.
- M. J. Cafarella, A. Halevy, D. Z. Wang, E. Wu, and Y. Zhang. Uncovering the Relational Web. In WebDB, 2008.
- D. Carmel, H. Roitman, and N. Zwerding. Enhancing Cluster Labeling Using Wikipedia. In SIGIR, pp. 139--146, 2009.
- D. Cutting, D. Karger, and J. Pedersen. Constant Interaction-Time Scatter/Gather Browsing of Very Large Document Collections. In SIGIR, pp. 126--134, 1993.
- D. Downey, O. Etzioni, and S. Soderland. A Probabilistic Model of Redundancy in Information Extraction. In IJCAI, pp. 1034--1041, 2005.
- H. Elmeleegy, J. Madhavan, and A. Halevy. Harvesting Relational Tables from Lists on the Web. PVLDB, 2:1078--1089, 2009.
- R. Gupta and S. Sarawagi. Answering Table Augmentation Queries from Unstructured Lists on the Web. PVLDB, 2(1):289--300, 2009.
- M. Hearst. Automatic Acquisition of Hyponyms from Large Text Corpora. In COLING, pp. 539--545, 1992.
- P. Ipeirotis and A. Marian, editors. DBRank, 2010.
- Z. G. Ives, C. A. Knoblock, S. Minton, M. Jacob, P. P. Talukdar, R. Tuchinda, J. L. Ambite, M. Muslea, and C. Gazen. Interactive Data Integration through Smart Copy & Paste. In CIDR, 2009.
- N. Kushmerick, D. Weld, and R. Doorenbos. Wrapper Induction for Information Extraction. In IJCAI, pp. 729--737, 1997.
- G. Limaye, S. Sarawagi, and S. Chakrabarti. Annotating and Searching Web Tables Using Entities, Types and Relationships. In VLDB, pp. 1338--1347, 2010.
- D. Lin and X. Wu. Phrase Clustering for Discriminative Learning. In ACL-IJCNLP, pp. 1030--1038, 2009.
- T. M. Mitchell. Machine Learning. McGraw-Hill, New York, 1997.
- M. Paşca. The Role of Queries in Ranking Labeled Instances Extracted from Text. In COLING, pp. 955--962, 2010.
- M. Paşca and B. Van Durme. Weakly-Supervised Acquisition of Open-Domain Classes and Class Attributes from Web Documents and Query Logs. In ACL, pp. 19--27, 2008.
- P. Pantel and M. Pennacchiotti. Espresso: Leveraging Generic Patterns for Automatically Harvesting Semantic Relations. In COLING-ACL, pp. 113--120, 2006.
- M. Pennacchiotti and P. Pantel. Entity Extraction via Ensemble Semantics. In EMNLP, pp. 238--247, 2009.
- S. Ponzetto and R. Navigli. Large-Scale Taxonomy Mapping for Restructuring and Integrating Wikipedia. In IJCAI, pp. 2083--2088, 2009.
- R. Snow, D. Jurafsky, and A. Ng. Semantic Taxonomy Induction from Heterogenous Evidence. In COLING-ACL, pp. 801--808, 2006.
- F. Suchanek, G. Kasneci, and G. Weikum. YAGO: a Core of Semantic Knowledge Unifying WordNet and Wikipedia. In WWW, pp. 697--706, 2007.
- P. Talukdar, J. Reisinger, M. Paşca, D. Ravichandran, R. Bhagat, and F. Pereira. Weakly-Supervised Acquisition of Labeled Class Instances using Graph Random Walks. In EMNLP, pp. 582--590, 2008.
- P. Treeratpituk and J. Callan. Automatically Labeling Hierarchical Clusters. In DGO, pp. 167--176, 2006.
- R. Wang and W. Cohen. Iterative Set Expansion of Named Entities Using the Web. In ICDM, pp. 1091--1096, 2008.
- R. Wang and W. Cohen. Automatic Set Instance Extraction using the Web. In ACL-IJCNLP, pp. 441--449, 2009.
- F. Wu and D. Weld. Automatically Refining the Wikipedia Infobox Ontology. In WWW, pp. 635--644, 2008.
Index Terms
- Recovering semantics of tables on the web