research-article

InfoGather: entity augmentation and attribute discovery by holistic matching with web tables

Authors:
Mohamed Yakout, Purdue University, West Lafayette, IN, USA
Kris Ganjam, Microsoft Research, Redmond, WA, USA
Kaushik Chakrabarti, Microsoft Research, Redmond, WA, USA
Surajit Chaudhuri, Microsoft Research, Redmond, WA, USA

SIGMOD '12: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, May 2012, Pages 97–108
https://doi.org/10.1145/2213836.2213848

Published: 20 May 2012

ABSTRACT

The Web contains a vast corpus of HTML tables, specifically entity-attribute tables. We present three core operations, namely entity augmentation by attribute name, entity augmentation by example, and attribute discovery, that are useful for "information gathering" tasks (e.g., researching products or stocks). We propose to use the web table corpus to perform them automatically. We require the operations to have high precision and coverage, to have fast (ideally interactive) response times, and to be applicable to any domain of entities. The naive approach, which attempts to match the user input directly with the web tables, suffers from poor precision and coverage.

Our key insight is that we can achieve much higher precision and coverage by considering indirectly matching tables in addition to the directly matching ones. The challenge is to be robust to spuriously matched tables: we address it by developing a holistic matching framework based on topic-sensitive PageRank and an augmentation framework that aggregates predictions from multiple matched tables. We propose a novel architecture that leverages preprocessing in MapReduce to achieve extremely fast response times at query time. Our experiments on real-life datasets and 573M web tables show that our approach has (i) significantly higher precision and coverage and (ii) four orders of magnitude faster response times compared with the state-of-the-art approach.
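
The idea of scoring indirectly matching tables and then aggregating their predictions can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how the table graph is built, how edges are weighted, or how predictions are combined. The sketch assumes a hypothetical table-similarity graph (`edges`), treats the directly matching tables as teleport seeds for a personalized (topic-sensitive) PageRank, and picks the value with the highest total score. The table names, camera data, and the helper functions `personalized_pagerank` and `augment` are all invented for illustration.

```python
from collections import defaultdict


def personalized_pagerank(edges, seeds, damping=0.85, iterations=50):
    """Power iteration for topic-sensitive PageRank.

    edges: node -> {neighbor: weight}, a hypothetical table-similarity graph.
    seeds: node -> teleport weight, here the directly matching tables.
    """
    nodes = set(edges) | {v for nbrs in edges.values() for v in nbrs} | set(seeds)
    total_seed = sum(seeds.values())
    teleport = {n: seeds.get(n, 0.0) / total_seed for n in nodes}
    scores = dict(teleport)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for u in nodes:
            out = edges.get(u, {})
            out_weight = sum(out.values())
            if out_weight == 0.0:
                # Dangling node: return its mass to the seed tables.
                for n in nodes:
                    nxt[n] += damping * scores[u] * teleport[n]
                continue
            for v, w in out.items():
                nxt[v] += damping * scores[u] * w / out_weight
        scores = nxt
    return scores


def augment(entity, tables, scores):
    """Aggregate predicted values across tables, weighting each by its score."""
    votes = defaultdict(float)
    for name, rows in tables.items():
        if entity in rows:
            votes[rows[entity]] += scores.get(name, 0.0)
    return max(votes, key=votes.get) if votes else None


# Hypothetical toy corpus: each table maps a camera model to a brand.
tables = {
    "T1": {"S80": "Nikon", "D3100": "Nikon"},  # direct match, lacks "G12"
    "T2": {"D3100": "Nikon", "G12": "Canon"},  # indirect match via T1
    "T3": {"G12": "Sony"},                     # spurious, unconnected table
}
edges = {"T1": {"T2": 1.0}, "T2": {"T1": 1.0}}  # assumed similarity edges
seeds = {"T1": 1.0}                             # only T1 matches the query

scores = personalized_pagerank(edges, seeds)
print(augment("G12", tables, scores))  # prints Canon
```

In this toy setup the direct match T1 cannot answer the query for "G12" at all, the indirectly matching T2 receives score through its edge to T1, and the unconnected T3 receives none, so its conflicting value is outvoted.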

Index Terms

InfoGather: entity augmentation and attribute discovery by holistic matching with web tables
1. Human-centered computing
  1. Collaborative and social computing
    1. Collaborative and social computing systems and tools
2. Information systems
  1. World Wide Web

Published in
SIGMOD '12: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
May 2012
886 pages
ISBN: 9781450312479
DOI: 10.1145/2213836
General Chairs: K. Selçuk Candan (Arizona State University), Yi Chen (Arizona State University), Richard Snodgrass (University of Arizona)
Program Chair: Luis Gravano (Columbia University)
Publications Chair: Ariel Fuxman (Microsoft Research)
Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
augmentation
data integration
page rank
web application
web tables
Acceptance Rates
SIGMOD '12 Paper Acceptance Rate: 48 of 289 submissions, 17%
Overall Acceptance Rate: 785 of 4,003 submissions, 20%