ABSTRACT
The Web contains a vast corpus of HTML tables, specifically entity-attribute tables. We present three core operations, namely entity augmentation by attribute name, entity augmentation by example, and attribute discovery, that are useful for "information gathering" tasks (e.g., researching products or stocks). We propose to use the web table corpus to perform them automatically. We require the operations to have high precision and coverage, to have fast (ideally interactive) response times, and to be applicable to arbitrary domains of entities. The naive approach, which attempts to match the user input directly against the web tables, suffers from poor precision and coverage.
Our key insight is that we can achieve much higher precision and coverage by considering indirectly matching tables in addition to the directly matching ones. The challenge is to be robust to spuriously matched tables: we address it by developing a holistic matching framework based on topic-sensitive PageRank and an augmentation framework that aggregates predictions from multiple matched tables. We propose a novel architecture that leverages preprocessing in MapReduce to achieve extremely fast response times at query time. Our experiments on real-life datasets and 573M web tables show that our approach has (i) significantly higher precision and coverage and (ii) four orders of magnitude faster response times compared with the state-of-the-art approach.
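To make the idea of indirect matching concrete, the following is a minimal sketch, not the system described in the paper. It assumes a small graph whose nodes are web tables and whose weighted edges are pairwise table-match scores, runs PageRank with restarts to the directly matching tables (the topic-sensitive variant), and then predicts an attribute value by a score-weighted vote over all tables that contain the entity. The table identifiers, edge weights, entity names, and the functions `personalized_pagerank` and `augment` are all hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def personalized_pagerank(edges, seeds, alpha=0.15, iters=50):
    """Power iteration for PageRank with restarts to a seed set.

    edges: dict mapping node -> {neighbor: weight}; weights are normalized per node.
    seeds: nodes that directly match the query (the restart targets).
    """
    nodes = set(edges) | {v for nbrs in edges.values() for v in nbrs} | set(seeds)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    score = dict(restart)
    for _ in range(iters):
        nxt = {n: alpha * restart[n] for n in nodes}
        for u in nodes:
            out = edges.get(u, {})
            total = sum(out.values())
            if total == 0:
                # Dangling node: send its mass back to the seeds.
                for s in seeds:
                    nxt[s] += (1 - alpha) * score[u] / len(seeds)
                continue
            for v, w in out.items():
                nxt[v] += (1 - alpha) * score[u] * w / total
        score = nxt
    return score


def augment(entity, tables, score):
    """Pick the attribute value with the highest total table score."""
    votes = defaultdict(float)
    for table_id, rows in tables.items():
        if entity in rows:
            votes[rows[entity]] += score.get(table_id, 0.0)
    return max(votes, key=votes.get) if votes else None


# Hypothetical tables mapping an entity (camera model) to an attribute (brand).
tables = {
    "T1": {"modelA": "BrandX"},                      # matches the query directly
    "T2": {"modelA": "BrandX", "modelB": "BrandX"},  # overlaps with T1
    "T3": {"modelB": "BrandX"},                      # reachable only through T2
    "T4": {"modelB": "BrandY"},                      # weakly connected, spurious
}

# Hypothetical symmetric table-to-table match scores.
edges = {
    "T1": {"T2": 0.8},
    "T2": {"T1": 0.8, "T3": 0.6},
    "T3": {"T2": 0.6, "T4": 0.1},
    "T4": {"T3": 0.1},
}

score = personalized_pagerank(edges, seeds={"T1"})
prediction = augment("modelB", tables, score)
print(prediction)
```

In this toy graph, `modelB` never appears in the directly matching table `T1`, so a direct-match approach would return nothing for it. Propagating scores through the graph lets `T2` and `T3` contribute, while the weakly connected `T4` receives little score and is outvoted.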
Index Terms
- InfoGather: entity augmentation and attribute discovery by holistic matching with web tables