DOI: 10.1145/2213836.2213848
research-article

InfoGather: entity augmentation and attribute discovery by holistic matching with web tables

Published: 20 May 2012

ABSTRACT

The Web contains a vast corpus of HTML tables, in particular entity-attribute tables. We present three core operations that are useful for "information gathering" tasks (e.g., researching products or stocks): entity augmentation by attribute name, entity augmentation by example, and attribute discovery. We propose to use a web table corpus to perform them automatically. We require the operations to have high precision and coverage, to have fast (ideally interactive) response times, and to be applicable to any domain of entities. The naive approach, which attempts to match the user input directly against the web tables, suffers from poor precision and coverage.

Our key insight is that we can achieve much higher precision and coverage by considering indirectly matching tables in addition to the directly matching ones. The challenge is to be robust to spuriously matched tables. We address it by developing a holistic matching framework based on topic-sensitive PageRank and an augmentation framework that aggregates predictions from multiple matched tables. We propose a novel architecture that leverages preprocessing in MapReduce to achieve extremely fast response times at query time. Our experiments on real-life datasets and 573M web tables show that, compared with the state-of-the-art approach, our approach has (i) significantly higher precision and coverage and (ii) response times that are four orders of magnitude faster.
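
To make the two ideas in this paragraph concrete, here is a minimal sketch, not the paper's implementation. It assumes a small hypothetical graph whose nodes are web tables and whose edge weights are table-to-table match scores. It runs a textbook power iteration for topic-sensitive (personalized) PageRank, teleporting only to the directly matching tables, and then picks an attribute value by a score-weighted vote across tables. The table names, weights, values, and the helper names `personalized_pagerank` and `aggregate_predictions` are all illustrative assumptions; the abstract does not specify how the graph is built or how predictions are combined.

```python
from collections import defaultdict


def personalized_pagerank(edges, seeds, alpha=0.85, iterations=50):
    """Power iteration for topic-sensitive (personalized) PageRank.

    edges: dict node -> dict neighbor -> weight (hypothetical match graph)
    seeds: dict node -> teleport weight (the directly matching tables)
    alpha: probability of following an edge rather than teleporting
    """
    nodes = set(edges) | set(seeds)
    for neighbors in edges.values():
        nodes |= set(neighbors)

    # Teleport distribution is concentrated on the seed tables.
    total_seed = sum(seeds.values())
    teleport = {n: seeds.get(n, 0.0) / total_seed for n in nodes}

    rank = dict(teleport)
    for _ in range(iterations):
        nxt = {n: (1 - alpha) * teleport[n] for n in nodes}
        for u in nodes:
            out = edges.get(u, {})
            out_weight = sum(out.values())
            if out_weight == 0:
                # Dangling node: send its mass back to the seeds.
                for n in nodes:
                    nxt[n] += alpha * rank[u] * teleport[n]
                continue
            for v, w in out.items():
                nxt[v] += alpha * rank[u] * w / out_weight
        rank = nxt
    return rank


def aggregate_predictions(table_values, scores):
    """Score-weighted vote over the values predicted by individual tables."""
    votes = defaultdict(float)
    for table, value in table_values.items():
        votes[value] += scores.get(table, 0.0)
    return max(votes, key=votes.get)


# Hypothetical example: T1 matches the query directly; T2 and T3 are reached
# only indirectly; T4 is a weakly connected, spuriously matched table.
edges = {
    "T1": {"T2": 0.9, "T4": 0.1},
    "T2": {"T1": 0.9, "T3": 0.8},
    "T3": {"T2": 0.8},
    "T4": {"T1": 0.1},
}
scores = personalized_pagerank(edges, seeds={"T1": 1.0})

# Values each table proposes for one entity's attribute (made up).
table_values = {"T2": "Nikon", "T3": "Nikon", "T4": "Canon"}
best = aggregate_predictions(table_values, scores)
```

In this toy setup the indirectly matching tables T2 and T3 receive far more score than the weakly linked T4, so the weighted vote is not swayed by the spurious table. This is the kind of robustness the paragraph describes, shown here only under the stated assumptions.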

Published in

SIGMOD '12: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
May 2012
886 pages
ISBN: 9781450312479
DOI: 10.1145/2213836

Copyright © 2012 ACM


Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

SIGMOD '12 paper acceptance rate: 48 of 289 submissions, 17%
Overall acceptance rate: 785 of 4,003 submissions, 20%
