skip to main content
research-article

SEMA-JOIN: joining semantically-related tables using big table corpora

Published:01 August 2015Publication History
Skip Abstract Section

Abstract

Join is a powerful operator that combines records from two or more tables, which is of fundamental importance in the field of relational database. However, traditional join processing mostly relies on string equality comparisons. Given the growing demand for ad-hoc data analysis, we have seen an increasing number of scenarios where the desired join relationship is not equi-join. For example, in a spreadsheet environment, a user may want to join one table with a subject column country-name, with another table with a subject column country-code. Traditional equi-join cannot handle such joins automatically, and the user typically has to manually find an intermediate mapping table in order to perform the desired join.

We develop a SEMA-JOIN approach that is a first step toward allowing users to perform semantic join automatically, with a click of the button. Our main idea is to utilize a data-driven method that leverages a big table corpus with over 100 million tables to determine statistical correlation between cell values at both row-level and column-level. We use the intuition that the correct join mapping is the one that maximizes aggregate pairwise correlation, to formulate the join prediction problem as an optimization problem. We develop a linear program relaxation and a rounding argument to obtain a 2-approximation algorithm in polynomial time. Our evaluation using both public tables from the Web and proprietary Enterprise tables from a large company shows that the proposed approach can perform automatic semantic joins with high precision for a variety of common join scenarios.

References

  1. Fips country codes. http://en.wikipedia.org/wiki/List_of_FIPS_country_codes.Google ScholarGoogle Scholar
  2. Google Web Tables. http://research.google.com/tables.Google ScholarGoogle Scholar
  3. Iso country codes. http://en.wikipedia.org/wiki/ISO_3166-1.Google ScholarGoogle Scholar
  4. Microsoft Excel Power Query. http://office.microsoft.com/powerbi.Google ScholarGoogle Scholar
  5. Microsoft Solver Foundation. http://msdn.microsoft.com/en-us/library/ff524509.aspx.Google ScholarGoogle Scholar
  6. C. Bizer. Search joins with the web. In Proceedings of International Conference on Database Theory (ICDT), 2014.Google ScholarGoogle Scholar
  7. K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of SIGMOD, 2008. Google ScholarGoogle Scholar
  8. M. J. Cafarella, A. Y. Halevy, and N. Khoussainova. Data integration for the relational web. In VLDB, 2009. Google ScholarGoogle Scholar
  9. S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In Proceedings of ICDE, 2006. Google ScholarGoogle Scholar
  10. A. Chuklin, P. Serdyukov, and M. de Rijke. Modeling clicks beyond the first result page. In Proceedings of CIKM, 2013. Google ScholarGoogle Scholar
  11. K. W. Church and P. Hanks. Word association norms, mutual information, and lexicography. In Computational Linguistics, 1990. Google ScholarGoogle Scholar
  12. W. W. Cohen. Data integration using similarity joins and a word-based information representation language. In Transactions on Information Systems, 2000. Google ScholarGoogle Scholar
  13. D. J. DeWitt, R. H. Katz, F. Olken, L. D. Shapiro, M. Stonebraker, and D. A. Wood. Implementation techniques for main memory database systems. In Proceedings of SIGMOD, 1984. Google ScholarGoogle Scholar
  14. A. Elmagarmid, P. G. Ipeirotis, and V. Verykios. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering, 2007. Google ScholarGoogle Scholar
  15. U. Feige and M. Seltser. On the densest k-subgraph problem. In Algorithmica, 1997.Google ScholarGoogle Scholar
  16. O. Hassanzadeh, K. Q. Pu, S. H. Yeganeh, R. J. Miller, L. Popa, M. A. Hernández, and H. Ho. Discovering linkage points over web data. In Proceedings of VLDB, 2013. Google ScholarGoogle Scholar
  17. P. Pardalos and S. Vavasis. Quadratic programming with one negative eigenvalue is np-hard. Journal of Global Optimization, 1991.Google ScholarGoogle Scholar
  18. P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, I. A. Lorie, and T. G. Price. Access path selection in a relational database management system. In proceedings of SIGMOD, 1979. Google ScholarGoogle Scholar
  19. T. Simpson. Evaluating google as an epistemic tool. In Metaphilosophy, 2012.Google ScholarGoogle Scholar
  20. R. H. Warren and F. W. Tompa. Multi-column substring matching for database schema translation. In Proceedings of VLDB, 2006. Google ScholarGoogle Scholar
  21. C. Xiao, W. Wang, X. Lin, and J. Yu. Efficient similarity joins for near duplicate detection. In Proceedings of WWW, 2008. Google ScholarGoogle Scholar
  22. M. Yakout, K. Ganjam, K. Chakrabarti, and S. Chaudhuri. Infogather: entity augmentation and attribute discovery by holistic matching with web tables. In Proceedings of SIGMOD, 2012. Google ScholarGoogle Scholar

Index Terms

  1. SEMA-JOIN: joining semantically-related tables using big table corpora
    Index terms have been assigned to the content through auto-classification.

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in

    Full Access

    • Published in

      cover image Proceedings of the VLDB Endowment
      Proceedings of the VLDB Endowment  Volume 8, Issue 12
      Proceedings of the 41st International Conference on Very Large Data Bases, Kohala Coast, Hawaii
      August 2015
      728 pages

      Publisher

      VLDB Endowment

      Publication History

      • Published: 1 August 2015
      Published in pvldb Volume 8, Issue 12

      Qualifiers

      • research-article

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader