research-article

SEMA-JOIN: joining semantically-related tables using big table corpora

Authors:
Yeye He

Microsoft Research, Redmond, WA

Microsoft Research, Redmond, WA
View Profile

,
Kris Ganjam

Microsoft Research, Redmond, WA

Microsoft Research, Redmond, WA
View Profile

,
Xu Chu

University of Waterloo, Waterloo, ON, Canada

University of Waterloo, Waterloo, ON, Canada
View Profile

Proceedings of the VLDB Endowment Volume 8 Issue 12pp 1358–1369https://doi.org/10.14778/2824032.2824036

Published:01 August 2015Publication History

Proceedings of the VLDB Endowment

Abstract

Join is a powerful operator that combines records from two or more tables, which is of fundamental importance in the field of relational database. However, traditional join processing mostly relies on string equality comparisons. Given the growing demand for ad-hoc data analysis, we have seen an increasing number of scenarios where the desired join relationship is not equi-join. For example, in a spreadsheet environment, a user may want to join one table with a subject column country-name, with another table with a subject column country-code. Traditional equi-join cannot handle such joins automatically, and the user typically has to manually find an intermediate mapping table in order to perform the desired join.

We develop a SEMA-JOIN approach that is a first step toward allowing users to perform semantic join automatically, with a click of the button. Our main idea is to utilize a data-driven method that leverages a big table corpus with over 100 million tables to determine statistical correlation between cell values at both row-level and column-level. We use the intuition that the correct join mapping is the one that maximizes aggregate pairwise correlation, to formulate the join prediction problem as an optimization problem. We develop a linear program relaxation and a rounding argument to obtain a 2-approximation algorithm in polynomial time. Our evaluation using both public tables from the Web and proprietary Enterprise tables from a large company shows that the proposed approach can perform automatic semantic joins with high precision for a variety of common join scenarios.

References

Fips country codes. http://en.wikipedia.org/wiki/List_of_FIPS_country_codes.Google Scholar
Google Web Tables. http://research.google.com/tables.Google Scholar
Iso country codes. http://en.wikipedia.org/wiki/ISO_3166-1.Google Scholar
Microsoft Excel Power Query. http://office.microsoft.com/powerbi.Google Scholar
Microsoft Solver Foundation. http://msdn.microsoft.com/en-us/library/ff524509.aspx.Google Scholar
C. Bizer. Search joins with the web. In Proceedings of International Conference on Database Theory (ICDT), 2014.Google Scholar
K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of SIGMOD, 2008. Google Scholar
M. J. Cafarella, A. Y. Halevy, and N. Khoussainova. Data integration for the relational web. In VLDB, 2009. Google Scholar
S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In Proceedings of ICDE, 2006. Google Scholar
A. Chuklin, P. Serdyukov, and M. de Rijke. Modeling clicks beyond the first result page. In Proceedings of CIKM, 2013. Google Scholar
K. W. Church and P. Hanks. Word association norms, mutual information, and lexicography. In Computational Linguistics, 1990. Google Scholar
W. W. Cohen. Data integration using similarity joins and a word-based information representation language. In Transactions on Information Systems, 2000. Google Scholar
D. J. DeWitt, R. H. Katz, F. Olken, L. D. Shapiro, M. Stonebraker, and D. A. Wood. Implementation techniques for main memory database systems. In Proceedings of SIGMOD, 1984. Google Scholar
A. Elmagarmid, P. G. Ipeirotis, and V. Verykios. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering, 2007. Google Scholar
U. Feige and M. Seltser. On the densest k-subgraph problem. In Algorithmica, 1997.Google Scholar
O. Hassanzadeh, K. Q. Pu, S. H. Yeganeh, R. J. Miller, L. Popa, M. A. Hernández, and H. Ho. Discovering linkage points over web data. In Proceedings of VLDB, 2013. Google Scholar
P. Pardalos and S. Vavasis. Quadratic programming with one negative eigenvalue is np-hard. Journal of Global Optimization, 1991.Google Scholar
P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, I. A. Lorie, and T. G. Price. Access path selection in a relational database management system. In proceedings of SIGMOD, 1979. Google Scholar
T. Simpson. Evaluating google as an epistemic tool. In Metaphilosophy, 2012.Google Scholar
R. H. Warren and F. W. Tompa. Multi-column substring matching for database schema translation. In Proceedings of VLDB, 2006. Google Scholar
C. Xiao, W. Wang, X. Lin, and J. Yu. Efficient similarity joins for near duplicate detection. In Proceedings of WWW, 2008. Google Scholar
M. Yakout, K. Ganjam, K. Chakrabarti, and S. Chaudhuri. Infogather: entity augmentation and attribute discovery by holistic matching with web tables. In Proceedings of SIGMOD, 2012. Google Scholar

Index Terms

SEMA-JOIN: joining semantically-related tables using big table corpora
1. Information systems
  1. Information retrieval

Index terms have been assigned to the content through auto-classification.

Recommendations

Multi-way spatial join selectivity for the ring join graph

Efficient spatial query processing is very important since the applications of the spatial DBMS (e.g. GIS, CAD/CAM, LBS) handle massive amount of data and consume much time. Many spatial queries contain the multi-way spatial join due to the fact that ...
Read More
Processing multi-join queries
Read More
Distributed stream join query processing with semijoins

This paper addresses the distributed stream processing of window-based multi-way join queries considering the semijoin as a key join operator. In distributed stream processing, data streams arriving at remote sites need to be shipped to the processing ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

Proceedings of the VLDB Endowment Volume 8, Issue 12
Proceedings of the 41st International Conference on Very Large Data Bases, Kohala Coast, Hawaii
August 2015
728 pages
ISSN:2150-8097
Issue’s Table of Contents
Sponsors
In-Cooperation
Publisher
VLDB Endowment
Publication History
- Published: 1 August 2015
Published in pvldb Volume 8, Issue 12
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 16
  Total Citations
  View Citations
- 224
  Total Downloads
- Downloads (Last 12 months)19
- Downloads (Last 6 weeks)3
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

SEMA-JOIN: joining semantically-related tables using big table corpora

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

Multi-way spatial join selectivity for the ring join graph

Processing multi-join queries

Distributed stream join query processing with semijoins

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

SEMA-JOIN: joining semantically-related tables using big table corpora

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

Multi-way spatial join selectivity for the ring join graph

Processing multi-join queries

Distributed stream join query processing with semijoins

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media