Abstract
Join is a powerful operator that combines records from two or more tables, which is of fundamental importance in the field of relational database. However, traditional join processing mostly relies on string equality comparisons. Given the growing demand for ad-hoc data analysis, we have seen an increasing number of scenarios where the desired join relationship is not equi-join. For example, in a spreadsheet environment, a user may want to join one table with a subject column country-name, with another table with a subject column country-code. Traditional equi-join cannot handle such joins automatically, and the user typically has to manually find an intermediate mapping table in order to perform the desired join.
We develop a SEMA-JOIN approach that is a first step toward allowing users to perform semantic join automatically, with a click of the button. Our main idea is to utilize a data-driven method that leverages a big table corpus with over 100 million tables to determine statistical correlation between cell values at both row-level and column-level. We use the intuition that the correct join mapping is the one that maximizes aggregate pairwise correlation, to formulate the join prediction problem as an optimization problem. We develop a linear program relaxation and a rounding argument to obtain a 2-approximation algorithm in polynomial time. Our evaluation using both public tables from the Web and proprietary Enterprise tables from a large company shows that the proposed approach can perform automatic semantic joins with high precision for a variety of common join scenarios.
- Fips country codes. http://en.wikipedia.org/wiki/List_of_FIPS_country_codes.Google Scholar
- Google Web Tables. http://research.google.com/tables.Google Scholar
- Iso country codes. http://en.wikipedia.org/wiki/ISO_3166-1.Google Scholar
- Microsoft Excel Power Query. http://office.microsoft.com/powerbi.Google Scholar
- Microsoft Solver Foundation. http://msdn.microsoft.com/en-us/library/ff524509.aspx.Google Scholar
- C. Bizer. Search joins with the web. In Proceedings of International Conference on Database Theory (ICDT), 2014.Google Scholar
- K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of SIGMOD, 2008. Google Scholar
- M. J. Cafarella, A. Y. Halevy, and N. Khoussainova. Data integration for the relational web. In VLDB, 2009. Google Scholar
- S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In Proceedings of ICDE, 2006. Google Scholar
- A. Chuklin, P. Serdyukov, and M. de Rijke. Modeling clicks beyond the first result page. In Proceedings of CIKM, 2013. Google Scholar
- K. W. Church and P. Hanks. Word association norms, mutual information, and lexicography. In Computational Linguistics, 1990. Google Scholar
- W. W. Cohen. Data integration using similarity joins and a word-based information representation language. In Transactions on Information Systems, 2000. Google Scholar
- D. J. DeWitt, R. H. Katz, F. Olken, L. D. Shapiro, M. Stonebraker, and D. A. Wood. Implementation techniques for main memory database systems. In Proceedings of SIGMOD, 1984. Google Scholar
- A. Elmagarmid, P. G. Ipeirotis, and V. Verykios. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering, 2007. Google Scholar
- U. Feige and M. Seltser. On the densest k-subgraph problem. In Algorithmica, 1997.Google Scholar
- O. Hassanzadeh, K. Q. Pu, S. H. Yeganeh, R. J. Miller, L. Popa, M. A. Hernández, and H. Ho. Discovering linkage points over web data. In Proceedings of VLDB, 2013. Google Scholar
- P. Pardalos and S. Vavasis. Quadratic programming with one negative eigenvalue is np-hard. Journal of Global Optimization, 1991.Google Scholar
- P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, I. A. Lorie, and T. G. Price. Access path selection in a relational database management system. In proceedings of SIGMOD, 1979. Google Scholar
- T. Simpson. Evaluating google as an epistemic tool. In Metaphilosophy, 2012.Google Scholar
- R. H. Warren and F. W. Tompa. Multi-column substring matching for database schema translation. In Proceedings of VLDB, 2006. Google Scholar
- C. Xiao, W. Wang, X. Lin, and J. Yu. Efficient similarity joins for near duplicate detection. In Proceedings of WWW, 2008. Google Scholar
- M. Yakout, K. Ganjam, K. Chakrabarti, and S. Chaudhuri. Infogather: entity augmentation and attribute discovery by holistic matching with web tables. In Proceedings of SIGMOD, 2012. Google Scholar
Index Terms
- SEMA-JOIN: joining semantically-related tables using big table corpora
Recommendations
Multi-way spatial join selectivity for the ring join graph
Efficient spatial query processing is very important since the applications of the spatial DBMS (e.g. GIS, CAD/CAM, LBS) handle massive amount of data and consume much time. Many spatial queries contain the multi-way spatial join due to the fact that ...
Distributed stream join query processing with semijoins
This paper addresses the distributed stream processing of window-based multi-way join queries considering the semijoin as a key join operator. In distributed stream processing, data streams arriving at remote sites need to be shipped to the processing ...
Comments