Abstract
Schema matching is a central challenge for data integration systems. Automated tools are often uncertain about the schema matchings they suggest, and this uncertainty is inherent, since it arises from the inability of the schema to fully capture the semantics of the represented data. Human common sense can often help. Inspired by the popularity and success of easily accessible crowdsourcing platforms, we explore the use of crowdsourcing to reduce the uncertainty of schema matching.
Since it is typical to ask simple questions on crowdsourcing platforms, we assume that each question, called a Correspondence Correctness Question (CCQ), asks the crowd to decide whether a given correspondence should exist in the correct matching. We propose frameworks and efficient algorithms to dynamically manage the CCQs in order to maximize the uncertainty reduction within a limited budget of questions. We develop two novel approaches, namely "Single CCQ" and "Multiple CCQ", which adaptively select, publish, and manage the questions. We verify the value of our solutions through simulation and a real implementation.
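To make the idea of choosing a CCQ concrete, the following is a minimal sketch and not the paper's algorithm. It assumes that uncertainty is measured as the Shannon entropy of a probability distribution over candidate matchings, that the crowd's yes/no answer is always correct, and that one question is picked greedily at a time. The abstract states none of these assumptions. The function names, attribute names, and probabilities are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in bits) of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_reduction(matchings, corr):
    """Expected entropy drop from asking whether `corr` is in the correct matching.

    `matchings` maps each candidate matching (a frozenset of correspondences)
    to its probability. Assumption: the crowd's answer is always correct.
    """
    prior = entropy(matchings.values())
    yes = [p for m, p in matchings.items() if corr in m]
    no = [p for m, p in matchings.items() if corr not in m]
    expected_posterior = 0.0
    for group in (yes, no):
        mass = sum(group)  # probability of receiving this answer
        if mass > 0:
            # Renormalize the matchings consistent with the answer.
            expected_posterior += mass * entropy(p / mass for p in group)
    return prior - expected_posterior


def pick_question(matchings):
    """Greedily pick the correspondence with the largest expected reduction."""
    candidates = sorted(set().union(*matchings))
    return max(candidates, key=lambda c: expected_reduction(matchings, c))


# Hypothetical candidate matchings between two small schemas.
matchings = {
    frozenset({("name", "full_name"), ("addr", "address")}): 0.5,
    frozenset({("name", "full_name"), ("addr", "city")}): 0.3,
    frozenset({("name", "nickname"), ("addr", "street")}): 0.2,
}

best = pick_question(matchings)
gain = expected_reduction(matchings, best)
```

In this toy input, the question about `("addr", "address")` splits the probability mass evenly between a "yes" and a "no" answer, so it is the one selected. Under a limited budget, a loop would update `matchings` with the answer and call `pick_question` again until the budget is spent.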
Index Terms
- Reducing uncertainty of schema matching via crowdsourcing