Published in: The VLDB Journal 5/2015

1 October 2015 | Special Issue Paper

Active learning in keyword search-based data integration

Authors: Zhepeng Yan, Nan Zheng, Zachary G. Ives, Partha Pratim Talukdar, Cong Yu


Abstract

The problem of scaling up data integration, such that new sources can be quickly utilized as they are discovered, remains elusive: Global schemas for integrated data are difficult to develop and expand, and schema and record matching techniques are limited by the fact that data and metadata are often under-specified and must be disambiguated by data experts. One promising approach is to avoid using a global schema, and instead to develop keyword search-based data integration—where the system lazily discovers associations enabling it to join together matches to keywords, and return ranked results. The user is expected to understand the data domain and provide feedback about answers’ quality. The system generalizes such feedback to learn how to correctly integrate data. A major open challenge is that under this model, the user only sees and offers feedback on a few “top-\(k\)” results: This result set must be carefully selected to include answers of high relevance and answers that are highly informative when feedback is given on them. Existing systems merely focus on predicting relevance, by composing the scores of various schema and record matching algorithms. In this paper, we show how to predict the uncertainty associated with a query result’s score, as well as how informative feedback is on a given result. We build upon these foundations to develop an active learning approach to keyword search-based data integration, and we validate the effectiveness of our solution over real data from several very different domains.
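The trade-off described in the abstract, showing highly relevant answers while also showing answers whose feedback would be informative, can be illustrated with a minimal sketch. This is not the authors' algorithm: the `Candidate` structure, the `select_for_feedback` function, and the `n_explore` parameter are hypothetical names introduced here, and informativeness is approximated by score uncertainty alone, which is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A candidate query answer (hypothetical structure, not from the paper)."""
    answer_id: str
    relevance: float    # predicted relevance score, higher is better
    uncertainty: float  # e.g. an estimated variance of that score


def select_for_feedback(candidates, k, n_explore):
    """Pick k answers: the most relevant ones plus n_explore uncertain ones.

    Assumption: informativeness is approximated by score uncertainty only.
    """
    if not 0 <= n_explore <= k:
        raise ValueError("n_explore must be between 0 and k")

    # Slots reserved for the answers predicted to be most relevant.
    by_relevance = sorted(candidates, key=lambda c: c.relevance, reverse=True)
    exploit = by_relevance[: k - n_explore]

    # Remaining slots go to the most uncertain answers not already chosen.
    chosen = {c.answer_id for c in exploit}
    remaining = [c for c in by_relevance if c.answer_id not in chosen]
    explore = sorted(remaining, key=lambda c: c.uncertainty, reverse=True)
    return exploit + explore[:n_explore]


pool = [
    Candidate("a", relevance=0.9, uncertainty=0.01),
    Candidate("b", relevance=0.8, uncertainty=0.02),
    Candidate("c", relevance=0.5, uncertainty=0.05),
    Candidate("d", relevance=0.4, uncertainty=0.30),
]
shown = select_for_feedback(pool, k=3, n_explore=1)
print([c.answer_id for c in shown])
```

With `n_explore=0` the selection reduces to a plain relevance-only top-\(k\), which corresponds to the behavior the abstract attributes to existing systems.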

Footnotes
1
Consider, e.g., the situation where users put data into comments fields because there was no appropriate column in the schema.
 
2
By default tf-idf over the tuples in the data, although other metrics such as edit distance or \(n\)-grams could be used.
 
3
For simplicity, we describe the outcome as if each query produces one result, although the system actually iteratively enumerates top-scoring queries, even beyond \(k\) such queries, until it gets \(k\) answers.
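The enumeration described in footnote 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's implementation: `collect_top_k_answers` and the `execute` callback are hypothetical names, the queries are assumed to arrive already sorted by descending score, and each answer is assumed to rank no higher than the query that produced it.

```python
def collect_top_k_answers(ranked_queries, execute, k):
    """Run queries in descending score order until k answers are gathered.

    ranked_queries: iterable of queries, best-scoring first (assumed).
    execute: hypothetical callback returning the list of answers for a query;
             it may return an empty list, which is why more than k queries
             may need to be enumerated.
    """
    answers = []
    if k <= 0:
        return answers
    for query in ranked_queries:
        for answer in execute(query):
            answers.append(answer)
            if len(answers) == k:
                return answers  # stop as soon as k answers are available
    return answers  # fewer than k answers exist in total


# Toy data: the second-ranked query matches nothing in the sources.
toy_results = {"q1": ["a"], "q2": [], "q3": ["b", "c"], "q4": ["d"]}
top = collect_top_k_answers(["q1", "q2", "q3", "q4"], toy_results.__getitem__, k=3)
print(top)
```

In the toy data three answers require enumerating three queries, because `q2` produces no result.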
 
Metadata
Title
Active learning in keyword search-based data integration
Authors
Zhepeng Yan
Nan Zheng
Zachary G. Ives
Partha Pratim Talukdar
Cong Yu
Publication date
1 October 2015
Publisher
Springer Berlin Heidelberg
Published in
The VLDB Journal / Issue 5/2015
Print ISSN: 1066-8888
Electronic ISSN: 0949-877X
DOI
https://doi.org/10.1007/s00778-014-0374-x
