Published in: Information Systems Frontiers 3/2013

1 July 2013

Beyond search: Retrieving complete tuples from a text-database

Abstract

A common task of Web users is querying structured information from Web pages. To support this scenario, we propose a novel query processor that systematically discovers instances of semantic relations in Web search results and joins these relation instances into complex result tuples using conjunctive queries. Our query processor transforms a structured user query into keyword queries that are submitted to a search engine, forwards the search results to a relation extractor, and then combines the extracted relations into complex result tuples. The processor automatically learns discriminative and effective keywords for different types of semantic relations. In this way, it leverages the index of a search engine to query potentially billions of pages. Unfortunately, relation extractors may fail to return a relation for a result tuple. Moreover, user-defined data sources may return fewer than k complete result tuples. Therefore, we propose an adaptive routing model based on information theory for retrieving the missing attributes of incomplete result tuples. The model determines the most promising next incomplete tuple and attribute type, so that any-k complete result tuples can be returned at any point during query execution. We report a thorough experimental evaluation over multiple relation extractors. Our query processor returns complete result tuples while processing only a few Web pages.
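The routing step described in the abstract can be illustrated with a small sketch. This is not the model from the paper, whose details are not given here; it is a simplified stand-in that assumes each attribute type has a known probability that one probe (a keyword query followed by extraction) returns a value, and that scores an incomplete tuple by the summed self-information of its missing attributes. The function names, attribute names, probabilities and the rule for choosing an attribute within a tuple are all hypothetical.

```python
import math


def surprisal(p):
    """Self-information, in bits, of an event with probability p."""
    return -math.log2(p)


def next_probe(tuples, success_prob):
    """Pick the (tuple index, attribute) to probe next, or None if all are complete.

    tuples: list of dicts mapping attribute name -> value, or None if missing.
    success_prob: dict mapping attribute name -> assumed probability that a
                  single probe yields a value for that attribute (hypothetical).
    """
    best = None
    for i, t in enumerate(tuples):
        missing = [a for a, v in t.items() if v is None]
        if not missing:
            continue  # tuple is already complete
        # Fewer bits means the tuple is more likely to be completed.
        cost = sum(surprisal(success_prob[a]) for a in missing)
        if best is None or cost < best[0]:
            # Assumption: probe the hardest attribute first, so that
            # hopeless tuples are abandoned early.
            attr = min(missing, key=lambda a: success_prob[a])
            best = (cost, i, attr)
    return None if best is None else (best[1], best[2])


# Hypothetical example data: three partially filled result tuples.
tuples = [
    {"company": "A", "ceo": None, "headquarters": None},
    {"company": "B", "ceo": "X", "headquarters": None},
    {"company": "C", "ceo": "Y", "headquarters": "Z"},
]
success_prob = {"company": 0.9, "ceo": 0.5, "headquarters": 0.25}

print(next_probe(tuples, success_prob))  # (1, 'headquarters')
```

In this example the second tuple is chosen because it needs only 2 bits (one missing attribute with probability 0.25), whereas the first needs 3 bits (1 bit for `ceo` plus 2 bits for `headquarters`). Minimizing the summed self-information is equivalent to maximizing the product of the success probabilities, under the assumption that probes are independent.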

Footnotes
1
For instance, Yahoo Boss (Clarke et al. 2008) currently charges $0.30 per 1,000 requests for the top-10 search results, and OpenCalais (Croft et al. 2009) charges $2,000 per 3,000,000 pages extracted with its service.

References
Agichtein, E., & Gravano, L. (2003). Qxtract: a building block for efficient information extraction from plain-text databases. In SIGMOD conference (p. 663).
Avnur, R., & Hellerstein, J.M. (2000). Eddies: continuously adaptive query processing. In SIGMOD conference (pp. 261–272).
Banko, M., & Etzioni, O. (2008). The tradeoffs between open and traditional relation extraction. In ACL (pp. 28–36).
Boden, C., Hafele, T., Löser, A. (2011). Classification algorithms for relation prediction. In ICDE workshops (pp. 46–52).
Boden, C., Löser, A., Nagel, C., Pieper, S. (2011). Factcrawl: a fact retrieval framework for full-text indices. In 14th WebDB workshop with ACM SIGMOD.
Boden, C., Löser, A., Nagel, C., Pieper, S. (2012). Fact-aware document retrieval for information extraction. Datenbank-Spektrum, 12, 89–100.
Castellanos, M., Wang, S., Dayal, U., Gupta, C. (2010). Sie-obi: a streaming information extraction platform for operational business intelligence. In SIGMOD conference (pp. 1105–1110).
Chakrabarti, S., Sarawagi, S., Sudarshan, S. (2010). Enhancing search with structure. IEEE Data Engineering Bulletin, 33(1), 3–24.
Clarke, C.L.A., Kolla, M., Cormack, G.V., Vechtomova, O., Ashkan, A., Büttcher, S., MacKinnon, I. (2008). Novelty and diversity in information retrieval evaluation. In SIGIR (pp. 659–666).
Croft, B., Metzler, D., Strohman, T. (2009). Search engines: Information retrieval in practice (1st ed.). USA: Addison-Wesley Publishing Company.
Crow, D. (2010). Google Squared: Web scale, open domain information extraction and presentation. In ECIR, industrial track.
DeRose, P., Shen, W., Chen, F., Doan, A., Ramakrishnan, R. (2007a). Building structured web community portals: A top-down, compositional, and incremental approach. In VLDB (pp. 399–410).
DeRose, P., Shen, W., Chen, F., Lee, Y., Burdick, D., Doan, A., Ramakrishnan, R. (2007b). Dblife: A community information management platform for the database research community (demo). In CIDR (pp. 169–172).
Dong, X., Halevy, A., Madhavan, J. (2005). Reference reconciliation in complex information spaces. In ACM SIGMOD (pp. 85–96).
Etzioni, O., Banko, M., Soderland, S., Weld, D.S. (2008). Open information extraction from the web. Communications of the ACM, 51(12), 68–74.
Feldman, R., Regev, Y., Gorodetsky, M. (2008). A modular information extraction system. Intelligent Data Analysis, 12(1), 51–71.
Fung, G.P.C., Yu, J.X., Lu, H. (2002). Discriminative category matching: Efficient text classification for huge document collections. In ICDM (pp. 187–194).
Galhardas, H., Florescu, D., Shasha, D., Simon, E., Saita, C.A. (2001). Declarative data cleaning: language, model, and algorithms. In VLDB (pp. 371–380).
Grishman, R., Huttunen, S., Yangarber, R. (2002). Information extraction for enhanced access to disease outbreak reports. Journal of Biomedical Informatics, 35(4), 236–246.
Halevy, A.Y. (2001). Answering queries using views: a survey. The VLDB Journal, 10, 270–294.
Ilyas, I.F., Beskales, G., Soliman, M.A. (2008). A survey of top-k query processing techniques in relational database systems. ACM Computing Surveys, 40(4).
Ipeirotis, P.G., Agichtein, E., Jain, P., Gravano, L. (2006). To search or to crawl?: towards a query optimizer for text-centric tasks. In SIGMOD conference (pp. 265–276).
Jain, A., Doan, A., Gravano, L. (2008). Optimizing SQL queries over text databases. In ICDE (pp. 636–645).
Jain, A., Ipeirotis, P.G., Doan, A., Gravano, L. (2009). Join optimization of information extraction output: quality matters! In ICDE (pp. 186–197).
Jain, A., & Pantel, P. (2010). Factrank: random walks on a web of facts. In COLING (pp. 501–509).
Jain, A., & Srivastava, D. (2009). Exploring a few good tuples from text databases. In ICDE (pp. 616–627).
Kasneci, G., Suchanek, F.M., Ramanath, M., Weikum, G. (2008). The YAGO-NAGA approach to knowledge discovery. SIGMOD Record, 37, 4.
Liu, J., Dong, X., Halevy, A.Y. (2006). Answering structured queries on unstructured data. In WebDB.
Löser, A., Hüske, F., Markl, V. (2008). Situational business intelligence. In BIRTE.
Löser, A., Lutter, S., Düssel, P., Markl, V. (2009). Ad-hoc queries over document collections – a case study. In BIRTE (pp. 50–65).
Löser, A., Nagel, C., Pieper, S. (2010). Augmenting tables by self-supervised web search. In BIRTE.
Markl, V., Raman, V., Simmen, D.E., Lohman, G.M., Pirahesh, H. (2004). Robust query processing through progressive optimization. In SIGMOD conference (pp. 659–670).
Naumann, F. (2002). Quality-driven query answering for integrated information systems. Lecture notes in computer science Vol. 2261: Springer.
Pérez-Martínez, J.M., Llavori, R.B., Cabo, M.J.A., Pedersen, T.B. (2008). Contextualizing data warehouses with documents. Decision Support Systems, 45(1), 77–94.
Riloff, E. (1996). Automatically generating extraction patterns from untagged text. AAAI/IAAI, 2, 1044–1049.
Selinger, P.G., Astrahan, M.M., Chamberlin, D.D., Lorie, R.A., Price, T.G. (1988). Access path selection in a relational database management system. In Proceedings of the 1979 ACM SIGMOD international conference on management of data, 30 May–1 June 1979 (pp. 23–34). Boston, Massachusetts.
Wu, F., & Weld, D.S. (2010). Open information extraction using Wikipedia. In ACL (pp. 118–127).
Yu, C., Lakshmanan, L.V.S., Amer-Yahia, S. (2009). It takes variety to make a world: diversification in recommender systems. In EDBT (pp. 368–378).

Metadata
Title
Beyond search: Retrieving complete tuples from a text-database
Publication date
1 July 2013
Published in
Information Systems Frontiers / Issue 3/2013
Print ISSN: 1387-3326
Electronic ISSN: 1572-9419
DOI
https://doi.org/10.1007/s10796-012-9403-8