Information Systems Frontiers, published 1 July 2013

Beyond search: Retrieving complete tuples from a text-database

Published in: Information Systems Frontiers | Issue 3/2013

Abstract

A common task of Web users is querying structured information from Web pages. To support this scenario, we propose a novel query processor that systematically discovers instances of semantic relations in Web search results and joins these relation instances into complex result tuples using conjunctive queries. The query processor transforms a structured user query into keyword queries that are submitted to a search engine, forwards the search results to a relation extractor, and then combines the extracted relations into complex result tuples. The processor automatically learns discriminative and effective keywords for different types of semantic relations. In this way, it leverages the index of a search engine to query potentially billions of pages. Unfortunately, relation extractors may fail to return a relation for a result tuple. Moreover, user-defined data sources may not return at least k complete result tuples. We therefore propose an adaptive routing model based on information theory for retrieving missing attributes of incomplete result tuples. The model determines the most promising next incomplete tuple and attribute type for returning any-k complete result tuples at any point during query execution. We report a thorough experimental evaluation over multiple relation extractors. Our query processor returns complete result tuples while processing only very few Web pages.
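
The routing decision described in the abstract (which incomplete tuple and which attribute type to try next) can be illustrated with a small sketch. This is not the model from the article, whose details the abstract does not give. It assumes a hypothetical table `SUCCESS_PROB` of per-attribute success probabilities, scores each incomplete tuple by the total surprisal (-log2 p) of its missing attributes, and picks the tuple with the lowest score. As a further assumption, it then tries the least likely missing attribute first, so that a tuple that cannot be completed is abandoned early. All tuple values are made-up toy data.

```python
import math

# Hypothetical probabilities that a keyword query plus extractor returns a
# value for each attribute type; a real system would estimate these.
SUCCESS_PROB = {"ceo": 0.8, "headquarters": 0.6, "revenue": 0.3}

def missing(tuple_):
    """Attributes of a result tuple that are still unfilled (None)."""
    return [a for a, v in tuple_.items() if v is None]

def surprisal(tuple_):
    """Total -log2(p) over the missing attributes of a tuple."""
    return sum(-math.log2(SUCCESS_PROB[a]) for a in missing(tuple_))

def next_step(tuples):
    """Return (tuple, attribute) to fetch next, or None if all are complete."""
    incomplete = [t for t in tuples if missing(t)]
    if not incomplete:
        return None
    # Lowest total surprisal = most likely to become complete.
    best = min(incomplete, key=surprisal)
    # Assumed policy: try the least likely attribute first (fail fast).
    attr = min(missing(best), key=lambda a: SUCCESS_PROB[a])
    return best, attr

# Toy data, invented for illustration only.
tuples = [
    {"company": "Acme", "ceo": None, "headquarters": "Berlin", "revenue": None},
    {"company": "Bolt", "ceo": "J. Doe", "headquarters": None, "revenue": "1M"},
    {"company": "Cord", "ceo": "A. Roe", "headquarters": "Oslo", "revenue": "2M"},
]
```

With this toy data, `next_step(tuples)` selects the "Bolt" tuple and the attribute `headquarters`, because that tuple lacks only one attribute with a comparatively high success probability.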

For instance, Yahoo Boss (Clarke et al. 2008) currently charges $0.30 per 1,000 requests for the top-10 search results, and OpenCalais (Croft et al. 2009) charges $2,000 per 3,000,000 pages extracted with their service.
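
To put these prices in perspective, the following sketch estimates the monetary cost of processing a given number of pages. It uses only the two prices quoted above and assumes, for illustration, that every search request yields ten distinct pages and that every page is forwarded to the extraction service. The function name `query_cost` is hypothetical.

```python
import math

SEARCH_USD_PER_REQUEST = 0.30 / 1_000       # one request returns the top-10 results
EXTRACT_USD_PER_PAGE = 2_000 / 3_000_000    # price per extracted page

def query_cost(pages, results_per_request=10):
    """Estimated cost in USD of searching for and extracting `pages` pages."""
    requests = math.ceil(pages / results_per_request)
    return requests * SEARCH_USD_PER_REQUEST + pages * EXTRACT_USD_PER_PAGE
```

Under these assumptions, processing 3,000 pages costs about $2.09, of which $2.00 is extraction. This suggests why a processor that needs only very few pages per complete tuple is attractive.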

Agichtein, E., & Gravano, L. (2003). Qxtract: a building block for efficient information extraction from plain-text databases. In SIGMOD conference (p. 663).

Avnur, R., & Hellerstein, J.M. (2000). Eddies: continuously adaptive query processing. In SIGMOD conference (pp. 261–272).

Banko, M., & Etzioni, O. (2008). The tradeoffs between open and traditional relation extraction. In ACL (pp. 28–36).

Boden, C., Hafele, T., Löser, A. (2011). Classification algorithms for relation prediction. In ICDE workshops (pp. 46–52).

Boden, C., Löser, A., Nagel, C., Pieper, S. (2011). Factcrawl: a fact retrieval framework for full-text indices. In 14th WebDB workshop with ACM SIGMOD.

Boden, C., Löser, A., Nagel, C., Pieper, S. (2012). Fact-aware document retrieval for information extraction. Datenbank-Spektrum, 12, 89–100.

Castellanos, M., Wang, S., Dayal, U., Gupta, C. (2010). Sie-obi: a streaming information extraction platform for operational business intelligence. In SIGMOD conference (pp. 1105–1110).

Chakrabarti, S., Sarawagi, S., Sudarshan, S. (2010). Enhancing search with structure. IEEE Data Engineering Bulletin, 33(1), 3–24.

Clarke, C.L.A., Kolla, M., Cormack, G.V., Vechtomova, O., Ashkan, A., Büttcher, S., MacKinnon, I. (2008). Novelty and diversity in information retrieval evaluation. In SIGIR (pp. 659–666).

Croft, B., Metzler, D., Strohman, T. (2009). Search engines: Information retrieval in practice (1st ed.). USA: Addison-Wesley Publishing Company.

Crow, D. (2010). Google Squared: Web scale, open domain information extraction and presentation. In ECIR, industrial track.

DeRose, P., Shen, W., Chen, F., Doan, A., Ramakrishnan, R. (2007a). Building structured web community portals: A top-down, compositional, and incremental approach. In VLDB (pp. 399–410).

DeRose, P., Shen, W., Chen, F., Lee, Y., Burdick, D., Doan, A., Ramakrishnan, R. (2007b). Dblife: A community information management platform for the database research community (demo). In CIDR (pp. 169–172).

Dong, X., Halevy, A., Madhavan, J. (2005). Reference reconciliation in complex information spaces. In ACM SIGMOD (pp. 85–96).

Etzioni, O., Banko, M., Soderland, S., Weld, D.S. (2008). Open information extraction from the web. Communications of the ACM, 51(12), 68–74.

Feldman, R., Regev, Y., Gorodetsky, M. (2008). A modular information extraction system. Intelligent Data Analysis, 12(1), 51–71.

Fortune 500 companies (2010). http://money.cnn.com/magazines/fortune (Last visited 01/06/10).

Fung, G.P.C., Yu, J.X., Lu, H. (2002). Discriminative category matching: Efficient text classification for huge document collections. In ICDM (pp. 187–194).

Galhardas, H., Florescu, D., Shasha, D., Simon, E., Saita, C.A. (2001). Declarative data cleaning: language, model, and algorithms. In VLDB (pp. 371–380).

Grishman, R., Huttunen, S., Yangarber, R. (2002). Information extraction for enhanced access to disease outbreak reports. Journal of Biomedical Informatics, 35(4), 236–246.

Halevy, A.Y. (2001). Answering queries using views: a survey. The VLDB Journal, 10, 270–294.

HSQLDB (2011). http://hsqldb.org/ (Last visited 06/14/11).

Ilyas, I.F., Beskales, G., Soliman, M.A. (2008). A survey of top-k query processing techniques in relational database systems. ACM Computing Surveys, 40(4).

Ipeirotis, P.G., Agichtein, E., Jain, P., Gravano, L. (2006). To search or to crawl?: towards a query optimizer for text-centric tasks. In SIGMOD conference (pp. 265–276).

Jain, A., Doan, A., Gravano, L. (2008). Optimizing sql queries over text databases. In ICDE (pp. 636–645).

Jain, A., Ipeirotis, P.G., Doan, A., Gravano, L. (2009). Join optimization of information extraction output: quality matters! In ICDE (pp. 186–197).

Jain, A., & Pantel, P. (2010). Factrank: random walks on a web of facts. In COLING (pp. 501–509).

Jain, A., & Srivastava, D. (2009). Exploring a few good tuples from text databases. In ICDE (pp. 616–627).

Kasneci, G., Suchanek, F.M., Ramanath, M., Weikum, G. (2008). The YAGO-NAGA approach to knowledge discovery. SIGMOD Record, 37, 4.

Liu, J., Dong, X., Halevy, A.Y. (2006). Answering structured queries on unstructured data. In WebDB.

Löser, A., Hüske, F., Markl, V. (2008). Situational business intelligence. In BIRTE.

Löser, A., Lutter, S., Düssel, P., Markl, V. (2009). Ad-hoc queries over document collections—a case study. In BIRTE (pp. 50–65).

Löser, A., Nagel, C., Pieper, S. (2010). Augmenting tables by self-supervised web search. In BIRTE.

Markl, V., Raman, V., Simmen, D.E., Lohman, G.M., Pirahesh, H. (2004). Robust query processing through progressive optimization. In SIGMOD conference (pp. 659–670).

Naumann, F. (2002). Quality-driven query answering for integrated information systems. Lecture notes in computer science Vol. 2261: Springer.

OpenCalais (2011). www.opencalais.com (Last visited 06/14/11).

Pérez-Martínez, J.M., Llavori, R.B., Cabo, M.J.A., Pedersen, T.B. (2008). Contextualizing data warehouses with documents. Decision Support Systems, 45(1), 77–94.

Riloff, E. (1996). Automatically generating extraction patterns from untagged text. AAAI/IAAI, 2, 1044–1049.

Selinger, P.G., Astrahan, M.M., Chamberlin, D.D., Lorie, R.A., Price, T.G. (1988). Access path selection in a relational database management system. In Proceedings of the 1979 ACM SIGMOD international conference on management of data, 30 May–1 June 1979 (pp. 23–34). Boston, Massachusetts.

Wu, F., & Weld, D.S. (2010). Open information extraction using wikipedia. In ACL (pp. 118–127).

Yu, C., Lakshmanan, L.V.S., Amer-Yahia, S. (2009). It takes variety to make a world: diversification in recommender systems. In EDBT (pp. 368–378).

Title: Beyond search: Retrieving complete tuples from a text-database
Publication date: 1 July 2013
Published in: Information Systems Frontiers / Issue 3/2013
Print ISSN: 1387-3326
Electronic ISSN: 1572-9419
DOI: https://doi.org/10.1007/s10796-012-9403-8

