A relevance model for a data warehouse contextualized with documents

https://doi.org/10.1016/j.ipm.2008.11.001

Abstract

This paper presents a relevance model to rank the facts of a data warehouse that are described in a set of documents retrieved with an information retrieval (IR) query. The model is based on language modeling and relevance modeling techniques. We estimate the relevance of the facts by the probability of finding their dimension values and the query keywords in the documents that are relevant to the query. The model is the core of the so-called contextualized warehouse, a new kind of decision support system that combines structured data sources and document collections. The paper evaluates the relevance model with the Wall Street Journal (WSJ) TREC test subcollection and a self-constructed fact database.

Introduction

For decades, the information retrieval (IR) area has provided users with methods and tools for finding relevant pieces of text in huge document collections. However, until very recently these techniques were implemented apart from databases because of the very different nature of the objects they manage: whereas data is well structured with well-defined semantics, texts are unstructured and require approximate query processing (Baeza-Yates & Ribeiro-Neto, 1999).

Nowadays, corporate information systems need to incorporate internal and external text-based sources (e.g., web documents) into the information processes defined within the organization. For example, decision support systems would greatly benefit from text-rich sources (e.g., financial news and market research reports), as they can help analysts understand the historical trends recorded in corporate data warehouses. Opinion forums and blogs are also valuable text sources for enhancing decision-making processes. Unfortunately, few works in the literature address a true integration of data and document retrieval techniques.

Recent proposals in the field of IR include language modeling (Ponte & Croft, 1998) and relevance modeling (Lavrenko & Croft, 2001). Language modeling represents each document as a language model, and documents are ranked according to the probability of emitting the query keywords from the corresponding language model. Relevance modeling estimates the joint probability of the query’s keywords and the document words over the set of documents deemed relevant for that query. In this paper, we apply the language modeling and relevance modeling approaches to develop a new model that estimates the relevance of the facts stored in a data warehouse with respect to an IR query. These facts are well-structured data tuples, whose meaning is described by a set of documents retrieved with the same IR query from a separate text repository.

The proposed relevance model is the core of the contextualized warehouse described in Pérez, Berlanga, Aramburu, and Pedersen (2008). However, the topic of Pérez et al. (2008) was the multidimensional model of the contextualized warehouse, rather than the relevance model. In the current paper, we describe the relevance model in detail, and we compare it with the relevance-based language model techniques that support it. The paper provides a series of experiments over a well-known IR collection in order to demonstrate that the ranking of facts provided by the model is good enough for helping analysts in their tasks. This evaluation is completely new and has not been previously published anywhere. The review of the language modeling and relevance modeling approaches included in this paper is also an original contribution.

The rest of the paper is organized as follows: Section 2 overviews the contextualized warehouse. Section 3 reviews the language modeling and the relevance modeling IR approaches. Section 4 presents the contextualized warehouse relevance model and Section 5 evaluates it. Finally, Section 6 discusses some conclusions and future lines of work.

Section snippets

The contextualized warehouse

A contextualized warehouse is a new kind of decision support system that allows users to obtain strategic information by combining sources of structured data and documents. Fig. 1 shows the architecture of the contextualized warehouse presented in Pérez et al. (2008). Its three main components are a corporate data warehouse, a document warehouse, and the fact extractor module. Next, we briefly describe these components:

  • (a) The corporate data warehouse integrates data from the organization’s

Language models and relevance-based language models

The work on language modeling estimates a language model mj for each document dj. A language model is a stochastic process that generates documents by emitting words randomly. The documents dj are then ranked according to the probability P(Q|mj) of emitting the query keywords Q from the respective language model mj (Ponte & Croft, 1998).
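The ranking step described above can be illustrated with a minimal sketch. It is not the estimator of Ponte and Croft (1998) nor the one used in this paper: it assumes a unigram model with independent query keywords and linear-interpolation smoothing against the whole collection, and the names `query_log_likelihood`, `rank_documents` and the parameter `lam` are hypothetical.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc_tokens, coll_counts, coll_len, lam=0.5):
    """Return log P(Q | m_j) for one document.

    Assumption (not from the paper): unigram model, independent keywords,
    P(w | m_j) = (1 - lam) * tf(w, d_j) / |d_j| + lam * cf(w) / |C|.
    """
    doc_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    score = 0.0
    for word in query:
        p_doc = doc_counts[word] / doc_len if doc_len else 0.0
        p_coll = coll_counts[word] / coll_len
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            # The keyword never occurs in the collection.
            return float("-inf")
        score += math.log(p)
    return score


def rank_documents(query, docs, lam=0.5):
    """Rank documents (dict: id -> token list) by decreasing log P(Q | m_j)."""
    coll_counts = Counter(w for tokens in docs.values() for w in tokens)
    coll_len = sum(coll_counts.values())
    if coll_len == 0:
        return []
    scored = [
        (doc_id, query_log_likelihood(query, tokens, coll_counts, coll_len, lam))
        for doc_id, tokens in docs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

For instance, with two toy documents, one about oil prices and one about stock markets, the query `["oil", "prices"]` ranks the first document on top because both keywords occur in it, while the second document receives only the smoothed collection probability.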

The calculation of the probability P(Q|mj) differs from model to model. In Song and Croft (1999) the query Q is represented as a sequence of independent

The facts relevance model

In this section, we propose a relevance model to calculate the relevance of a fact with respect to a selected context (i.e., to an IR query). Intuitively, a fact is relevant for the selected context if it is found in a document that is also relevant for this context. We consider a fact to be important in a document if its dimension values are mentioned frequently in the document’s textual content.
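The intuition above can be sketched in code. This is only one plausible reading, not the paper's formula: it assumes that the importance of a fact in a document is the share of dimension-value mentions in that document that belong to the fact, that dimension values are single tokens, and that the relevance of a fact is the average of this importance over the documents, weighted by each document's query probability P(Q|dj). The names `fact_given_doc` and `rank_facts` are hypothetical.

```python
from collections import Counter


def fact_given_doc(fact_values, doc_tokens, all_facts):
    """Assumed P(f | d): share of dimension-value mentions in d owned by f."""
    counts = Counter(doc_tokens)

    def mentions(values):
        return sum(counts[v] for v in values)

    total = sum(mentions(values) for values in all_facts)
    return mentions(fact_values) / total if total else 0.0


def rank_facts(facts, docs, doc_query_prob):
    """Rank facts by sum_d P(f | d) * P(Q | d) / sum_d P(Q | d).

    facts:          dict fact id -> tuple of dimension values (single tokens)
    docs:           dict doc id  -> list of tokens
    doc_query_prob: dict doc id  -> P(Q | d), e.g. from a language model
    """
    norm = sum(doc_query_prob.values())
    if norm == 0:
        return []
    all_facts = list(facts.values())
    scores = {
        fact_id: sum(
            fact_given_doc(values, docs[d], all_facts) * doc_query_prob[d]
            for d in docs
        ) / norm
        for fact_id, values in facts.items()
    }
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
```

Under these assumptions, a fact whose dimension values are mentioned often in the documents most relevant to the query rises to the top of the ranking, and the scores of all facts sum to at most one.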

We assume that each document dj describes a set of facts {fi}; and that the

Experiments and results

This section evaluates the proposed relevance model with the Wall Street Journal (WSJ) TREC test collection (Harman, 1995) and a fact database constructed from the metadata available in the documents. In our experiments, we took a set of example information requests (called topics in TREC), determined for each topic the fact expected to be most relevant in the response, and analyzed the quality of the ranking of the facts provided by our model.

It is important to emphasize that the

Conclusions

This paper introduces a new relevance model aimed at ranking the structured data (facts) and documents of a contextualized warehouse when the user establishes an analysis context (i.e., runs an IR query). The approach can be summarized as follows. First, we use language modeling formulas (Ponte & Croft, 1998) to rank the documents by the probability of emitting the query keywords from the respective language model. Then, we adapt relevance modeling techniques (Lavrenko & Croft, 2001) to

Juan Manuel Pérez obtained the B.S. degree in Computer Science in 2000, and the Ph.D. degree in 2007, both from Universitat Jaume I, Spain. Currently, he is associate lecturer at this university. He is author of a number of papers in international journals and conferences such as Decision Support Systems, IEEE Transactions on Knowledge and Data Engineering, DEXA, ECIR, ICDE, DOLAP, etc. His research interests are information retrieval, multidimensional databases, and web-based technologies.

References (20)

  • Baeza-Yates, R. A., et al. (1999). Modern information retrieval.
  • Codd, E. F. (1993). Providing OLAP to user-analysts: An IT...
  • Danger, R., Berlanga, R., & Ruiz-Shulcloper, J. (2004). CRISOL: An approach for automatically populating semantic web...
  • Eguchi, K., & Lavrenko, V. (2006). Sentiment retrieval using generative models. In Proceedings of the 2006 conference...
  • Harman, D. K. (1995). Overview of the third retrieval conference (TREC-3). In D. K. Harman (Ed.), Overview of the third...
  • Inmon, W. H. (2005). Building the data warehouse.
  • Lavrenko, V., Allan, J., DeGuzman, E., LaFlamme, D., Pollard, V. & Thomas, S. (2002). Relevance models for topic...
  • Lavrenko, V., & Croft, W. B. (2001). Relevance-based language models. In Proceedings of the 24th annual international...
  • Lavrenko, V., Feng, S. L., & Manmatha, R. (2003). Statistical models for automatic video annotation and retrieval. In...
  • Liu, B., Hu, M., & Cheng, J. (2005). Opinion observer: Analyzing and comparing opinions on the web. In Proceedings of...
There are more references available in the full text version of this article.

Rafael Berlanga is an associate professor of Computer Science at Universitat Jaume I, Spain. He received the B.S. degree from Universidad de Valencia in Physics, and the Ph.D. degree in Computer Science in 1996 from the same university. He is author of several articles in international journals, such as Information Processing & Management, Concurrency: Practice and Experience, Applied Intelligence, among others, and numerous communications in international conferences such as DEXA, ECIR, CIARP, etc. His current research interests are knowledge bases, information retrieval, and temporal reasoning.

María José Aramburu is an associate professor of Computer Science at Universitat Jaume I, Spain. She obtained the B.S. degree from Universidad Politécnica de Valencia in Computer Science in 1991, and a Ph.D. from the School of Computer Science of the University of Birmingham (UK) in 1998. She is author of several articles in international journals, such as Information Processing & Management, Concurrency: Practice and Experience, Applied Intelligence, and numerous communications in international conferences such as DEXA, ECIR, etc. Her main research interests include document databases, and their applications.
