A relevance model for a data warehouse contextualized with documents

doi:10.1016/j.ipm.2008.11.001

Information Processing & Management

Volume 45, Issue 3, May 2009, Pages 356-367

https://doi.org/10.1016/j.ipm.2008.11.001

Abstract

This paper presents a relevance model to rank the facts of a data warehouse that are described in a set of documents retrieved with an information retrieval (IR) query. The model is based on language modeling and relevance modeling techniques. We estimate the relevance of the facts by the probability of finding their dimension values and the query keywords in the documents that are relevant to the query. The model is the core of the so-called contextualized warehouse, which is a new kind of decision support system that combines structured data sources and document collections. The paper evaluates the relevance model with the Wall Street Journal (WSJ) TREC test subcollection and a self-constructed fact database.

Introduction

For decades, the information retrieval (IR) area has provided users with methods and tools for searching for interesting pieces of text in huge document collections. However, until very recently these techniques have been implemented separately from databases because of the very different nature of the objects they manage: whereas data is well structured with well-defined semantics, texts are unstructured and require approximate query processing (Baeza-Yates & Ribeiro-Neto, 1999).

Nowadays, corporate information systems need to incorporate internal and external text-based sources (e.g., web documents) into the information processes defined within the organization. For example, decision support systems would greatly benefit from text-rich sources (e.g., financial news and market research reports), as they can help analysts understand the historical trends recorded in corporate data warehouses. Opinion forums and blogs are also valuable text sources that can enhance decision-making processes. Unfortunately, few works in the literature address a true integration of data and document retrieval techniques.

Recent proposals in the field of IR include language modeling (Ponte & Croft, 1998) and relevance modeling (Lavrenko & Croft, 2001). Language modeling represents each document as a language model. Thus, documents are ranked according to the probability of emitting the query keywords from the corresponding language model. Relevance modeling estimates the joint probability of the query’s keywords and the document words over the set of documents deemed relevant for that query. In this paper, we apply the language modeling and relevance modeling approaches to develop a new model that estimates the relevance of the facts stored in a data warehouse with respect to an IR query. These facts are well-structured data tuples, whose meaning is described by a set of documents retrieved with the same IR query from a separate text repository.

The proposed relevance model is the core of the contextualized warehouse described in Pérez, Berlanga, Aramburu, and Pedersen (2008). However, the topic of Pérez et al. (2008) was the multidimensional model of the contextualized warehouse, rather than the relevance model. In the current paper, we describe the relevance model in detail, and we compare it with the relevance-based language model techniques that support it. The paper provides a series of experiments over a well-known IR collection in order to demonstrate that the ranking of facts provided by the model is good enough for helping analysts in their tasks. This evaluation is completely new and has not been previously published anywhere. The review of the language modeling and relevance modeling approaches included in this paper is also an original contribution.

The rest of the paper is organized as follows: Section 2 overviews the contextualized warehouse. Section 3 reviews the language modeling and the relevance modeling IR approaches. Section 4 presents the contextualized warehouse relevance model and Section 5 evaluates it. Finally, Section 6 discusses some conclusions and future lines of work.

Section snippets

The contextualized warehouse

A contextualized warehouse is a new kind of decision support system that allows users to obtain strategic information by combining sources of structured data and documents. Fig. 1 shows the architecture of the contextualized warehouse presented in Pérez et al. (2008). Its three main components are a corporate data warehouse, a document warehouse, and the fact extractor module. Next, we briefly describe these components:

(a) The corporate data warehouse integrates data from the organization’s

Language models and relevance-based language models

The work on language modeling estimates a language model $m_{j}$ for each document $d_{j}$. A language model is a stochastic process that generates documents by emitting words randomly. The documents $d_{j}$ are then ranked according to the probability $P(Q | m_{j})$ of emitting the query keywords $Q$ from the respective language model $m_{j}$ (Ponte & Croft, 1998).
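The ranking idea above can be illustrated with a minimal Python sketch. It is not the estimator of Ponte & Croft (1998): it assumes a plain unigram model with Jelinek-Mercer smoothing against the collection, and the function names, the `lam` parameter, and the toy documents are all hypothetical.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc_tokens, collection_counts, collection_len, lam=0.5):
    """Return log P(Q | m_j) for one document.

    Assumption: m_j is a unigram model smoothed with the collection model
    (Jelinek-Mercer); `lam` is the weight given to the collection model.
    """
    tf = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    score = 0.0
    for q in query:
        p_doc = tf[q] / doc_len if doc_len else 0.0
        p_col = collection_counts[q] / collection_len if collection_len else 0.0
        p = (1.0 - lam) * p_doc + lam * p_col
        if p == 0.0:
            # Keyword unseen in the whole collection: the query cannot be emitted.
            return float("-inf")
        score += math.log(p)
    return score


def rank_documents(query, docs, lam=0.5):
    """Rank documents by decreasing query log-likelihood."""
    collection_counts = Counter(t for tokens in docs.values() for t in tokens)
    collection_len = sum(collection_counts.values())
    scored = {
        doc_id: query_log_likelihood(query, tokens, collection_counts, collection_len, lam)
        for doc_id, tokens in docs.items()
    }
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy collection, not taken from the WSJ data.
docs = {
    "d1": "oil prices rise as crude supply falls".split(),
    "d2": "bank profits rise on interest income".split(),
    "d3": "oil company reports oil export growth".split(),
}
ranking = rank_documents(["oil", "prices"], docs)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

On this toy collection, `d1` ranks first because it contains both keywords, `d3` second because it mentions only "oil", and `d2` last because it receives probability mass only from the smoothing term.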

The calculation of the probability $P (Q | m_{j})$ differs from model to model. In Song and Croft (1999) the query Q is represented as a sequence of independent

The facts relevance model

In this section, we propose a relevance model to calculate the relevance of a fact with respect to a selected context (i.e., to an IR query). Intuitively, a fact will be relevant for the selected context if the fact is found in a document that is also relevant for this context. We will consider that a fact is important in a document if its dimension values are mentioned frequently in the document's textual content.
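The intuition above can be sketched in Python under strong simplifying assumptions. The sketch does not reproduce the paper's formulas: it assumes that the importance of a fact in a document is the share of the document's dimension-value mentions that belong to that fact, and that each document already carries a normalized relevance weight for the query. All names (`fact_weight_in_doc`, `rank_facts`, `doc_relevance`) and the toy data are hypothetical.

```python
from collections import Counter


def fact_weight_in_doc(fact_values, doc_tokens, known_values):
    """Share of the document's dimension-value mentions that belong to the fact.

    Assumption: a "mention" is an exact token match against a dimension value.
    """
    mentions = Counter(t for t in doc_tokens if t in known_values)
    total = sum(mentions.values())
    if total == 0:
        return 0.0
    return sum(mentions[v] for v in set(fact_values)) / total


def rank_facts(facts, docs, doc_relevance):
    """Score each fact by summing, over documents, the document's relevance
    to the query times the fact's weight in that document."""
    known_values = {v for values in facts.values() for v in values}
    scores = {
        fact_id: sum(
            doc_relevance[doc_id] * fact_weight_in_doc(values, tokens, known_values)
            for doc_id, tokens in docs.items()
        )
        for fact_id, values in facts.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical facts, each described by its dimension values.
facts = {
    "f1": ("oil", "kuwait"),
    "f2": ("banking", "japan"),
}
# Hypothetical documents and query-relevance weights (assumed to sum to 1).
fact_docs = {
    "d1": "kuwait oil exports fall as oil prices rise".split(),
    "d2": "japan banking sector and kuwait oil deal".split(),
}
doc_relevance = {"d1": 0.7, "d2": 0.3}

fact_ranking = rank_facts(facts, fact_docs, doc_relevance)
for fact_id, score in fact_ranking:
    print(fact_id, round(score, 3))
```

In this toy setting, `f1` scores 0.85 and `f2` scores 0.15: `f1` dominates the highly relevant document `d1` and shares the less relevant document `d2` equally with `f2`.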

We assume that each document $d_{j}$ describes a set of facts $\{f_{i}\}$; and that the

Experiments and results

This section evaluates the proposed relevance model with the Wall Street Journal (WSJ) TREC test collection (Harman, 1995) and a fact database constructed from the metadata available in the documents. In our experiments, we took a set of example information requests (called topics in TREC), determined for each topic the fact expected to be most relevant in the result, and analyzed the quality of the ranking of the facts provided by our model.

It is important to emphasize that the

Conclusions

This paper introduces a new relevance model aimed at ranking the structured data (facts) and documents of a contextualized warehouse when the user establishes an analysis context (i.e., runs an IR query). The approach can be summarized as follows. First, we use language modeling formulas (Ponte & Croft, 1998) to rank the documents by the probability of emitting the query keywords from the respective language model. Then, we adapt relevance modeling techniques (Lavrenko & Croft, 2001) to

Juan Manuel Pérez obtained the B.S. degree in Computer Science in 2000, and the Ph.D. degree in 2007, both from Universitat Jaume I, Spain. Currently, he is an associate lecturer at this university. He is the author of a number of papers in international journals and conferences such as Decision Support Systems, IEEE Transactions on Knowledge and Data Engineering, DEXA, ECIR, ICDE, DOLAP, etc. His research interests are information retrieval, multidimensional databases, and web-based technologies.

References (20)

Baeza-Yates, R. A., et al. (1999). Modern information retrieval.
Codd, E. F. (1993). Providing OLAP to user-analysts: An IT...
Danger, R., Berlanga, R., & Ruiz-Shulcloper, J. (2004). CRISOL: An approach for automatically populating semantic web...
Eguchi, K., & Lavrenko, V. (2006). Sentiment retrieval using generative models. In Proceedings of the 2006 conference...
Harman, D. K. (1995). Overview of the third retrieval conference (TREC-3). In D. K. Harman (Ed.), Overview of the third...
Inmon, W. H. (2005). Building the data warehouse.
Lavrenko, V., Allan, J., DeGuzman, E., LaFlamme, D., Pollard, V., & Thomas, S. (2002). Relevance models for topic...
Lavrenko, V., & Croft, W. B. (2001). Relevance-based language models. In Proceedings of the 24th annual international...
Lavrenko, V., Feng, S. L., & Manmatha, R. (2003). Statistical models for automatic video annotation and retrieval. In...
Liu, B., Hu, M., & Cheng, J. (2005). Opinion observer: Analyzing and comparing opinions on the web. In Proceedings of...

There are more references available in the full text version of this article.

Rafael Berlanga is an associate professor of Computer Science at Universitat Jaume I, Spain. He received the B.S. degree in Physics from Universidad de Valencia, and the Ph.D. degree in Computer Science in 1996 from the same university. He is the author of several articles in international journals, such as Information Processing & Management, Concurrency: Practice and Experience, and Applied Intelligence, among others, and numerous communications in international conferences such as DEXA, ECIR, CIARP, etc. His current research interests are knowledge bases, information retrieval, and temporal reasoning.

María José Aramburu is an associate professor of Computer Science at Universitat Jaume I, Spain. She obtained the B.S. degree in Computer Science from Universidad Politécnica de Valencia in 1991, and a Ph.D. from the School of Computer Science of the University of Birmingham (UK) in 1998. She is the author of several articles in international journals, such as Information Processing & Management, Concurrency: Practice and Experience, and Applied Intelligence, and numerous communications in international conferences such as DEXA, ECIR, etc. Her main research interests include document databases and their applications.
