Efficient maintenance of basic statistical functions in data warehouses
Introduction
Data warehouses are often huge systems that require a very high level of maintenance. Any reorganization of the business processes or the source systems may require the data warehouse to change. Updates are often needed in these situations, and more resources are typically spent on maintaining a data warehouse than on developing it. Views are computed and stored in the database to allow efficient querying and analysis of the data; these are known as materialized views. To keep them up to date, the materialized views must be maintained in response to changes at the sources. In general, incrementally maintaining a view can be significantly cheaper than recomputing it from scratch, especially if the view is large compared to the size of the changes. The problem of deriving such changes in the views from changes in the base relations is known as the view maintenance problem. Moreover, since statistical functions often play a very important role in online analytical processing (OLAP) in data warehouses, some simple but very meaningful functions are often used to retrieve information. To make statistical analysis operations more effective, it is necessary to keep them up to date, which often requires considerable time and resources. The objective of this study is thus to propose an approach that maintains statistical functions in a data warehouse more efficiently by keeping some supplementary data alongside the warehouse. The auxiliary data are organized in a specific format that makes them very efficient to use in response to changes in the base tables. In other words, when a base table is changed, the data warehouse uses these auxiliary data to maintain the view instead of reaccessing the entire base table, significantly improving the maintenance process.
Since data warehouses often contain huge amounts of data, maintaining them is crucial to keeping performance smooth and constant. [16] suggested an efficiency model for data warehouse operations that comprises two major processes (refresh processing and query production), and concluded that only a few organizations achieve good efficiency in both. While companies may succeed in implementing multi-terabyte data warehouses, it is still difficult to translate this investment into useful business knowledge or positive business results. Computing and/or materializing a complete data cube is often expensive in both time and space, and difficult to conduct online. Research has proposed different approaches for dealing with this problem of querying OLAP data cubes [4], [14], [19]. However, as changes are made to the data sources, the warehouse views must be updated to reflect them. According to [3], [6], the updated view can either be recomputed from scratch or obtained using incremental maintenance techniques. However, in today's warehouses, the changes to the sources are often not detected and propagated to the warehouse immediately. To deal efficiently with changes to materialized views in relational and deductive database systems, [5] presented a counting algorithm that tracks the number of alternative derivations of each derived tuple in a view. By first deleting a superset of the affected tuples and then rederiving some of them, a recursive view can be maintained incrementally. This method is known as the delete and rederive (DRed) algorithm.
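The counting idea behind [5] can be sketched as follows for a non-recursive join view. The encoding of relations as Counters keyed by tuples and the function names below are illustrative assumptions, not the authors' implementation, and the extra delete-and-rederive step needed for recursive views is omitted:

```python
from collections import Counter

def join_delta(delta_r, s):
    """Derivations contributed by a delta of R joined with S on the first
    attribute (a toy select-project-join view). Multiplicities multiply:
    cr copies of an R-tuple joined with cs copies of an S-tuple yield
    cr * cs derivations of the resulting view tuple."""
    out = Counter()
    for (k, a), cr in delta_r.items():
        for (k2, b), cs in s.items():
            if k == k2:
                out[(k, a, b)] += cr * cs
    return out

def apply_delta(view, delta, sign):
    """Counting maintenance: adjust each view tuple's derivation count
    (+1 for insertions, -1 for deletions); a tuple leaves the view only
    when its count drops to zero."""
    for t, c in delta.items():
        view[t] += sign * c
        if view[t] == 0:
            del view[t]

# Insert (1, 'a') into R against two matching S tuples, then delete it:
R = Counter({(1, 'a'): 1})
S = Counter({(1, 'x'): 2})
view = Counter()
apply_delta(view, join_delta(R, S), +1)   # (1, 'a', 'x') now has count 2
apply_delta(view, join_delta(R, S), -1)   # count falls to 0, tuple removed
```

Because counts are adjusted rather than recomputed, a deletion never forces the view to be rebuilt from the base relations.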
For data warehouses that provide global access, even a short period of downtime may be unacceptable. [1] proposed two algorithms, SWEEP and Nested SWEEP, to deal with this issue; both work without requiring that the data warehouse be quiescent while incorporating the new views. A multi-agent system (MAS) that achieves immediate incremental view maintenance was developed by [21]. [7] proposed an algorithm for efficiently computing updates to views without reaccessing their materializations. [10] proposed a propagation approach to view maintenance tasks. [9] proposed an approach that employs an extendable multidimensional array to maintain the increments of data cubes, with the aim of avoiding recomputation of the entire cube. [13] proposed an efficient algorithm that can maintain a data cube using only a subset of the 2^n delta cuboids.
Incremental maintenance has been studied in [3], [22]. In spite of these positive results, view maintenance is still complicated by the fact that the sources are decoupled from the warehouse, so traditional incremental view maintenance may exhibit anomalies. In this situation, warehouse self-maintainability is desirable: the ability of a warehouse to maintain itself without “help” from the underlying databases, i.e., based only on their reported changes [12]. Auxiliary data are used to update materialized views without accessing the base relations. Research on the construction of auxiliary tables for self-maintainability can be found in [2], [8], [15], [17], [18], [20]. According to [12], warehouse independence with respect to queries is the ability of the warehouse to answer, from its views, queries posed to the underlying sources. Similarly, warehouse independence with respect to updates is the ability of a warehouse to maintain itself based only on reported changes in the underlying sources.
Based on the above discussion, this study proposes an efficient maintenance approach to enhance the performance of a data warehouse. The main objective is to avoid completely reloading the data in the data warehouse (DW) whenever certain statistical functions must be calculated, whether the data are generated by real-time operations or by periodic refreshes. We maintain some supplementary tables (e.g., auxiliary tables, materialized views) in the DW to keep track of each update, so that the values of the required statistical functions, such as MIN, MAX, MEAN, and MEDIAN, can be calculated instantly without a full reload. The proposed incremental maintenance approach is shown, with statistical significance, to be much faster than the traditional one (complete reloading and recomputation), especially when the volume of the original data is much larger than that of the changes. The proposed approach can therefore be employed under a periodic data refresh policy as well as in an OLAP environment. This paper is organized as follows: Section 2 considers the maintenance of a data warehouse, Section 3 describes the development of the proposed approach, Section 4 demonstrates its effectiveness through a numerical experiment, and Section 5 presents the concluding remarks.
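As a minimal illustration of this idea (a sketch only, not the paper's actual auxiliary-table implementation), MEAN and VARIANCE can be maintained from three auxiliary scalars per group: the row count, the running sum, and the running sum of squares. Inserts and deletes then update these scalars directly and never force a rescan of the base table:

```python
class MeanVarView:
    """Materialized MEAN/VARIANCE maintained from auxiliary scalars
    (n, sum, sum of squares) instead of rescanning the base table."""

    def __init__(self, values=()):
        self.n = 0
        self.s = 0.0   # running sum
        self.q = 0.0   # running sum of squares
        for v in values:
            self.insert(v)

    def insert(self, v):
        self.n += 1
        self.s += v
        self.q += v * v

    def delete(self, v):
        self.n -= 1
        self.s -= v
        self.q -= v * v

    def mean(self):
        return self.s / self.n

    def variance(self):
        # Population variance: E[X^2] - (E[X])^2
        return self.q / self.n - (self.s / self.n) ** 2
```

MEDIAN, by contrast, cannot be recovered from any fixed set of such scalars, which is why it requires richer auxiliary data.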
Maintaining a data warehouse
The data in data warehouses take the form of data cubes, which allow data to be modeled and viewed in multiple dimensions. To clarify this complex view of the data, a simple example is provided. Suppose that, in order to respond to changes in the business environment, the board of a sales enterprise decides to develop a decision support system built on a data warehouse. Part of the system is depicted in Fig. 1.
The system is divided into two parts: (1) the
Incremental maintenance of basic statistical functions
As specified in the previous section, statistical functions can be classified as self-maintainable, semi-self-maintainable, or non-self-maintainable. Self-maintainable functions are relatively easy to maintain. However, maintaining semi-self-maintainable and non-self-maintainable statistical functions requires keeping auxiliary data alongside the warehouse.
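To make the distinction concrete, the sketch below (an illustrative assumption, not the paper's implementation) shows why MIN is only semi-self-maintainable: an insertion can be handled from the stored minimum alone, but deleting the current minimum would force a base-table rescan unless auxiliary data is kept. Here a lazily pruned heap of the column's values plays that auxiliary role:

```python
import heapq
from collections import Counter

class MinView:
    """MIN maintained without base-table rescans. Insertions are cheap by
    themselves; deletions are what require the auxiliary heap, since the
    deleted value may be the current minimum."""

    def __init__(self, values=()):
        self.heap = list(values)   # auxiliary data: all column values
        heapq.heapify(self.heap)
        self.deleted = Counter()   # values deleted but not yet pruned

    def insert(self, v):
        heapq.heappush(self.heap, v)

    def delete(self, v):
        self.deleted[v] += 1       # defer removal until v reaches the top

    def min(self):
        # Prune lazily deleted values before reading the minimum.
        while self.heap and self.deleted[self.heap[0]] > 0:
            self.deleted[heapq.heappop(self.heap)] -= 1
        return self.heap[0] if self.heap else None
```

The same structure handles MAX with a negated heap; a fully non-self-maintainable function such as MEDIAN would need the entire value distribution kept as auxiliary data.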
Numerical application
In this section, an experiment is performed to investigate the effectiveness of the proposed approach. The hardware and software specifications, as well as the data used in the experiment, are stated as follows:
- Hardware specifications
  - ASUS i7 3370
  - CPU: Intel® 4-Core (3.40 GHz)
  - RAM: 8 GB
- Software specifications
  - Windows Server 2008 Enterprise Edition
  - Microsoft SQL Server 2008

The data used for the experiment came from the KDD Cup 1999 database, which is the data set used for “The Third International Knowledge Discovery and Data Mining Tools Competition”.
Conclusion
By utilizing some additional auxiliary data, we can achieve significant improvements in maintaining the statistical functions inside a data warehouse. The proposed approach works efficiently with the MEAN, VARIANCE, MIN, and MAX functions. With just thousands of auxiliary tuples, we can reduce the maintenance time from minutes to seconds. As the dynamic business environment requires managers to make a lot of rapid decisions, and as statistical functions play a very important role in the decision
References (22)
- Incremental maintenance of object-oriented data warehouses, Information Sciences (2004)
- Incremental maintenance of aggregate and outerjoin expressions, Information Systems (2006)
- Incremental view maintenance in extended relational database, Information and Software Technology (1995)
- An efficient method for maintaining data cubes incrementally, The Information of the Science (2010)
- Progressive ranking of range aggregates, Data & Knowledge Engineering (2007)
- Making multiple views self-maintainable in a data warehouse, Data & Knowledge Engineering (1999)
- Efficiency evaluation of data warehouse operations, Decision Support Systems (2008)
- Making aggregate views self-maintainable, Data & Knowledge Engineering (2000)
- MSMiner—a developing platform for OLAP, Decision Support Systems (2007)
- Integrating the maintenance and synchronization of data warehouses using a cooperative framework, Information Systems (2002)
- Efficient view maintenance at data warehouses, paper presented at the ACM SIGMOD International Conference on Management of Data
Cited by (7)
- A game theory based framework for materialized view selection in data warehouses, Engineering Applications of Artificial Intelligence (2018)
- Optimized execution method for queries with materialized views: design and implementation, Journal of Intelligent and Fuzzy Systems (2021)
- Introduction to smart maintenance, Studies in Systems, Decision and Control (2018)
- Answering ad-hoc continuous aggregate queries over data streams using Dynamic Prefix Aggregate Tree, Intelligent Data Analysis (2016)
- Graph-Based Framework for Evaluating the Feasibility of Transition to Maintainomics, Studies in Big Data (2015)
- Benchmark for approximate query answering systems, Journal of Database Management (2015)
Yeu-Shiang Huang is currently a professor in the Department of Industrial and Information Management at National Cheng Kung University, Taiwan. He earned both his M.S. and Ph.D. degrees in Industrial Engineering from the University of Wisconsin—Madison, U.S.A. His research interests include operations management, supply chain management, reliability engineering, and decision analysis. Related papers have appeared in such professional journals as Software Testing, Verification and Reliability, Software Quality Journal, Computers and Operations Research, Computers and Industrial Engineering, Expert Systems with Applications, International Journal of Computer Mathematics, Journal of Universal Computer Science, IIE Transactions, Naval Research Logistics, IEEE Transactions on Engineering Management, Reliability Engineering and System Safety, IEEE Transactions on Reliability, Communications in Statistics, and others.
Do Duy is a graduate student in the Department of Industrial and Information Management at National Cheng Kung University, Taiwan.
Chih-Chiang Fang is currently an assistant professor in the Department of Information Management at Shu-Te University, Kaohsiung, Taiwan. He received his Ph.D. degree in the Department of Industrial and Information Management at National Cheng Kung University, Taiwan. His research interests include decision analysis, Bayesian statistical methods, and reliability engineering. Related papers have appeared in such professional journals as Naval Research Logistics, IEEE Transactions on Engineering Management, Software Testing, Verification and Reliability, Computers and Industrial Engineering, and others.