Efficient maintenance of basic statistical functions in data warehouses
Introduction
Data warehouses are often huge systems that require a very high level of maintenance. Any reorganization of the business processes or the source systems may require the data warehouse to change. Updates are often needed in these situations, and more resources are typically spent on maintaining a data warehouse than on developing it. Views are computed and stored in the database to allow efficient querying and analysis of the data; these are known as materialized views. To keep them up to date, the materialized views must be maintained in response to changes at the sources. In general, incrementally maintaining a view can be significantly cheaper than recomputing it from scratch, especially if the view is large compared to the size of the changes. The problem of deriving such changes in the views from changes in the base relations is known as the view maintenance problem. Moreover, since statistical functions often play a very important role in online analytical processing (OLAP) in data warehouses, some simple but very meaningful functions are often used to retrieve information. To make statistical analysis operations more effective, it is necessary to keep them up to date, which often requires considerable time and resources. The objective of this study is thus to propose an approach that maintains statistical functions in a data warehouse more efficiently by keeping some supplementary data alongside the warehouse. The auxiliary data are organized in a specific format that makes them very efficient to use in response to changes in the base tables. In other words, when a base table is changed, the data warehouse uses these auxiliary data to maintain the view instead of reaccessing the entire base table, significantly improving the maintenance process.
Since data warehouses often contain huge amounts of data, maintaining them is crucial to keeping performance smooth and constant. [16] suggested an efficiency model for data warehouse operations that comprises two major processes (refresh processing and query production), and concluded that only a few organizations achieve good efficiency in both. While companies may succeed in implementing multi-terabyte data warehouses, it is still difficult to translate this investment into useful business knowledge or positive business results. Computing and/or materializing a complete data cube is often expensive in both time and space, and difficult to conduct online. Research has proposed different approaches for dealing with this problem of querying OLAP data cubes [4], [14], [19]. However, as changes are made to the data sources, the warehouse views must be updated to reflect them. According to [3], [6], the updated view can either be recomputed from scratch or obtained using incremental maintenance techniques. However, in today's warehouses, the changes to the sources are often not detected and propagated to the warehouse immediately. To deal efficiently with changes to materialized views in relational and deductive database systems, [5] presented a counting algorithm that tracks the number of alternative derivations of each derived tuple in a view. By first deleting a superset of the affected tuples and then rederiving some of them, a recursive view can be maintained incrementally. This method is known as the delete and rederive (DRed) algorithm.
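The counting idea behind [5] can be sketched as follows for a non-recursive join view. The encoding of relations as Counters keyed by tuples and the function names below are illustrative assumptions, not the authors' implementation, and the extra delete-and-rederive step needed for recursive views is omitted:

```python
from collections import Counter

def join_delta(delta_r, s):
    """Derivations contributed by a delta of R joined with S on the first
    attribute (a toy select-project-join view). Multiplicities multiply:
    cr copies of an R-tuple joined with cs copies of an S-tuple yield
    cr * cs derivations of the resulting view tuple."""
    out = Counter()
    for (k, a), cr in delta_r.items():
        for (k2, b), cs in s.items():
            if k == k2:
                out[(k, a, b)] += cr * cs
    return out

def apply_delta(view, delta, sign):
    """Counting maintenance: adjust each view tuple's derivation count
    (+1 for insertions, -1 for deletions); a tuple leaves the view only
    when its count drops to zero."""
    for t, c in delta.items():
        view[t] += sign * c
        if view[t] == 0:
            del view[t]

# Insert (1, 'a') into R against two matching S tuples, then delete it:
R = Counter({(1, 'a'): 1})
S = Counter({(1, 'x'): 2})
view = Counter()
apply_delta(view, join_delta(R, S), +1)   # (1, 'a', 'x') now has count 2
apply_delta(view, join_delta(R, S), -1)   # count falls to 0, tuple removed
```

Because counts are adjusted rather than recomputed, a deletion never forces the view to be rebuilt from the base relations.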
For data warehouses that provide global access, even a short period of downtime may be unacceptable. [1] proposed two algorithms, SWEEP and Nested SWEEP, to deal with this issue; both work without requiring that the data warehouse be quiescent while incorporating the new views. A multi-agent system (MAS) that achieves immediate incremental view maintenance was developed by [21]. [7] proposed an algorithm for efficiently computing updates to views without reaccessing their materializations. [10] proposed a propagation approach to view maintenance tasks. [9] proposed an approach that employs an extendable multidimensional array to maintain the increments of data cubes, with the aim of avoiding recomputation of the entire cube. [13] proposed an efficient algorithm that can maintain a data cube using only a subset of the 2^n delta cuboids.
Incremental maintenance has been studied in [3], [22]. In spite of these positive results, view maintenance is still complicated by the fact that the sources are decoupled from the warehouse, so traditional incremental view maintenance may exhibit anomalies. In this situation, warehouse self-maintainability is desirable: the ability of a warehouse to maintain itself without “help” from the underlying databases, i.e., based only on their reported changes [12]. Auxiliary data are used to update materialized views without accessing the base relations. Research on the construction of auxiliary tables for self-maintainability can be found in [2], [8], [15], [17], [18], [20]. According to [12], warehouse independence with respect to queries is the ability of the warehouse to answer, from its views, queries posed to the underlying sources. Similarly, warehouse independence with respect to updates is the ability of a warehouse to maintain itself based only on reported changes in the underlying sources.
Based on the above discussion, this study proposes an efficient maintenance approach to enhance the performance of a data warehouse. The main objective is to avoid completely reloading the data in the data warehouse (DW) whenever certain statistical functions must be calculated, whether the data are generated by real-time operations or by periodic refreshes. We maintain some supplementary tables (e.g., auxiliary tables, materialized views) in the DW to keep track of each update, so that the values of the required statistical functions, such as MIN, MAX, MEAN, and MEDIAN, can be calculated instantly without a full reload. The proposed incremental maintenance approach is shown, with statistical significance, to be much faster than the traditional one (complete reloading and recomputation), especially when the volume of the original data is much larger than that of the changes. The proposed approach can therefore be employed under a periodic data refresh policy as well as in an OLAP environment. This paper is organized as follows: Section 2 considers the maintenance of a data warehouse, Section 3 describes the development of the proposed approach, Section 4 demonstrates its effectiveness through a numerical experiment, and Section 5 presents the concluding remarks.
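As a minimal illustration of this idea (a sketch only, not the paper's actual auxiliary-table implementation), MEAN and VARIANCE can be maintained from three auxiliary scalars per group: the row count, the running sum, and the running sum of squares. Inserts and deletes then update these scalars directly and never force a rescan of the base table:

```python
class MeanVarView:
    """Materialized MEAN/VARIANCE maintained from auxiliary scalars
    (n, sum, sum of squares) instead of rescanning the base table."""

    def __init__(self, values=()):
        self.n = 0
        self.s = 0.0   # running sum
        self.q = 0.0   # running sum of squares
        for v in values:
            self.insert(v)

    def insert(self, v):
        self.n += 1
        self.s += v
        self.q += v * v

    def delete(self, v):
        self.n -= 1
        self.s -= v
        self.q -= v * v

    def mean(self):
        return self.s / self.n

    def variance(self):
        # Population variance: E[X^2] - (E[X])^2
        return self.q / self.n - (self.s / self.n) ** 2
```

MEDIAN, by contrast, cannot be recovered from any fixed set of such scalars, which is why it requires richer auxiliary data.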
Maintaining a data warehouse
The data in data warehouses take the form of data cubes, which allow data to be modeled and viewed in multiple dimensions. To clarify this complex view of the data, a simple example is provided. Suppose that, in order to respond to changes in the business environment, the board of a sales enterprise decides to develop a decision support system built on a data warehouse. Part of the system is depicted in Fig. 1.
The system is divided into two parts: (1) the
Incremental maintenance of basic statistical functions
As specified in the previous section, statistical functions can be classified as self-maintainable, semi-self-maintainable, or non-self-maintainable. Self-maintainable functions are relatively easy to maintain. However, maintaining semi-self-maintainable and non-self-maintainable statistical functions requires keeping auxiliary data alongside the warehouse.
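To make the distinction concrete, the sketch below (an illustrative assumption, not the paper's implementation) shows why MIN is only semi-self-maintainable: an insertion can be handled from the stored minimum alone, but deleting the current minimum would force a base-table rescan unless auxiliary data is kept. Here a lazily pruned heap of the column's values plays that auxiliary role:

```python
import heapq
from collections import Counter

class MinView:
    """MIN maintained without base-table rescans. Insertions are cheap by
    themselves; deletions are what require the auxiliary heap, since the
    deleted value may be the current minimum."""

    def __init__(self, values=()):
        self.heap = list(values)   # auxiliary data: all column values
        heapq.heapify(self.heap)
        self.deleted = Counter()   # values deleted but not yet pruned

    def insert(self, v):
        heapq.heappush(self.heap, v)

    def delete(self, v):
        self.deleted[v] += 1       # defer removal until v reaches the top

    def min(self):
        # Prune lazily deleted values before reading the minimum.
        while self.heap and self.deleted[self.heap[0]] > 0:
            self.deleted[heapq.heappop(self.heap)] -= 1
        return self.heap[0] if self.heap else None
```

The same structure handles MAX with a negated heap; a fully non-self-maintainable function such as MEDIAN would need the entire value distribution kept as auxiliary data.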
Numerical application
In this section, an experiment is performed to investigate the effectiveness of the proposed approach. The hardware and software specifications, as well as the data used in the experiment, are stated as follows:
- Hardware specifications
  - ASUS i7 3370
  - CPU: Intel® 4-Core (3.40 GHz)
  - RAM: 8 GB
- Software specifications
  - Windows Server 2008 Enterprise Edition
  - Microsoft SQL Server 2008

The data used for the experiment came from the KDD Cup 1999 database, which is the data set used for “The Third International Knowledge Discovery and Data Mining Tools Competition”.
Conclusion
By utilizing some additional auxiliary data, we can achieve significant improvements in maintaining the statistical functions inside a data warehouse. The proposed approach works efficiently with the MEAN, VARIANCE, MIN, and MAX functions. With just thousands of auxiliary tuples, we can reduce the maintenance time from minutes to seconds. As the dynamic business environment requires managers to make a lot of rapid decisions, and as statistical functions play a very important role in the decision
References (22)
- Incremental maintenance of object-oriented data warehouses, Information Sciences (2004)
- Incremental maintenance of aggregate and outerjoin expressions, Information Systems (2006)
- Incremental view maintenance in extended relational database, Information and Software Technology (1995)
- An efficient method for maintaining data cubes incrementally, The Information of the Science (2010)
- Progressive ranking of range aggregates, Data & Knowledge Engineering (2007)
- Making multiple views self-maintainable in a data warehouse, Data & Knowledge Engineering (1999)
- Efficiency evaluation of data warehouse operations, Decision Support Systems (2008)
- Making aggregate views self-maintainable, Data & Knowledge Engineering (2000)
- MSMiner—a developing platform for OLAP, Decision Support Systems (2007)
- Integrating the maintenance and synchronization of data warehouses using a cooperative framework, Information Systems (2002)
- Efficient view maintenance at data warehouses, paper presented at the ACM SIGMOD International Conference on Management of Data
Cited by (7)
- A game theory based framework for materialized view selection in data warehouses, Engineering Applications of Artificial Intelligence (2018)
- Optimized execution method for queries with materialized views: design and implementation, Journal of Intelligent and Fuzzy Systems (2021)
- Introduction to smart maintenance, Studies in Systems, Decision and Control (2018)
- Answering ad-hoc continuous aggregate queries over data streams using Dynamic Prefix Aggregate Tree, Intelligent Data Analysis (2016)
- Graph-Based Framework for Evaluating the Feasibility of Transition to Maintainomics, Studies in Big Data (2015)
- Benchmark for approximate query answering systems, Journal of Database Management (2015)
Yeu-Shiang Huang is currently a professor in the Department of Industrial and Information Management at National Cheng Kung University, Taiwan. He earned both his M.S. and Ph.D. degrees in Industrial Engineering from the University of Wisconsin—Madison, U.S.A. His research interests include operations management, supply chain management, reliability engineering, and decision analysis. Related papers have appeared in such professional journals as Software Testing, Verification and Reliability, Software Quality Journal, Computers and Operations Research, Computers and Industrial Engineering, Expert Systems with Applications, International Journal of Computer Mathematics, Journal of Universal Computer Science, IIE Transactions, Naval Research Logistics, IEEE Transactions on Engineering Management, Reliability Engineering and System Safety, IEEE Transactions on Reliability, Communications in Statistics, and others.
Do Duy is a graduate student in the Department of Industrial and Information Management at National Cheng Kung University, Taiwan.
Chih-Chiang Fang is currently an assistant professor in the Department of Information Management at Shu-Te University, Kaohsiung, Taiwan. He received his Ph.D. degree in the Department of Industrial and Information Management at National Cheng Kung University, Taiwan. His research interests include decision analysis, Bayesian statistical methods, and reliability engineering. Related papers have appeared in such professional journals as Naval Research Logistics, IEEE Transactions on Engineering Management, Software Testing, Verification and Reliability, Computers and Industrial Engineering, and others.