Decision Support Systems

Volume 57, January 2014, Pages 94-104

Efficient maintenance of basic statistical functions in data warehouses

https://doi.org/10.1016/j.dss.2013.08.003

Highlights

  • Significant improvements in maintaining the statistical functions inside a data warehouse.

  • Reduce the maintenance time from minutes to seconds.

  • Efficiently maintaining basic statistical functions inside a data warehouse contributes to firm performance.

  • Can also be applied to maintain a distributed data warehouse or a data mart.

Abstract

In general, some simple but very meaningful statistical functions are often used to retrieve valuable summary information from corporate databases. However, it is not uncommon for such information to be obtained from computerized information systems that spend a great deal of time processing the large volumes of collected data. In practice, such data are usually stored in a data warehouse, in which a large number of summary tables or materialized aggregate views are built in order to improve system performance. As changes occur, most notably when new transactional data are collected from various data sources, all summary tables in the data warehouse that correspond to the transactional data must be updated accordingly. Since the number of summary tables that need to be maintained is often large, maintaining them efficiently is a critical issue in managing a data warehouse. In this study, an efficient maintenance approach to enhance the performance of a data warehouse is proposed, in which additional auxiliary tables are kept inside the data warehouse in order to improve the maintenance of some statistical functions, such as MIN, MAX, MEAN, and MEDIAN. Finally, a comparative analysis is performed to verify the effectiveness of the proposed method.

Introduction

Data warehouses are often huge systems which require a very high level of maintenance. Any reorganization of the business processes or the source systems may require the data warehouse to change. Updates are often needed in these situations, and more resources are frequently required for the maintenance of a data warehouse than for its development. Views that are computed and stored in a database to allow efficient querying and analysis of the data are known as materialized views. In order to keep these up to date, it is necessary to maintain the materialized views in response to changes at the sources. In general, incrementally maintaining a view can be significantly cheaper than recomputing it from scratch, especially if the size of the view is large compared to the size of the changes. The problem of computing such changes in the views from changes in the base relations is known as the view maintenance problem. Moreover, since statistical functions often play a very important role in online analytical processing (OLAP) in data warehouses, some simple but very meaningful functions are often used to retrieve information. In order to make statistical analysis operations more effective, it is necessary to keep them up to date, which often requires a lot of time and resources. The objective of this study is thus to propose an approach to maintain statistical functions in a data warehouse more efficiently, in which some supplementary data are kept in an auxiliary role. The auxiliary data are organized in a specific format that makes them very efficient in responding to changes in the base tables. In other words, when a base table is changed, the data warehouse uses these auxiliary data to maintain the view instead of reaccessing the entire base table, significantly improving the maintenance process.
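The incremental idea described above can be illustrated with a small sketch (not taken from the paper; the table layout and names are our own assumptions): a materialized SUM/COUNT view is folded forward from a batch of inserts and deletes, with no rescan of the base table.

```python
# Sketch: incrementally maintaining a materialized SUM/COUNT view
# instead of recomputing it from the base table on every change.
base_table = [("north", 120.0), ("south", 80.0), ("north", 45.0)]

# Materialized view: region -> (sum, count), built once from the base table
view = {}
for region, amount in base_table:
    s, c = view.get(region, (0.0, 0))
    view[region] = (s + amount, c + 1)

def apply_delta(view, inserted, deleted):
    """Fold a batch of inserts/deletes into the view without
    touching the base table (the incremental-maintenance idea)."""
    for region, amount in inserted:
        s, c = view.get(region, (0.0, 0))
        view[region] = (s + amount, c + 1)
    for region, amount in deleted:
        s, c = view[region]
        view[region] = (s - amount, c - 1)
    return view

apply_delta(view, inserted=[("south", 60.0)], deleted=[("north", 45.0)])
print(view["south"])  # (140.0, 2)
print(view["north"])  # (120.0, 1)
```

Each batch costs time proportional to the size of the delta, not the size of the base table, which is exactly why incremental maintenance wins when the changes are small relative to the data.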

Since data warehouses often contain huge amounts of data, maintenance is crucial in order to keep performance smooth and constant. [16] suggested an efficiency model for data warehouse operations which contains two major processes (refresh processing and query production), and concluded that only a few organizations can achieve good efficiency for both processes. While companies may be successful in implementing multi-terabyte data warehouses, it is still difficult to translate this investment into useful business knowledge or positive business results. Computing and/or materializing a complete data cube is often expensive, in both time and space, and difficult to conduct online. Research has proposed different approaches for dealing with this problem of querying OLAP data cubes [4], [14], [19]. However, as changes are made to the data sources, the warehouse views must be updated to reflect them. According to [3], [6], the updated view can either be recomputed from scratch or obtained by using incremental maintenance techniques. However, in today's warehouses, changes to the sources are often not detected and propagated to the warehouse immediately. To efficiently deal with changes to materialized views in relational and deductive database systems, [5] presented a counting algorithm that tracks the number of alternative derivations for each derived tuple in a view. By first deleting a superset of the tuples and then rederiving some of them, a recursive view can be maintained incrementally. This method was described as the delete and rederive (DRed) algorithm.
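The counting idea of [5] can be sketched in a few lines (a toy version of our own, not the paper's implementation): each view tuple carries the number of alternative derivations that produce it, and a tuple is dropped only when that count reaches zero.

```python
# Toy sketch of counting-based view maintenance: a tuple derivable
# in several ways survives the deletion of only one derivation.
from collections import Counter

view = Counter()  # tuple -> number of alternative derivations

def insert_derivations(tuples):
    for t in tuples:
        view[t] += 1

def delete_derivations(tuples):
    for t in tuples:
        view[t] -= 1
        if view[t] == 0:   # no derivation left: remove from the view
            del view[t]

# ("a",) is derivable two ways; deleting one derivation keeps it in the view
insert_derivations([("a",), ("a",), ("b",)])
delete_derivations([("a",)])
print(dict(view))  # {('a',): 1, ('b',): 1}
```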

For data warehouses that provide global access, even a short period of downtime may not be acceptable. [1] proposed two algorithms, called SWEEP and Nested SWEEP, to deal with this issue. Both algorithms work without requiring the data warehouse to be quiescent while incorporating the new views. A multi-agent system (MAS) that achieves immediate incremental view maintenance was developed by [21]. [7] proposed an algorithm for efficiently computing updates to views without reaccessing the materialization of the views. [10] proposed a propagation approach to deal with view maintenance tasks. [9] proposed an approach which employs an extendable multidimensional array to maintain the increments of data cubes, with the aim of avoiding recomputation of the entire data cube. [13] proposed an efficient algorithm which can maintain a data cube using only a subset of 2n delta cuboids.

Incremental maintenance has been studied by [3], [22]. In spite of these positive results, view maintenance is still complicated by the fact that the sources are decoupled from the warehouse, so traditional incremental view maintenance may exhibit anomalies. In this situation, warehouse self-maintainability is desirable. This is the ability of a warehouse to maintain itself without “help” from the underlying databases, i.e., based only on reported changes to the underlying databases [12]. Auxiliary data are used in order to update materialized views without accessing the base relations. Research on the construction of auxiliary tables for self-maintainability can be found in [2], [8], [15], [17], [18], [20]. According to [12], warehouse independence with respect to queries is the ability of the warehouse to answer queries posed to the underlying sources from the warehouse views. Similarly, warehouse independence with respect to updates is the ability of a warehouse to maintain itself based only on reported changes in the underlying sources.

Based on the aforementioned discussion, in this study an efficient maintenance approach to enhance the performance of a data warehouse is proposed. The main objective is to avoid completely reloading all the data in the DW when some statistical functions must be calculated, regardless of whether the data are generated from real-time operations or periodic refreshment. In this paper, we maintain some supplementary tables (e.g., auxiliary tables, materialized views) in the DW to keep track of each update, so that the values of the required statistical functions, such as MIN, MAX, MEAN, and MEDIAN, can be calculated instantly without a full reloading task. The proposed incremental maintenance approach is statistically shown to be significantly faster than the traditional one (completely reloading and recomputing), especially when the volume of the original data is much larger than that of the changes. The proposed approach can therefore be employed under either a periodic data refresh policy or in an OLAP environment. This paper is organized as follows: Section 2 considers the maintenance of a data warehouse, Section 3 depicts the development of the proposed approach, Section 4 demonstrates the effectiveness of the proposed approach by performing a numerical experiment, and finally, Section 5 presents the concluding remarks.
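As a minimal sketch of what "instantly calculated without a full reloading task" means under insertions (the structure below is our illustration, not the paper's exact table design): a few running scalars answer MIN, MAX and MEAN in O(1) per new row, with no rescan of the historical data.

```python
# Sketch: small running aggregates kept alongside the warehouse so that
# MIN, MAX and MEAN are answered instantly as new rows arrive.
import math

class RunningStats:
    def __init__(self):
        self.n = 0
        self.total = 0.0
        self.min = math.inf
        self.max = -math.inf

    def insert(self, x):
        # O(1) per inserted row: no reloading of historical data
        self.n += 1
        self.total += x
        self.min = min(self.min, x)
        self.max = max(self.max, x)

    def mean(self):
        return self.total / self.n

stats = RunningStats()
for x in [4.0, 9.0, 1.0, 6.0]:
    stats.insert(x)
print(stats.min, stats.max, stats.mean())  # 1.0 9.0 5.0
```

Note that this simple form only covers insertions; handling deletions for MIN, MAX and MEDIAN is what motivates the auxiliary tables discussed in the following sections.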

Section snippets

Maintaining a data warehouse

The data in data warehouses are in the form of data cubes, which allow data to be modeled and viewed in multiple dimensions. To clarify this complex view of the data, a simple example is provided. For instance, in order to respond to changes in the business environment, the board of an enterprise, which specializes in sales, decides to develop a decision support system with the use of a data warehouse. A part of the system is depicted in Fig. 1.

The system is divided into two parts: (1) the

Incremental maintenance of basic statistical functions

As specified in the previous section, we can classify statistical functions into self-maintainable, semi-self-maintainable, and non-self-maintainable ones. It is relatively easy to maintain self-maintainable functions. However, in order to maintain semi-self-maintainable and non-self-maintainable statistical functions, we need to rely on auxiliary data.
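The classification can be illustrated concretely (the terminology follows the text; the code is our own sketch, not the paper's design): MEAN is self-maintainable under inserts and deletes given (sum, count); MIN is maintainable under inserts alone, since deleting the current minimum cannot be answered from the scalar value; MEDIAN needs the full ordered multiset as auxiliary data.

```python
# Sketch: an auxiliary sorted list lets MEDIAN survive both inserts and
# deletes without rereading the base table.
import bisect

class MedianView:
    def __init__(self, values=()):
        self.data = sorted(values)  # auxiliary data: the ordered multiset

    def insert(self, x):
        bisect.insort(self.data, x)

    def delete(self, x):
        i = bisect.bisect_left(self.data, x)
        del self.data[i]  # assumes x is present in the multiset

    def median(self):
        n = len(self.data)
        mid = self.data[n // 2]
        return mid if n % 2 else (self.data[n // 2 - 1] + mid) / 2

m = MedianView([3, 1, 7, 5])
print(m.median())  # 4.0  (average of 3 and 5)
m.delete(7)
print(m.median())  # 3
m.insert(10)
m.insert(2)
print(m.median())  # 3   (data is now [1, 2, 3, 5, 10])
```

The cost of maintaining the auxiliary multiset is logarithmic per change, which is far cheaper than rescanning the base table when the base data greatly outweigh the changes.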

Numerical application

In this section, an experiment is performed to investigate the effectiveness of the proposed approach. The hardware and software specifications, as well as the data used in the experiment, are stated as follows:

  • Hardware specifications

    • ASUS i7 3370

    • CPU: Intel® 4-Core (3.40 GHz)

    • RAM: 8 GB

  • Software specifications

    • Windows Server 2008 Enterprise Edition

    • Microsoft SQL Server 2008

The data used for the experiment came from the KDD Cup 1999 database, which is the data set used for “The Third International Knowledge Discovery and Data Mining Tools Competition”.

Conclusion

By utilizing some more typical auxiliary data, we can get significant improvements in maintaining the statistical functions inside a data warehouse. The proposed approach works efficiently with the MEAN, VARIANCE, MIN, and MAX functions. With just thousands of auxiliary tuples, we can reduce the maintenance time from minutes to seconds. As the dynamic business environment requires managers to make a lot of rapid decisions, and as statistical functions play a very important role in the decision


References (22)

  • D. Agrawal et al., Efficient view maintenance at data warehouses, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, 1997.

    Yeu-Shiang Huang is currently a professor in the Department of Industrial and Information Management at National Cheng Kung University, Taiwan. He earned both his M.S. and Ph.D. degrees in Industrial Engineering from the University of Wisconsin—Madison, U.S.A. His research interests include operations management, supply chain management, reliability engineering, and decision analysis. Related papers have appeared in such professional journals as Software Testing, Verification and Reliability, Software Quality Journal, Computers and Operations Research, Computers and Industrial Engineering, Expert Systems with Applications, International Journal of Computer Mathematics, Journal of Universal Computer Science, IIE Transactions, Naval Research Logistics, IEEE Transactions on Engineering Management, Reliability Engineering and System Safety, IEEE Transactions on Reliability, Communications in Statistics, and others.

    Do Duy is a graduate student in the Department of Industrial and Information Management at National Cheng Kung University, Taiwan.

    Chih-Chiang Fang is currently an assistant professor in the Department of Information Management at Shu-Te University, Kaohsiung, Taiwan. He received his Ph.D. degree in the Department of Industrial and Information Management at National Cheng Kung University, Taiwan. His research interests include decision analysis, Bayesian statistical methods, and reliability engineering. Related papers have appeared in such professional journals as Naval Research Logistics, IEEE Transactions on Engineering Management, Software Testing Verification and Reliability Computers and Industrial Engineering, and others.
