Abstract
Data warehouses collect large quantities of data from distributed sources into a single repository. A typical load to create or maintain a warehouse processes gigabytes of data, takes hours or even days to execute, and involves many complex and user-defined transformations of the data (e.g., finding duplicates, resolving data inconsistencies, and adding unique keys). If the load fails, one option is to "redo" the entire load; a better one is to resume the incomplete load from the point of interruption. Unfortunately, traditional algorithms for resuming the load either impose unacceptable overhead during normal operation or rely on the specifics of the transformations. We develop a resumption algorithm called DR that imposes no overhead and relies only on high-level properties of the transformations. Experiments with commercial software show that DR can yield a ten-fold reduction in resumption time.
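As a rough illustration of the idea, not the paper's actual DR algorithm: if a transformation has the high-level property that each input record maps to at most one output record whose key is derivable from the input alone, a resumed load can skip input records whose keys were already committed before the failure, instead of redoing the whole load. The names `transform`, `resume_load`, and the record layout below are hypothetical.

```python
# Hedged sketch of property-based resumption (assumed map-to-one transform).
# Each input record yields at most one output row with a key derivable from
# the input alone, so already-committed keys can simply be filtered out.

def transform(record):
    """Hypothetical user-defined transformation: input record -> (key, row)."""
    key = record["id"]  # key derivable from the input (assumption)
    return key, {"id": key, "value": record["value"].strip().upper()}

def resume_load(source_records, committed_keys, sink):
    """Re-run the load, skipping work committed before the interruption."""
    loaded = skipped = 0
    for record in source_records:
        key, row = transform(record)
        if key in committed_keys:   # already in the warehouse: skip
            skipped += 1
            continue
        sink.append(row)            # only the remaining work is performed
        loaded += 1
    return loaded, skipped

# Usage: 2 of 4 source records were committed before the crash.
source = [{"id": i, "value": f" v{i} "} for i in range(4)]
sink = []
loaded, skipped = resume_load(source, committed_keys={0, 1}, sink=sink)
# loaded == 2, skipped == 2
```

A full resumption algorithm must also handle transformations without such properties (e.g., aggregations over the whole input), which is where redoing part of the load becomes unavoidable; the filter above only captures the favorable case.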
Index Terms
- Efficient resumption of interrupted warehouse loads
Published in SIGMOD '00: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data.