Skip to main content

2014 | Buch

Data Warehouse Systems

Design and Implementation

verfasst von: Alejandro Vaisman, Esteban Zimányi

Verlag: Springer Berlin Heidelberg

Buchreihe : Data-Centric Systems and Applications

insite
SUCHEN

Über dieses Buch

With this textbook, Vaisman and Zimányi deliver excellent coverage of data warehousing and business intelligence technologies ranging from the most basic principles to recent findings and applications. To this end, their work is structured into three parts. Part I describes “Fundamental Concepts” including multi-dimensional models; conceptual and logical data warehouse design and MDX and SQL/OLAP. Subsequently, Part II details “Implementation and Deployment,” which includes physical data warehouse design; data extraction, transformation, and loading (ETL) and data analytics. Lastly, Part III covers “Advanced Topics” such as spatial data warehouses; trajectory data warehouses; semantic technologies in data warehouses and novel technologies like Map Reduce, column-store databases and in-memory databases.

As a key characteristic of the book, most of the topics are presented and illustrated using application tools. Specifically, a case study based on the well-known Northwind database illustrates how the concepts presented in the book can be implemented using Microsoft Analysis Services and Pentaho Business Analytics. All chapters are summarized using review questions and exercises to support comprehensive student learning. Supplemental material to assist instructors using this book as a course text is available at http://cs.ulb.ac.be/DWSDIbook/, including electronic versions of the figures, solutions to all exercises, and a set of slides accompanying each chapter.

Overall, students, practitioners and researchers alike will find this book the most comprehensive reference work on data warehouses, with key topics described in a clear and educational style.

Inhaltsverzeichnis

Frontmatter

Fundamental Concepts

Frontmatter
Chapter 1. Introduction
Abstract
Organizations today are facing increasingly complex challenges in terms of management and problem solving in order to achieve their operational goals. This situation compels people in those organizations to utilize analysis tools that can better support their decisions. Business intelligence comprises a collection of methodologies, processes, architectures, and technologies that transform raw data into meaningful and useful information for decision making. Business intelligence and decision-support systems provide assistance to managers at various organizational levels for analyzing strategic information. These systems collect vast amounts of data and reduce them to a form that can be used to analyze organizational behavior. This data transformation comprises a set of tasks that take the data from the sources and, through extraction, transformation, integration, and cleansing processes, store the data in a common repository called a data warehouse . Data warehouses have been developed and deployed as an integral part of decision-support systems to provide an infrastructure that enables users to obtain efficient and accurate responses to complex queries.
Alejandro Vaisman, Esteban Zimányi
Chapter 2. Database Concepts
Abstract
This chapter introduces the basic database concepts, covering modeling, design, and implementation aspects. Section 2.1 begins by describing the concepts underlying database systems and the typical four-step process used for designing them, starting with requirements specification, followed by conceptual, logical, and physical design. These steps allow a separation of concerns, where requirements specification gathers the requirements about the application and its environment, conceptual design targets the modeling of these requirements from the perspective of the users, logical design develops an implementation of the application according to a particular database technology, and physical design optimizes the application with respect to a particular implementation platform. Section 2.2 presents the Northwind case study that we will use throughout the book. In Sect. 2.3, we review the entity-relationship model, a popular conceptual model for designing databases. Section 2.4 is devoted to the most used logical model of databases, the relational model. Finally, physical design considerations for databases are covered in Sect. 2.5.
The aim of this chapter is to provide the necessary knowledge to understand the remaining chapters in this book, making it self-contained. However, we do not intend to be comprehensive and refer the interested reader to the many textbooks on the subject.
Alejandro Vaisman, Esteban Zimányi
Chapter 3. Data Warehouse Concepts
Abstract
This chapter introduces the basic concepts of data warehouses. A data warehouse is a particular database targeted toward decision support. It takes data from various operational databases and other data sources and transforms it into new structures that fit better for the task of performing business analysis. Data warehouses are based on a multidimensional model, where data are represented as hypercubes, with dimensions corresponding to the various business perspectives and cube cells containing the measures to be analyzed.
Alejandro Vaisman, Esteban Zimányi
Chapter 4. Conceptual Data Warehouse Design
Abstract
The advantages of using conceptual models for designing databases are well known. Conceptual models facilitate communication between users and designers since they do not require knowledge about specific features of the underlying implementation platform. Further, schemas developed using conceptual models can be mapped to various logical models, such as relational, object-relational, or object-oriented models, thus simplifying responses to changes in the technology used. Moreover, conceptual models facilitate the database maintenance and evolution, since they focus on users’ requirements; as a consequence, they provide better support for subsequent changes in the logical and physical schemas.
Alejandro Vaisman, Esteban Zimányi
Chapter 5. Logical Data Warehouse Design
Abstract
Conceptual models are useful to design database applications since they favor the communication between the stakeholders in a project. However, conceptual models must be translated into logical ones for their implementation on a database management system. In this chapter, we study how the conceptual multidimensional model studied in the previous chapter can be represented in the relational model. We start in Sect. 5.1 by describing the three logical models for data warehouses, namely, relational OLAP (ROLAP), multidimensional OLAP (MOLAP), and hybrid OLAP (HOLAP).
Alejandro Vaisman, Esteban Zimányi
Chapter 6. Querying Data Warehouses
Abstract
Just as SQL is a language for manipulating relational databases, MDX (Multi-Dimensional eXpressions) is a language for defining and querying multidimensional databases. Although at first sight it may appear that MDX resembles SQL, they are significantly different from each other. While SQL operates over tables, attributes, and tuples, MDX works over data cubes, dimensions, hierarchies, and members (at the instance level). MDX is a de facto standard supported by many OLAP tool providers.
Alejandro Vaisman, Esteban Zimányi

Implementation and Deployment

Frontmatter
Chapter 7. Physical Data Warehouse Design
Abstract
The physical design of data warehouses is crucial to ensure adequate query response time. There are typically three common techniques for improving performance in data warehouse systems: materialized views, indexing, and partitioning. A materialized view is a view that is physically stored in a database, which enhances query performance by precalculating costly operations such as joins and aggregations. With respect to indexing, traditional techniques used in OLTP systems are not appropriate for multidimensional data. Thus, alternative indexing mechanisms are used in data warehouses, typically bitmap and join indexes. Finally, partitioning or fragmentation divides the contents of a relation into several files, typically based on a range of values of an attribute.
Alejandro Vaisman, Esteban Zimányi
Chapter 8. Extraction, Transformation, and Loading
Abstract
Extraction, transformation, and loading (ETL) processes are used to extract data from internal and external sources of an organization, transform these data, and load them into a data warehouse. Since ETL processes are complex and costly, it is important to reduce their development and maintenance costs. Modeling ETL processes at a conceptual level is a way to achieve this goal. However, existing ETL tools, like Microsoft Integration Services or Pentaho Data Integration (also known as Kettle), have their own specific language to define ETL processes. Further, there is no agreed-upon conceptual model to specify such processes. In this chapter, we study the design of ETL processes using a conceptual approach. The model we use is based on the Business Process Modeling Notation (BPMN), a de facto standard for specifying business processes. The model provides a set of primitives that cover the requirements of frequently used ETL processes. Since BPMN is already used for specifying business processes, users already familiar with BPMN do not need to learn another language for defining ETL processes. Further, BPMN provides a conceptual and implementation-independent specification of such processes, which hides technical details and allows users and designers to focus on essential characteristics of such processes. Finally, ETL processes expressed in BPMN can be translated into executable specifications for ETL tools.
Alejandro Vaisman, Esteban Zimányi
Chapter 9. Data Analytics: Exploiting the Data Warehouse
Abstract
Analytics can be defined as the discovery and communication of meaningful patterns in data. Organizations apply analytics to their data in order to describe, predict, and improve organizational performance. Analytics uses descriptive and predictive models to gain valuable knowledge from data and uses this insight to guide decision making. Analytics relies on data visualization to communicate insight. We can distinguish several variations of analytics depending on the kind of data to be analyzed. While data analytics copes with traditional structured data, text analytics refers to the analysis of unstructured textual sources such as those found in blogs, social networks, and the like. Web analytics refers to the collection, analysis, and reporting of web data. Finally, visual analytics combines automated analysis techniques with interactive visualizations, providing effective means to interactively explore large and complex data sets for decision making.
Alejandro Vaisman, Esteban Zimányi
Chapter 10. A Method for Data Warehouse Design
Abstract
Even though there is an abundant literature in the area of software development, few publications have been devoted to the development of data warehouses. Most of them are written by practitioners based on their experience in building data warehouses. On the other hand, the scientific community has proposed a variety of approaches, which in general target a specific conceptual model and are too complex to be used in real-world environments. As a consequence, there is still a lack of a methodological framework that could guide developers in the various stages of the data warehouse development process.
Alejandro Vaisman, Esteban Zimányi

Advanced Topics

Frontmatter
Chapter 11. Spatial Data Warehouses
Abstract
It is estimated that about 80% of the data stored in databases has a spatial or location component. Therefore, the location dimension has been widely used in data warehouse and OLAP systems. However, this dimension is usually represented in an alphanumeric, nonspatial manner (i.e., using solely the place name) since these systems are not able to manipulate spatial data. Nevertheless, it is well known that including spatial data in the analysis process can help to reveal patterns that are difficult to discover otherwise. Taking into account the growing demand to incorporate spatial data into the decision-making process, we present in this chapter how data warehouses can be extended with spatial data.
Alejandro Vaisman, Esteban Zimányi
Chapter 12. Trajectory Data Warehouses
Abstract
The previous chapter focused on the analysis of the spatial features of static objects such as stores, cities, or states, where by static we mean that the spatial features of these objects do not change (or change exceptionally) across time. However, there is a wide range of applications that require the analysis of the so-called moving objects, that is, objects that continuously change their position in space and time. This is called mobility data analysis. The interest in mobility data analysis has expanded dramatically with the availability of embedded positioning devices like GPS. With these devices, traffic data, for example, can be captured as a collection of sequences of positioning signals transmitted by the cars’ GPS along their itineraries. Since such sequences can be very long, they are often processed by dividing them in segments.
Alejandro Vaisman, Esteban Zimányi
Chapter 13. New Data Warehouse Technologies
Abstract
Big data refers to large collections of data that may be unstructured or may grow so large and at such a high pace that it is difficult to manage them with standard database systems or analysis tools. Examples of big data include web logs, radio-frequency identification tags, sensor networks, and social networks, among other ones. It has been reported as of the time of writing this book that 7 and 10 terabytes of data are added and processed, respectively, by Twitter and Facebook every day. Approximately 80% of these data are unstructured, and 90% of them have been created in the last 2 years. Management and analysis of these massive amounts of data demand new solutions that go beyond the traditional processes or software tools. All of these have great implications on the way data warehousing practice is going to be performed in the future. For instance, big data analytics requires in many cases the data latency (the time elapsed between the moment some data are collected and the action based on such data is taken) to be dramatically reduced. Thus, near real-time data management techniques must be developed. Also, external data sources like the semantic web may need to be queried.
Alejandro Vaisman, Esteban Zimányi
Chapter 14. Data Warehouses and the Semantic Web
Abstract
The availability of enormous amounts of data from many different domains is producing a shift in the way data warehousing practices are being carried out. Massive-scale data sources are becoming common, posing new challenges to data warehouse practitioners and researchers. The semantic web, where large amounts of data are being stored daily, is a promising scenario for data analysis in a near future. As large repositories of semantically annotated data become available, new opportunities for enhancing current decision-support systems will appear. In this scenario, two approaches are clearly identified. One focuses on automating multidimensional design, using semantic web artifacts, for example, existing ontologies. In this approach, data warehouses are (semi)automatically designed using available metadata and then populated with semantic web data. The other approach aims at analyzing large amounts of semantic web data using OLAP tools. In this chapter, we tackle the latter approach, which requires the definition of a precise vocabulary allowing to represent OLAP data on the semantic web. Over this vocabulary, multidimensional models and OLAP operations for the semantic web can be defined. Currently, there are two proposals in this direction. On the one hand, the data cube vocabulary (also denoted QB) follows statistical data models. On the other hand, the QB4OLAP vocabulary follows closely the classic multidimensional models for OLAP studied in this book. On the other hand, the QB4OLAP vocabulary follows 20 closely the classic multidimensional models for OLAP studied in this book.
Alejandro Vaisman, Esteban Zimányi
Chapter 15. Conclusion
Abstract
In this book, we have provided an in-depth coverage of the most relevant topics in data warehouse design and implementation. Even though in Chaps. 11–14 we covered advanced and very recent developments, there are many other important ones that have been consciously left out for space reasons in favor of mature technologies. We conclude this book with some brief comments on these topics, which we believe will become increasingly relevant in the near future. We refer to a recent book [144] where further perspectives on business intelligence can be found.
Alejandro Vaisman, Esteban Zimányi
Backmatter
Metadaten
Titel
Data Warehouse Systems
verfasst von
Alejandro Vaisman
Esteban Zimányi
Copyright-Jahr
2014
Verlag
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-54655-6
Print ISBN
978-3-642-54654-9
DOI
https://doi.org/10.1007/978-3-642-54655-6