14.03.2018 | Research Paper Open Access

# EM-OLAP Framework

## Econometric Model Transformation Method for OLAP Design in Intelligence Systems

- Journal: Business & Information Systems Engineering
- Authors: Jan Tyrychtr, Martin Pelikán, Hana Štiková, Ivan Vrana

## 1 Introduction

### 1.1 Problem Statement and Previous Research

### 1.2 Research Question and Methods

### 1.3 Structure of the Article

## 2 Theoretical Background

### 2.1 OLAP

### 2.2 Econometric Models

\(x_{r}\) is the rth exogenous variable, with a value in the period \(t\) of \(x_{rt}\), where the number of exogenous variables is equal to \(k\). Thus, \(r = (1, 2, \ldots, k)\). The time-delayed endogenous variable \(z\) expresses the effects of variables for period \(t\), where \(z = (1, 2, \ldots, t - z)\). \(u_{st}\) is a random variable in the sth equation of explained endogenous variables in period \(t\). \(\beta_{is}\) is a structural parameter in the ith equation of the sth undelayed endogenous variable of the model, and \(\gamma_{ir}\) is a structural parameter in the ith equation of the rth predetermined variable of the model.

- matrix \(B\) contains parameters of the endogenous variables of the model,
- matrix \(\varGamma\) contains parameters of the predetermined variables of the model,
- vector \(y_{t}\) contains endogenous variables of the model,
- vector \(x_{t}\) contains predetermined variables of the model, and
- vector \(u_{t}\) includes stochastic variables of the model.
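In this notation, a linear simultaneous-equation model can be written compactly in matrix form. The block below is only a sketch under an assumed sign convention: each endogenous variable is isolated on the left-hand side of its own equation, so the diagonal of \(B\) is zero. This convention is chosen to agree with equations of the type \(y_{2t} = \beta_{21} y_{1t} + \gamma_{21} x_{1t} + \gamma_{25} x_{5t} + u_{2t}\) used later in the text; the identity matrix \(I\) and the invertibility of \(I - B\) are assumptions of the sketch.

```latex
% Structural form (assumed convention: zero diagonal in B)
y_{t} = B\,y_{t} + \varGamma\,x_{t} + u_{t}

% Collecting the endogenous variables on the left-hand side
(I - B)\,y_{t} = \varGamma\,x_{t} + u_{t}

% Reduced form, assuming (I - B) is invertible
y_{t} = (I - B)^{-1}\left(\varGamma\,x_{t} + u_{t}\right)
```

In the reduced form, every endogenous variable depends only on predetermined and random variables.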

### 2.3 The TEM-CM Method

- Phase 1: Creation of the primary constellation schema
  - Rule 1.1: Creation of a fact table in an empty schema for each endogenous variable from the EM.
  - Rule 1.2: Creation of dimensions of the schema for each exogenous variable from the EM.
  - Rule 1.3: Creation of a time dimension in the schema (if a time variable exists in the EM).
- Phase 2: Creation of relationships in the schema
  - Rule 2.1: If there is a relationship between the exogenous and endogenous variables in the EM, create table-related associations between facts and dimensions in the schema.
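The two phases above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `tem_cm`, the dictionary layout of the result, and the sample variable names are all assumptions made for the example.

```python
def tem_cm(endogenous, exogenous, relations, has_time=True):
    """Sketch of TEM-CM: build a constellation schema from an EM.

    relations is a list of (variable, endogenous_variable) pairs; the
    name "time" may be used to relate the time dimension to a fact table.
    """
    # Rule 1.1: one fact table per endogenous variable.
    facts = {y: [] for y in endogenous}
    # Rule 1.2: one dimension per exogenous variable.
    dimensions = list(exogenous)
    # Rule 1.3: a time dimension if the EM contains a time variable.
    if has_time:
        dimensions.append("time")
    # Rule 2.1: associate each fact table with the dimensions it depends on.
    for x, y in relations:
        if x in dimensions and y in facts and x not in facts[y]:
            facts[y].append(x)
    return {"facts": facts, "dimensions": dimensions}


# Hypothetical two-equation model.
schema = tem_cm(
    endogenous=["y1", "y2"],
    exogenous=["x1", "x5"],
    relations=[("x1", "y1"), ("x1", "y2"), ("x5", "y2"), ("time", "y2")],
)
print(schema["facts"])
```

Because every endogenous variable receives its own fact table, the number of fact tables and foreign keys grows with the number of equations in the model.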

## 3 Research Approach

## 4 Identification Phase

- Econometrist – a scientist or analyst who models economic reality using statistical, mathematical, or economic instruments. He/she is an expert in economics and statistics and uses economic and statistical software. His/her work results in EM equations. He/she does not use the EM-OLAP framework; rather, he/she only formally identifies the economic reality so that econometric systems can be designed. OLAP is not an essential instrument for an econometrist.
- System designer – someone who proposes a system design or architecture. He/she seeks an optimum balance between business needs and technological constraints. Econometric intelligent systems are a rather special area of system design that currently lacks methodical guidelines. Thus, the EM-OLAP framework is designed directly to meet the needs of a system designer, enabling him/her to design econometric systems based on OLAP concepts.
- Analyst – someone who works directly with the created OLAP solution. He/she performs econometric analyses (e.g., analyses of production factors, consumption changes, or unit and marginal costs), creates key performance indicators (KPIs), and develops reports for decision makers (managers).
- Decision maker (manager) – someone who evaluates the econometric analyses and proposes further steps.

- to perform a comparison of multidimensional schemas via measurements of data mart quality and
- to create formal rules for the transformation of an EM.

- design of conceptual and logical schemas of a multidimensional database and
- creation and implementation of the physical design of the OLAP prototype.

Hypothesis H1: A transformation of the EM into the physical schema for OLAP exists.

## 5 Creation of the TEM Method

### 5.1 Proposal of TEM

\(x_{1}\), \(x_{2}\), and \(x_{3}\) represent dimensions. Since the model contains a time variable \(t\), we add the dimension of time to the schema. The fact table is associated with the roll-up relationship for all relevant dimensions, i.e., the variables on the right side of the equation. All notations of the equation represent the measure and thus serve as observed indicators, which will be part of the fact table. Thus, the created conceptual diagram is as shown in Fig. 3.

\(y_{1}\), \(y_{2}\) and \(y_{3}\). Subsequently, we create a dimension in the schema for each exogenous variable in our EM (\(x_{1}\), \(x_{2}\), \(x_{3}\), \(x_{4}\) and \(x_{5}\)). Since the model contains a time variable \(t\), we also create the dimension of time. We create roll-up table associations between fact tables and dimensions. Thus, for example, the equation \(y_{2t} = \beta_{21} y_{1t} + \gamma_{21} x_{1t} + \gamma_{25} x_{5t} + u_{2t}\) indicates that the dimensions \(x_{1}\) and \(x_{5}\) are related to a fact table \(y_{2}\). However, an endogenous variable \(y_{1}\) appears in this second equation. A roll-up of the association between the fact table \(y_{1}\) and the fact table \(y_{2}\) must be created. For the transformation into the logical schema, each dimension is provided with a numerical primary key and associated with the fact table by a foreign key. It is necessary to monitor the measures that will be part of each fact table for each of the three equations. Random variables \(u_{1t} , u_{2t}\) are not illustrated in any conceptual or logical schema. The created logical schema is illustrated in Fig. 5.

### 5.2 Comparison of Multidimensional Schemas

#### 5.2.1 Quantitative Comparison

Variant 1 (TEM-CM) | | Variant 2 (TEM) | |
---|---|---|---|
Measure | Value of measurement | Measure | Value of measurement |
NFT(Sc) | 3 | NFT(Sc) | 1 |
NSDT(Sc) | 2 | NSDT(Sc) | 0 |
Sum | 5 | Sum | 1 |

Variant 1 (TEM-CM) | | Variant 2 (TEM) | |
---|---|---|---|
Measure | Value of measurement | Measure | Value of measurement |
NFT(Sc) | 3 | NFT(Sc) | 1 |
NDT(Sc) | 6 | NDT(Sc) | 6 |
NFK(Sc) | 12 | NFK(Sc) | 6 |
NMFT(Sc) | 14 | NMFT(Sc) | 11 |
Sum | 35 | Sum | 24 |
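The totals in the second table can be reproduced with a few lines of Python. The sketch below only re-adds the published values; the dictionary layout is an assumption, as is the reading that a lower total indicates a structurally simpler schema.

```python
# Metric values copied from the table above (metric names without "(Sc)").
logical_metrics = {
    "Variant 1 (TEM-CM)": {"NFT": 3, "NDT": 6, "NFK": 12, "NMFT": 14},
    "Variant 2 (TEM)": {"NFT": 1, "NDT": 6, "NFK": 6, "NMFT": 11},
}

# Sum the metric values of each variant.
totals = {name: sum(values.values()) for name, values in logical_metrics.items()}

# Assumption: the variant with the lowest total is the simpler schema.
simplest = min(totals, key=totals.get)

print(totals)    # {'Variant 1 (TEM-CM)': 35, 'Variant 2 (TEM)': 24}
print(simplest)  # Variant 2 (TEM)
```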

## 6 The TEM Method

### 6.1 Formal Representation

- \(Y = \{y_{s}\} \cup \{y_{st}\}\) is a finite set of endogenous variables,
- \(X = \{x_{r}\} \cup \{x_{rt}\}\) is a finite set of exogenous variables, and
- \(Rel \subseteq (X \times Y) \cup (Y \times Y)\) is a set of structural relations in the EM.

- \(Ent\) is a non-empty finite set of entities in the schema,
- \(Key\) is a finite non-empty set of keys in the schema,
- \(Att\) is a finite non-empty set of attributes in the schema,
- \(Fact \subseteq Ent\) is a finite set of facts in the schema,
- \(Dim \subseteq Ent\) is a finite set of dimensions in the schema, and
- \(Measure \subseteq Fact\) is a finite set of measures in the schema.

### 6.2 Design of Rules for the TEM Method

- Phase 1: Creation of the basic star schema.
  - Rule 1.1: Creation of measures in an empty star schema for each endogenous variable of the EM, which is defined by:$$\forall y_{s} \in Y : m_{s} \in Measure\;{\text{and}}\;\forall y_{st} \in Y : m_{st} \in Measure.$$
  - Rule 1.2: Creation of the dimension in the star schema for each exogenous variable in the EM, which is defined by:$$\forall x_{r} \in X : d_{r} \in Dim\;{\text{and}}\;\forall x_{rt} \in X : d_{rt} \in Dim.$$
  - Rule 1.3: If there is a time variable in the EM, create the time dimension:$$\forall x_{rt} \in X : d_{rt} \in Dim_{time} .$$
- Phase 2: Creation of relations between entities in the star schema.
  - Rule 2.1: If there is a relationship between exogenous variable \(x\) and endogenous variable \(y\), and \(getKey\) is a function that returns the set of keys of the corresponding entities, then we create an association between the corresponding fact and the corresponding dimension:$$\forall (x,y) \in Rel : (d,c,K) \mid (d \in Dim) \wedge (c \in Fact) \wedge ((d,c) \in Ass) \wedge \bigl(K \subseteq K_{d} \cup K_{c} \mid (K_{d} = getKey(d)) \wedge (K_{c} = getKey(c))\bigr)$$
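The rules above can be read as a small algorithm. The following Python sketch is one possible interpretation under stated assumptions, not the authors' implementation: the function names `tem` and `get_key`, the naming scheme for measures, dimensions, and keys, and the choice to always link the time dimension to the fact table are hypothetical.

```python
def get_key(entity):
    """Hypothetical stand-in for getKey: one surrogate key per entity."""
    return f"{entity}_id"


def tem(endogenous, exogenous, relations, has_time=True):
    """Sketch of TEM: derive a star schema with a single fact table."""
    # Rule 1.1: every endogenous variable becomes a measure.
    measures = [f"m_{y}" for y in endogenous]
    # Rule 1.2: every exogenous variable becomes a dimension.
    dimensions = [f"d_{x}" for x in exogenous]
    # Rule 1.3: a time variable in the EM adds a time dimension.
    if has_time:
        dimensions.append("d_time")
    # Rule 2.1: each (x, y) relation with an exogenous x links the dimension
    # of x to the fact table through a key. Pairs of two endogenous
    # variables need no association here, because both are measures of
    # the same fact table.
    foreign_keys = {}
    for x, y in relations:
        if x in exogenous and y in endogenous:
            foreign_keys[f"d_{x}"] = get_key(f"d_{x}")
    # Assumption: the time dimension is always linked to the fact table.
    if has_time:
        foreign_keys["d_time"] = get_key("d_time")
    return {
        "fact": "fact",
        "measures": measures,
        "dimensions": dimensions,
        "foreign_keys": foreign_keys,
    }


# Relations taken from the equation for y2 only; the other equations of the
# model would contribute further pairs.
schema = tem(
    endogenous=["y1", "y2", "y3"],
    exogenous=["x1", "x2", "x3", "x4", "x5"],
    relations=[("y1", "y2"), ("x1", "y2"), ("x5", "y2")],
)
print(schema["foreign_keys"])
```

Unlike a constellation schema, the result contains exactly one fact table regardless of the number of equations in the EM.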

### 6.3 Application of the Rules of the TEM Method

\(y_{1t}\) denotes industry production during the period \(t\), \(y_{2t}\) denotes other production during the period \(t\), \(y_{3t}\) is the total production during the period \(t\), \(x_{1t}\) is quantity, \(x_{2t}\) is price, \(x_{3t}\) is market demand, \(x_{4t}\) is supply, \(x_{5t}\) is firm-specific information, and \(u_{1t}\), \(u_{2t}\) are random components of the period \(t\).

\(y_{1t}\), \(y_{2t}\) and \(y_{3t}\) (rule 1.1). Subsequently, in accordance with rule 1.2, we create a dimension in the star schema for each exogenous variable in our EM: quantity, price, market demand, supply and firm-specific information (e.g., product characteristics). Since model (1) includes a time variable \(t\), the time dimension is created. In the last phase (rule 2.1), we form an association via the generated keys between the fact table and dimensions. Thus, for example, the equation \(y_{2t} = \beta_{21} y_{1t} + \gamma_{21} x_{1t} + \gamma_{25} x_{5t} + u_{2t}\) indicates that the quantity of products and firm-specific information have a relationship with other production (i.e., with the measure \(y_{2t}\) in the fact table). In the application context, the equation may be expressed as follows:

## 7 The Creation of the Prototype

### 7.1 Conceptual Design of the Prototype

Rule | Schema element | Result |
---|---|---|
Rule 1.1 | Measure | \(y_{kt} = 205.113L_{kt}^{0.249} WU_{kt}^{0.525} K_{kt}^{0.143}\) |
Rule 1.2 | Dimension | Land (L), Work (WU), Capital (K) |
Rule 1.3 | Dimension | Time |
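To make the conceptual design more concrete, the measure from Rule 1.1 can be evaluated for one combination of dimension members. The sketch below is illustrative only: the function names, the row layout, and the input values (including their units) are assumptions and are not taken from the prototype.

```python
def production(land, work, capital):
    """Measure from Rule 1.1: y = 205.113 * L^0.249 * WU^0.525 * K^0.143."""
    return 205.113 * land**0.249 * work**0.525 * capital**0.143


def fact_row(land, work, capital, period):
    """One fact-table row: a member of each dimension plus the measure."""
    return {
        "land": land,
        "work": work,
        "capital": capital,
        "time": period,
        "production": production(land, work, capital),
    }


# Hypothetical dimension members for a single period.
row = fact_row(land=100, work=4, capital=200, period="2015")
print(row)
```

The three exponents sum to 0.917, so doubling all three factors in this function increases production by a factor of less than two.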

### 7.2 Logical Design of the Prototype

#### 7.2.1 Find Relationships Between Different Sets of Entities

#### 7.2.2 Specify Primary Keys for All Sets of Entities

#### 7.2.3 Find All Attributes for Each Set of Entities

#### 7.2.4 Specify the Hierarchy of the Time Dimension

#### 7.2.5 Identify Granularity and Approach of Slowly Changing Dimension

- The first corresponds to the situation in which an econometrist has changed parameters in the econometric equation. The original measure should be preserved, while a new one should be created. This process enables EM-OLAP users to see differences in the calculation of the old and new econometric equations. This situation has no influence on dimension changes.
- The second results from a need to add or remove a variable to/from the EM (i.e., to add or remove a relationship between a dimension and a measure). This need leads to fundamental difficulties with granularity, which are discussed in more detail in Sect. 7.4.2. This problem can be solved by creating a new data model with a new fact table. The disadvantage of this solution is high data redundancy.

### 7.3 Physical Design of the Prototype

#### 7.3.1 Integrated Data – A

#### 7.3.2 Integrated Data – B

#### 7.3.3 Integrated Data – C

- find a combination of several factors that leads to roughly the same level of production;
- identify the maximum value of production in the reporting period;
- derive the percentage change in the value of one factor during the change in value of the second factor and at a constant level of production.
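The three analyses listed above can be sketched on a small in-memory cube. The example reuses the production function \(y_{kt} = 205.113L_{kt}^{0.249} WU_{kt}^{0.525} K_{kt}^{0.143}\) from the conceptual design; the dimension members, the 5% tolerance, the restriction to a single reporting period, and all function names are assumptions made for illustration.

```python
from itertools import product

# Parameters of the production function from the conceptual design.
A, A_LAND, A_WORK, A_CAPITAL = 205.113, 0.249, 0.525, 0.143


def production(land, work, capital):
    return A * land**A_LAND * work**A_WORK * capital**A_CAPITAL


# Hypothetical dimension members for one reporting period.
LAND = [50, 100, 150]
WORK = [2, 4, 6]
CAPITAL = [100, 200]
cube = {combo: production(*combo) for combo in product(LAND, WORK, CAPITAL)}


def similar_output(cube, target, tolerance=0.05):
    """Factor combinations whose production is within +/- tolerance of target."""
    return [c for c, y in cube.items() if abs(y - target) <= tolerance * target]


def max_output(cube):
    """Factor combination with the highest production, and that production."""
    return max(cube.items(), key=lambda item: item[1])


def offsetting_land_change(work_change_pct):
    """Percentage change in land that keeps production constant when work
    changes by work_change_pct percent and capital is unchanged."""
    ratio = (1 + work_change_pct / 100) ** (-A_WORK / A_LAND)
    return (ratio - 1) * 100


print(similar_output(cube, target=cube[(100, 4, 100)]))
print(max_output(cube))
print(offsetting_land_change(10))
```

The last function follows from holding \(L^{0.249} WU^{0.525}\) constant: a relative change in work must be offset by a relative change in land raised to the power \(-0.525/0.249\).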

### 7.4 Result of Prototyping

#### 7.4.1 Existence of a Solution

#### 7.4.2 Limitations and Constraints of the Application

- Generally, OLAP focuses on the effective analysis of a large number of events that are related to combinations of a limited number of dimensions. Its aggregation mechanisms are the advantage of this solution, yielding a better understanding of the observed process or event. Thus, the number of concepts in the different dimensions must be limited. One should ensure that the dimension tables are related to the fact tables.
- It is always necessary to set the range of dimension values according to a concrete economic reality and to predict these ranges. When these ranges are large, the data should be rounded or categorized. In our considered context (agriculture), it is easy to set the ranges of dimensions such as land acreage and number of employees. However, capital is a continuous quantity, and its concrete values should be constrained via rounding or categorization.
- Multiequation models should be treated with care when considering the granularity problem. Several variants should be considered for this design: (1) select only one endogenous variable as a measure and solve the remaining equations as calculated columns; (2) create a measure for the selected variable as an aggregation function (e.g., average, maximum) and solve the remaining equations as calculated columns; (3) convert the EM into a reduced form (one equation); or (4) create a separate data cube for each endogenous variable, at the cost of losing the relationships among individual variables. The selected variant will depend on the econometric requirements of the created OLAP solution.
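The rounding or categorization of a continuous dimension such as capital, mentioned in the list above, might look as follows. This is a sketch under assumed values: the category bounds, labels, rounding step, and function names are hypothetical and would have to be set according to the concrete economic reality.

```python
from bisect import bisect_right

# Hypothetical category bounds for capital (units are assumed).
# A value equal to a bound falls into the higher category.
CAPITAL_BOUNDS = [100_000, 500_000, 1_000_000]
CAPITAL_LABELS = ["small", "medium", "large", "very large"]


def capital_category(capital):
    """Map a continuous capital value to a discrete dimension member."""
    return CAPITAL_LABELS[bisect_right(CAPITAL_BOUNDS, capital)]


def round_capital(capital, step=1_000):
    """Alternative: round capital to the nearest multiple of step."""
    return round(capital / step) * step


print(capital_category(250_000))  # medium
print(round_capital(123_456))     # 123000
```

Either approach bounds the number of members in the capital dimension, which keeps the number of dimension combinations in the cube manageable.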

## 8 EM-OLAP Framework

## 9 Application and Acceptance of the EM-OLAP Framework

## 10 Discussion

- Design of the TEM method. Because the method was designed by analogy as a thought process, its conclusions cannot be regarded as irrefutable claims. Therefore, other permissible transformations of the EM into conceptual and logical schemas may exist.
- Selection of variants of TEM designs. The quality of the proposed schemas was measured according to the scientific methods for measuring data marts presented by Serrano et al. (2008) and Gupta and Gosain (2010). In this area, new ways to measure the quality of multidimensional schemas are continually being developed. Therefore, we cannot evaluate the use of other approaches.
- Creation of rules for the TEM method. The formal notation of the TEM method was created via a mathematical apparatus derived step by step instead of via the formulation of definitions, theorems and mathematical proofs. The TEM method was successfully presented at the 9th European Computing Conference (Tyrychtr and Vrana 2016).
- Creation of the prototype. The creation of the prototype of conceptual and logical schemas (according to the TEM method) and the subsequent creation of the physical schema of a multidimensional database allowed us to accept hypothesis H1. To design a physical schema, we experimented with different variants of integrated data. All variants were based only on data suitable for the analysis of production functions. Evidently, the physical design demonstrated the ability to identify different approaches when different types of econometric context are proposed. In future research, the proposal of physical access (e.g., in the context of cost and demand functions) should be considered.
- Acceptance. Several potential problems hindering the adoption of the TEM method exist. According to the design principles of a multidimensional database, fact table measures should be connected to only the combinations of dimensions that determine their values. Thus, only measures sharing all dimensions should be incorporated into the fact table. As a result, the following two possible situations can occur in the design of a multiequation model:

- Only measures related to all dimensions are stored in a standard way (if possible) in the fact table. In practice, data cubes are created with respect to various areas.
- However, non-standard solutions in which measures are not related to all dimensions are not uncommon. These solutions are built using one fact table and multiple dimensions. This, however, leads to (1) many NULL elements in the data cube and (2) the need for the user to know the correct combinations, i.e., when to use a certain measure with a particular dimension.