1 Introduction
-
We propose a novel loss function for multicriteria selection of the optimal database design for document stores.
-
We devise an algebra of transformations that allow to systematically modify a database design while retaining all the information (i.e., no attributes or entities are lost).
-
We present an implementation of a local search algorithm that, driven by the loss function, proposes with high probability near-optimal designs.
-
We assess our method and prototype to show the scalability as well as the performance gain of our solution against competitors by using the RUBiS benchmark [11].
2 Related work
System | Data store | Workload | Optimization | Exploration |
---|---|---|---|---|
Chebotko et al. [24] | Col | Read | Rule-based | Low |
Mortadelo [7] | Col/Doc | Read | Query cost/metamodel | Low |
De Lima et al. [23] | Doc | Read | Query cost/rule-based | Low |
NoSE [6] | Col | Read | Query cost | Low |
DBSR [5] | Doc | Read | Query cost | Low |
DocDesign 2.0 | Doc | Read | Multicriteria | High |
3 Overview
Regions
as the top-level document, a normalized approach similar to 3NF, or an in-between solution. With reference to our estimate in the introduction, there are \(24^6\) possible alternative designs to choose from.
3.1 User inputs
3.2 Design processes
3.3 Loss function
3.4 Search algorithm
4 Canonical model
C | The catalog (the design) | \(E_N\) | Node (either atom or edge) |
A | Atom | \({\mathbb {A}}\) | Set of all atoms in C |
\(A_C\) | Class atom | O(h) | Root atom of a struct h |
\(E_R^{x,y}\) | Directed relationship between two atoms x and y | \({\mathbb {E}}_R\) | Set of all relationships in C |
\(E_H\) | Hyperedge | \(E_H^+\) | Transitive closure of \(E_H\) |
\(E_{\textit{Struct}}\) | Struct | \({\mathbb {E}}_{\textit{Struct}}\) | Set of all the \(E_{\textit{Struct}}\)s in C |
\(E_{\textit{Set}}\) | Set | \({\mathbb {E}}_{\textit{Set}}\) | Set of all the \(E_{\textit{Set}}\)s in C |
\({E}^{Doc}_{Doc}\) | Hyperedge representing a document of a document store | \(E^{Doc}_{\textit{Top}}\) | \(E^{Doc}_{Doc}\) representing top level document of a collection of a document store |
\({E}^{Doc}_{List}\) | Hyperedge representing a list in a document store | \(E^{Doc}_{Col}\) | \(E^{Doc}_{List}\) representing a top level list (a.k.a collection) |
\(E_R^{A_C,A_C}\) | Relationship between two class atoms | E | Generic superclass for edges |
4.1 Immutable graph
Method |
\(\langle \langle {preconditions} \rangle \rangle \)
|
\( \langle \langle {postconditions} \rangle \rangle \)
|
---|---|---|
\(\textit{Hyperedge}(C,nodes: \textit{Set} \ \textit{of} \ E_N )\)
|
\(\bullet \)
\(nodes\subseteq sel\!f\)
| |
\(\bullet \)
\(sel\!f\in C\)
| ||
\(\backsim \textit{Hyperedge}()\)
|
\(\bullet \)
\( sel\!f \notin \textit{self}.C \)
| |
\(\bullet \)
\(sel\!f.children@pre \subseteq \)
\(\textit{self}.\textit{parent}\)
| ||
\(E_H.\textit{addNode}(n:E_N)\)
|
\(\bullet \)
\(sel\!f \notin n^+\)
|
\(\bullet \)
\(n \in sel\!f\)
|
\(E_H.\textit{removeNode}(n:E_N)\)
|
\(\bullet \)
\(n \in sel\!f\)
|
\(\bullet \)
\(n \notin sel\!f\)
|
4.2 Storage-agnostic constructs
Method | Activity |
---|---|
\(\textit{Struct}(C, r:\! A_C\! ,At:\textit{Set} \ \textit{of} \ A, Re: \) | \(\textit{super}(C, \{r\} \cup At \cup Re \cup Hy)\) |
\(\textit{Set} \ \textit{of} \ E_R, Hy: \textit{Set of} \ E_H, p:E_H)\) | \(\textit{self}.\textit{setRoot}(r)\) |
\(p.\textit{addNode}(\textit{self}\,)\) | |
\(\textit{Se}t(C, Re: \textit{Set} \ \textit{of}\ E_R , Hy:\textit{Set} \ \textit{of} \ E_{\textit{Struct}},At:\) | \(\textit{super}(C,Re \cup Hy\cup At)\) |
\(\textit{Set} \ \textit{of} \ A, p:E_{\textit{Struct}})\) | \(p.\textit{addNode}(\textit{self}\,)\) |
4.3 Document store-specific constructs
Method | Activity |
---|---|
\(\textit{Document}(C,r:A_C,Re:\textit{Set} \ \textit{of} \ E_R,At:\textit{Set} \ \textit{of} \ A,\) | \(\textit{super}(C,r, At,Re, Do\cup Li, p)\) |
\( Do:\textit{Set} \ \textit{of} \ E^{Doc}_{Doc},Li:\textit{Set} \ \textit{of} \ E^{Doc}_{List}, p: \) | |
\((E^{Doc}_{List} \cup E^{Doc}_{\textit{Struct}}))\) | |
\(\textit{TopDoc}(C,r:A_C,Re:\textit{Set} \ \textit{of} \ E_R,At:\textit{Set} \ \textit{of} \ A,\) | \(\textit{super}(C,r, At,Re, Do\cup Li,p)\) |
\( Do:\textit{Set} \ \textit{of} \ E^{Doc}_{Doc},Li:\textit{Set} \ \textit{of} \ E^{Doc}_{List}, p: E^{Doc}_{Col})\) | |
\(\textit{Collection}(C, Do:\textit{Set} \ \textit{of} \ E^{Doc}_{\textit{Top}})\) | \(\textit{super}(C,\varnothing ,Do,\varnothing , \varnothing )\) |
\(List(C,Re: \textit{Set} \ \textit{of} \ E_R,Do: \textit{Set}\) | \(\textit{super}(C,Re,Do,At,p)\) |
\(\textit{of} \ E^{Doc}_{Doc}, At: \textit{Set} \ \textit{of} \ A, p: E^{Doc}_{\textit{Struct}})\) |
5 Design processes over the canonical model
5.1 Random design generation
products
and comments
from RUBiS and also include users
to have a complex scenario to generate a random design. Let us assume we picked \(E_R^{\textit{P}\_\textit{ID},\textit{C}\_\textit{ID}}\) as the first \(E_R\) in Algorithm 2 line 5. Next, in Algorithm 3 we got \(\textit{LEFT}\) as the random \(\textit{opChoice}\) in line 2. Since there are no existing \(\textit{Comp}\)s we move to line 21 and create a new \(\textit{Comp}\) to the \(\textit{comps}\) list with \(Cnode\langle P\!\_ID,null,null,ROOT\rangle \) as the root and a single child \(Cnode\langle P\!\_ID, C\!\_ID, E_R^{P\_ID,C\_{I}D},\textit{NEST}\rangle \) if we got \(\textit{NEST}\) option. Now, coming back to line 7 in Algorithm 2, we have \(E_R^{P\_ID,U\_{I}D}\) in \(E_Rs\) as the only unused \(E_R^{A_C,A_C}\) connected to \(P\!\_ID\). Here, at line 13 we pick this \(E_R\) and go back to Algorithm 3. Let us assume that we got \(\textit{RIGHT}\) as the \(\textit{opChoice}\). We cannot extend the previous \(\textit{Comp}\) that we made as it does not satisfy the extensible criteria. Thus, we create a new \(\textit{Comp}\) with \(Cnode\langle \textit{U}\_\textit{ID},null,null,ROOT\rangle \) as the root and \(Cnode\langle \textit{U}\_\textit{ID},\textit{P}\_\textit{ID},E_R^{P\_ID,U\_{I}D},REF\rangle \) as its child.products
embedding comments
in one collection with comments
having a reference to the users
and a second collection of users
with a reference to both comments
and products
.5.2 Design transformations
Design 3
, \(E_{Rel}\) belongs to the \(E^{Doc}_\textit{Top}\) containing \(P\!\_\textit{ID}\); in Design 5
, it is the \(E^{Doc}_\textit{Set}\) of \(C\!\_ID\); and in Design 4
, that of the \(A_C\)). Notice that in Designs 1
and 2
, there is a double-head dashed arrow, which means that, for segregation and union, its existence is optional and also can be at either one or other side. The optionality of the \(E_R\) in Design 1
and 2
implies that the reference between the collections can reside on either side, which gives rise to four alternative schemas, namely references on both collections, one collection, or none.product
. Since, only \(A_C\)s are relevant for the transformations, namely \(P\!\_\textit{ID}\) and \(C\!\_\textit{ID}\), only these are shown to keep the figures clean. Nevertheless, we assume that the attributes of any \(A_C\) are always attached to it (e.g., \(P\!\_\textit{NAME}\) will always be in the same hyperedge/document as \(P\!\_\textit{ID}\)).Design 1
where a product
have references to the comments
and follow the transformations as illustrated in Fig. 5. According to Table 10, in order to union two \(E_\textit{Set}^{Doc}\)s, they need to share the same parent. Thus, if we call \(E_{Col\_Prod}^{Doc}.union(E_{Col\_Com}^{Doc})\), firstly, the \(E_{Col\_Com}^{Doc}\) is added to the source \(E_{Col\_Prod}^{Doc})\), followed by the removal of the absorbed collection from the catalog. Finally, \(E_{Col\_Com}^{Doc}\) is disposed, leaving its children in the new parent \(E_{Col\_Prod}^{Doc}\), as represented in Design 2
, where comments
and products
are in the same collection (notice that in this case only products
references comments
). The equivalent representation in JSON is illustrated in Fig. 6
Design 3
. The result is that for each product, there will be several comments
in the form of a flattened list, and each document will carry the name of the relationship suffixed by a counter. Implementation-wise, we rename the embedded \(E^{Doc}_{Doc}\) with the name of the relationship followed by a counter, as shown in the JSON in RHS of Fig. 7.
Design 3
, we obtain Design 4
, where the comments
are directly embedded inside each product
without an enclosing comments
document. However, the prefix gotCom
followed by the counter still needs to be included in the JSON to distinguish different instances as shown in Fig. 8. Similarly, by applying \(E^{Doc}_{List\_Com}.\textit{Flatten}()\) on Design 5
, we can flatten the list of comment
s into an embedded sequence with a counter.
Design 3
, we obtain Design 5
which embeds the comments
inside each product
document. Afterward, the child \(E_{Doc}^{Doc}\) and the \(E_R\)s that are no longer used in any path to the remaining children, are removed from the original \(E_\textit{Struct}^{Doc}\). The transformation results in a new List inside the JSON containing the data in the child \(E_{Doc}^{Doc}\) which is linked to the container by the path of \(E_R\)s. To revert this transformation, we can call flatten on the created \(E_{List}^{Doc}\) moving us back to Design 3
. This transformation is illustrated as JSON in Fig. 9
Design 4
to nest the comment
, we can obtain back Design 3
as shown in Fig. 10.
product
in Design 4
, we can obtain back Design 2
(the dashed arrow depending on the path the parameters determine). An example of the split transformation is shown in Fig. 11. Finally, the \(\textit{segregate}\) transformation divides an \(E_{List}^{Doc}\) (or \(E_{Col}^{Doc}\)) containing multiple \(E_{Doc}^{Doc}\)s or As into two. The only condition is that the segregated nodes must be already contained in the original \(E_{List}^{Doc}\). After the transformation, the \(E_R\)s that are no longer used by any of the children inside the original \(E_{List}^{Doc}\) are removed from it together with the segregated \(E_{List}^{Doc}\). As a result, the corresponding JSON will contain two independent lists (or collections if we are talking about \(E_{Col}^{Doc}\)) with \(E_{Doc}^{Doc}\) or A, whose contents will depend on the path to \(E_{Doc}^{Doc}\) or A from the parent of the original \(E_\textit{Set}^{Doc}\). By calling \(E_{Col\_ABs}^{Doc}.segregate(E_{Top\_Com}^{Doc})\) on Design 2
, we can obtain Design 1
. Figure 12 illustrates the segregation in JSON.
6 Experiments
6.1 Quality of the design
Query | Frequency |
---|---|
4% | |
2% | |
4% | |
6% | |
20% | |
6% | |
4% | |
20% | |
8% | |
6% | |
20% |
User
give a Product
and a Comment
. Finally, we measured the throughput of each of the alternate designs.product
. On the contrary, DocDesign 2.0 learns more toward having references and only embedding the region
within the user
. From an end-user perspective, the design from DocDesign 2.0 is much cleaner and has less maintenance compared to the ones of DBSR. Moreover, doing any updates will be pretty expensive in DBSR designs as it will involve updating multiple documents in different collections. The documents used in DBSR contain rather small documents and are unrealistic for a real-world scenario. Because of this, we conduct the same experiment with increased document sizes by converting the integer identifiers into MongoDB UUID
fields (24 bytes instead of only 4) and increasing the description attribute size while maintaining the same workload.DocDesign 2.0 | DBSR (3 col) | DBSR (5 col) |
---|---|---|
Runtime (ms) | ||||||
---|---|---|---|---|---|---|
Min | Q1 | Mean | Media | Q3 | Max | |
DocDesign 2.0 | 270 | 512 | 733 | 604 | 726 | 80,639 |
DBSR (3 col) | 228 | 470 | 760 | 575 | 1074 | 80,063 |
DBSR (5 col) | 262 | 523 | 787 | 585 | 2067 | 76,259 |
DBSR (5 col agg.) | 253 | 542 | 1304 | 607 | 2471 | 75,583 |
Runtime (ms) | ||||||
---|---|---|---|---|---|---|
Min | Q1 | Mean | Media | Q3 | Max | |
DocDesign 2.0 | 650 | 1121 | 1842 | 1268 | 2003 | 88,869 |
DBSR (3 col) | 519 | 1292 | 2629 | 1782 | 4091 | 115,290 |
DBSR (5 col) | 639 | 1558 | 2683 | 1866 | 7317 | 97,611 |
DBSR (5 col agg.) | 619 | 1555 | 4147 | 1814 | 9587 | 108,083 |
6.2 Scalability of the approach
-
The design generated by using weights that represent typical requirement of optimizing queries with consideration on storage space outperforms the design generated by DBSR.
-
DocDesign 2.0 ’s design was not only performant, but also requires less storage space.
-
DocDesign 2.0 ’s multicriteria-based approach provides flexibility for the end users to optimize according to their requirements.
-
The non-improving iterations (N) determine the optimality of the design. Through a synthetic workload with 30 entities with varying topology, we concluded that \(N=15\) would already generate a near-optimal solution.