1 Introduction
2 Related work
3 Data model
3.1 Data
3.2 Operators
-
Selection: \(\sigma _C(T) = \{t ~ | ~ t \in T \wedge t\) satisfies condition \(C\}\).
-
Projection: \( \pi _{{\varvec{A}}}(T) = \{t.\vec {A}~ | ~ t \in T\}\). Note that its results are duplicate eliminated.
-
Join: \( T_1 \bowtie _{\theta _1} \cdots \bowtie _{\theta _{m-1}} T_m = \{\langle t_1, \ldots , t_m \rangle ~ | ~ t_i \in T_i ~(i=1,\ldots ,m) \wedge \langle t_1, \ldots , t_m \rangle \text {satisfies condition}~\theta _i (i=1, \ldots , m-1) \}. \) Join operator without \(\theta _i\) represents natural join.
-
Aggregation: \( \alpha _{{\varvec{G}},g({\varvec{B}})}(T) = \{\langle T'.{\varvec{G}}, g(T'.{\varvec{B}}) \rangle ~ | ~ T' \subseteq T, \forall t, t' \in T', \forall t'' \notin T':t'.{\varvec{G}} = t.{\varvec{G}} \wedge \) \(t''.{\varvec{G}} \ne t.{\varvec{G}} \}\). Note that \({\varvec{G}}\). denotes the grouping key of T and \(g({\varvec{B}})\) denotes an aggregate function over attribute \({\varvec{B}}\) in the grouped table.
-
Union: \( T_1 \cup \cdots \cup T_m = \{t~ | ~t \in T_i~(i=1,\ldots ,m)\}\).
-
Difference: \( T_1 - T_2 = \{t ~ | ~ t \in T_1 \wedge t \notin T_2\}\).
4 Augmented lineage
-
Source Lineage (SL): \({ SL}_{{\mathscr {T}}}({\bar{O}}) \)
-
Reasoning Lineage (RL): \({ RL}_{{\mathscr {T}}}({\bar{O}})= \{ \langle o.{{\varvec{E}}_i}, o.\hbox {Value},o.\hbox {Reason} \rangle | \forall i: o \in IL({\bar{O}}, O'_i) \} \)
5 Augmented lineage derivation
5.1 Segment
-
Non-D-segment: A segment of operators except the difference in the pattern: “\(\phi ^*\)-\(\alpha ^*\)-\((\cup | \pi | \sigma )^*\)-\((\pi | \sigma |\bowtie )^*\)”. \(\phi ^*\) and \(\alpha ^*\) denote sequences of function and aggregation operators, respectively. \((\cup | \pi | \sigma )^*\) and \((\pi | \sigma | \bowtie )^*\) denote the operator sub-trees that consist of any combinations of the specified three operators. The leftmost operator is located at the top of the operator tree. Namely, the operators are executed from right to left in a bottom-up manner. Note that all the operators do not need to actually appear in a Non-D-segment. If \(\alpha ^*\) consists of multiple aggregations \(\alpha _1- \cdots -\alpha _m\), the grouping keys \({\varvec{G_i}}\) of \(\alpha _i\) must meet the following conditions: \({\varvec{G_1}} \subseteq \cdots \subseteq {\varvec{G_m}}\). This segment is based on AUSPJ-segment \(\alpha \)-\(\cup \)-\(\pi \)-\(\sigma \)-\(\bowtie \) proposed in [17]. Placing the function operator at the top of the segment naturally extends AUSPJ-segment to accommodate the function operator, since attributes corresponding to Value and Reason of UDFs are guaranteed to appear in its intermediate result, which is convenient for deriving reasoning lineage. Another change is generalization of operator sequences to reduce the number of segments.
-
D-segment: A segment consisting of a single difference operator. This segment is represented as the following pattern: “−”.
5.2 Segmentation
5.3 Tracing query
-
Tracing Query for a Non-D-Segment: Based on the rewriting in Proposition 2, the source lineage of tuple set \({\bar{O}} (\subseteq {\mathscr {T}}(D))\) in task \({\mathscr {T}}\) can be obtained by executing the following tracing query:Note that \({\varvec{A}}_{T^i_j}\) denotes the set of attributes of \(T^i_j\) and \( < imes \) denotes semi-join. If some operators are missing in the Non-D-segment, their counterparts are omitted in the tracing query. If tuple set \({\bar{O}}\) is the whole output \({\mathscr {T}}(D)\), the tracing query is denoted as follows using the notation ALL:$$\begin{aligned} \begin{aligned} { TQ}_{{\bar{O}},{\mathscr {T}}}(D)&= \bigcup _{i} { Split}_{{\varvec{A}}_{T^i_1},\ldots ,{\varvec{A}}_{T^i_{m_i}}}\\&\quad (\sigma _{C_i}(T^i_1 \bowtie \cdots \bowtie T^i_{m_i}) < imes {\bar{O}}) \end{aligned} \end{aligned}$$$$\begin{aligned} \begin{aligned} { TQ}_{\textbf{ALL},{\mathscr {T}}}(D)&= \bigcup _{i} { Split}_{{\varvec{A}}_{T^i_1},\ldots ,{\varvec{A}}_{T^i_{m_i}}} \\&(\sigma _{C_i}(T^i_1 \bowtie \cdots \bowtie T^i_{m_i})) \end{aligned} \end{aligned}$$
-
Tracing Query for a D-Segment: Given a D-segment \({\mathscr {T}}(D) = T_1 - T_2\), the source lineage of tuple set \({\bar{O}} (\subseteq {\mathscr {T}}(D))\) in task \({\mathscr {T}}\) is represented as follows:Please note that the tracing query for a D-Segment identifies \(T_2\) as well as \({\bar{O}}\) as the source lineage.$$\begin{aligned} { TQ}_{{\bar{O}},{\mathscr {T}}}(T_1, T_2) = \{ {\bar{O}}, T_2 \} \end{aligned}$$
5.4 Augmented lineage derivation procedure
6 Implementation of augmented lineage derivation
6.1 Deployment of intermediate results
-
Rerun: This approach runs the original analysis task as it is, usually in the normal mode. When the augmented lineage is requested, it restores all the needed intermediate results of the non-root segments by rerunning the original task in the reasoning mode. Although it causes no runtime overhead and storage cost to the original analysis task, rerunning the task to restore the intermediate results takes much time.
-
Full Materialization (Full): This approach executes the original analysis task in the reasoning mode and materializes (generates and stores) the needed intermediate results of non-root segments during its execution. This approach affects the performance of the original task and needs the storage cost for managing the intermediate results.
-
Function Materialization (FM): This approach executes the original analysis task in the reasoning mode as in Full Materialization but materializes only intermediate results of non-root segments which contain function operators during its execution. Intermediate results of the other segments are restored by rerunning the segments.
6.2 System architecture
7 Experiment
7.1 LFW dataset evaluation
7.1.1 Task
7.1.2 UDFs
-
Face Recognition A person is identified by an ML-based face recognition model in the given image. It outputs the person’s name as Value and the position of the bounding box around the face as Reason. Its processing cost is more expensive than the following alternative.
-
String Processing A person is recognized by extracting the substring corresponding to the celebrity’s name from the URI string. It outputs the person’s name as Value and the position of the name substring in the URI as Reason. Its processing cost is cheaper than the face recognition implementation.
Small | Medium | Large | |
---|---|---|---|
Image | \(4.5 \times 10^4\) | \(4.5 \times 10^5\) | \(4.5 \times 10^6\) |
Event | \(6.075 \times 10^5\) | \(6.075 \times 10^6\) | \(6.075 \times 10^7\) |
7.1.3 Task processing
7.1.4 Derivation of augmented lineage
Small | Medium | Large | |
---|---|---|---|
FM | \(4.5 \times 10^3\) | \(4.5 \times 10^4\) | \(4.5 \times 10^5\) |
Full | \(4.95 \times 10^4\) | \(4.95 \times 10^5\) | \(4.95 \times 10^6\) |
7.2 CDA workload evaluation
CDA workload query | Original TPC-H benchmark query | Changes | Involved segment |
---|---|---|---|
Q1 | Q2 | Apply person_detection UDF to Part table and select parts whose photograph images include any human image | Non-D |
Q2 | Q3 | Apply text_sentiment UDF to Lineitem table and select line-item tuples whose review comments are positive | Non-D |
Q3 | Q9 | Apply person_detection UDF to Part table and select parts whose photograph images include any human image | Non-D |
Q4 | Q15 | Apply text_sentiment UDF to Lineitem table and select line-item tuples whose review comments are positive | Non-D |
Q5 | Q16 | Apply person_detection UDF to Part table and select parts whose photograph images include any human image | Non-D |
Q6 | Q17 | Apply person_detection UDF to Part table and select parts whose photograph images include any human image | Non-D |
Q7 | Q18 | Apply text_sentiment UDF to Lineitem table and select line-item tuples whose review comments are positive | Non-D |
Q8 | Q19 | Apply person_detection UDF to Part table and select parts whose photograph images include any human image | Non-D |
Q9 | Q20 | Apply person_detection UDF to Part table and select parts whose photograph images include any human image. Then, add distinct clause to remove duplicate | Non-D |
Q10 | Q21 | Apply text_sentiment UDF to Lineitem table and select line-item tuples whose review comments are positive | Non-D and D |
7.2.1 Tasks
-
Part table We added a new attribute to store the photograph image of the part (p_image). We used images in the COCO dataset [34] for this extension, and we stored path strings of the image files as its values.
-
Lineitem table We replaced the comment attribute (l_comment) values with Amazon review text data [40].
7.2.2 UDFs
-
Human detection This UDF is applied to the attribute p_image in the extended Part table. This UDF takes image data as input and decides whether any human image appears in the whole image or not. It returns a Boolean value as Value and the position of the bounding box of the human image as Reason when it returns True. We implemented this UDF using YOLO [44], which can detect objects quickly.
-
Text sentiment classifier This UDF is applied to the attribute l_comment in the extended Lineitem table. This UDF takes review comment text data as input and returns Positive or Negative, representing the sentiment of the review comment, as Value and likelihood of the sentiment as Reason. We implemented this UDF using BERT-Base text sentiment classifier provided by IBM [29].
7.2.3 Task processing
7.2.4 Derivation of augmented lineage
FM | Full | FM | Full | FM | Full | |
---|---|---|---|---|---|---|
Q1 | Q2 | Q3 | ||||
SF1 | \(7.89 \times 10^2\) | \(1.17 \times 10^5\) | \(3.02 \times 10^4\) | \(3.02 \times 10^4\) | \(1.09 \times 10^4\) | \(1.09 \times 10^4\) |
SF10 | \(7.92 \times 10^3\) | \(1.18 \times 10^6\) | \(3.00 \times 10^5\) | \(3.00 \times 10^5\) | \(1.08 \times 10^5\) | \(1.08 \times 10^5\) |
SF100 | \(8.00 \times 10^4\) | \(1.19 \times 10^7\) | \(2.99 \times 10^6\) | \(2.99 \times 10^6\) | \(1.09 \times 10^6\) | \(1.09 \times 10^6\) |
Q4 | Q5 | Q6 | ||||
SF1 | \(4.60 \times 10^5\) | \(4.70 \times 10^5\) | \(2.96 \times 10^4\) | \(2.96 \times 10^4\) | \(2.05 \times 10^2\) | \(2.00 \times 10^5\) |
SF10 | \(4.59 \times 10^6\) | \(4.69 \times 10^6\) | \(2.96 \times 10^5\) | \(2.96 \times 10^5\) | \(2.05 \times 10^3\) | \(2.00 \times 10^6\) |
SF100 | \(4.59 \times 10^7\) | \(4.69 \times 10^7\) | \(2.97 \times 10^6\) | \(2.97 \times 10^6\) | \(2.01 \times 10^4\) | \(2.00 \times 10^7\) |
Q7 | Q8 | Q9 | ||||
SF1 | \(6.30 \times 10^1\) | \(1.50 \times 10^6\) | \(4.71 \times 10^2\) | \(4.71 \times 10^2\) | \(2.22 \times 10^3\) | \(5.47 \times 10^5\) |
SF10 | \(5.88 \times 10^2\) | \(1.50 \times 10^7\) | \(4.70 \times 10^3\) | \(4.70 \times 10^3\) | \(2.17 \times 10^4\) | \(5.47 \times 10^6\) |
SF100 | \(5.12 \times 10^3\) | \(1.50 \times 10^8\) | \(4.81 \times 10^4\) | \(4.81 \times 10^4\) | \(2.18 \times 10^5\) | \(5.48 \times 10^7\) |
Q10 | ||||||
SF1 | \(3.78 \times 10^3\) | \(7.55 \times 10^3\) | ||||
SF10 | \(3.97 \times 10^4\) | \(7.95 \times 10^4\) | ||||
SF100 | \(3.99 \times 10^5\) | \(7.98 \times 10^5\) |