Abstract
Charts, figures, and text derived from data play an important role in decision making. But making sense of or fact-checking outputs means understanding how they relate to the underlying data. Even for experts with access to the source code and data sets, this poses a significant challenge. We introduce a new program analysis framework (A supporting artifact is available at https://zenodo.org/records/14637654 [5].) which supports interactive exploration of fine-grained IO relationships directly through computed outputs, using dynamic dependence graphs. This framework enables a novel notion in data provenance which we call linked inputs, a relation of mutual relevance or cognacy which arises between inputs that contribute to common features of the output. We give a procedure for computing linked inputs over a dependence graph, and show how the approach presented in this paper is faster on most examples than an implementation based on execution traces.
1 Introduction: Towards Transparent Research Outputs
Whether formulating national policy or making day-to-day decisions about our own lives, we increasingly rely on the charts, figures and text created by scientists and journalists. Interpreting these visual and textual summaries is essential to making informed decisions. However, most of the artifacts we encounter are opaque, inasmuch as they are unable to reveal anything about how they relate to the data they were derived from. Whilst one could in principle try to use the source code and data to reverse engineer some of these relationships, this requires substantial expertise, as well as valuable time spent away from the “comprehension context” in which the output in question was encountered. These difficulties are compounded when the information presented uses multiple data sources, such as medical meta-analyses [13], ensemble models in climate science [25], or queries that span multiple database tables [36]. Perhaps more often than we would like, we end up taking things on trust.
Fig. 1.
A hand-crafted transparent visualisation due to Bremer and Ranzijn [6]
With traditional print media, a “disconnect” between outputs and underlying data is unavoidable. For digital media, other options are open to us. One way to improve things is to engineer visual artifacts to be more “self-explanatory”, revealing the relationship to the underlying data to an interested user. Consider the histogram in Figure 1 showing urban population growth in Asia over a decade [6]. A reader might have many questions about what this chart represents – in other words, how visual elements map to underlying data. Do the points represent individual cities? What does the colour scheme indicate? Which of the points represent large cities or small cities? Legends and other accompanying text can help, but some ambiguities inevitably remain. The purpose of a visual summary after all is to present the big picture at the expense of certain detail.
Bremer and Ranzijn’s [6] approach to this problem is an interactive feature that allows a user to explore some of these provenance-related questions in situ, i.e. directly from the chart. Selecting an individual point shows the user a view of the data the point was calculated from. In the figure the highlighted point represents Chiang Mai, and shows that the plotted value of 10.9% was derived from an increase in population density from 4,416 to 5,010 people per sq. km. Features like these are valuable as comprehension aids, but are laborious to implement by hand and require the author to anticipate the queries a user might have. For these reasons they also tend not to generalise: for example Bremer and Ranzijn’s [6] visualisation only allows the user to select one point at a time.
1.1 Data transparency as PL infrastructure
Hand-crafted efforts like Bremer and Ranzijn’s [6] are labour-intensive because they involve manually embedding metadata about the relationship between outputs and inputs into the same program. When the visualisation or analysis logic changes, the relationship between inputs and outputs also changes, and the metadata must be manually updated. A less costly and more robust approach is to treat data provenance as a language infrastructure problem, baking lineage or provenance metadata directly into outputs so that provenance queries can be supported automatically. For example, for an in-memory database engine, Psallidas and Wu [32] describe how to “backward trace” from output selections to input selections, and then “forward trace” to find related output selections in other views, to support a popular feature from data visualisation called linked brushing. Perera et al. [30] implemented a similar system for a general-purpose functional programming language, using execution traces for bidirectional queries.
The advantage of shifting the burden of implementing transparency features onto the language runtime is that the author of the content can concentrate purely on data visualisation, and as the infrastructure improves, the benefits are inherited automatically by end users, at no additional cost to the author. For example if formal guarantees (perhaps that data selections are in some sense minimal and sufficient) are provided, then those can be proved once for the infrastructure rather than for each bespoke implementation.
In this paper, we propose a new bidirectional analysis framework for a general-purpose programming language, which supports fine-grained in situ provenance queries, an end-user feature we call data transparency. In contrast to prior work, we use dynamic dependence graphs [2, 17] to implement the analyses, which enables our approach to be fast enough for interactive use and also language-independent, by separating queries over the graph from the problem of deriving the dependence graph for a particular program.
1.2 Contributions and Roadmap
Our specific contributions are as follows:
§ 3 defines a core calculus for data-transparent outputs, where parts of inputs and computed values are assigned unique labels called addresses, and programs have an operational semantics that pairs every result with a dynamic dependence graph capturing fine-grained IO relationships between input addresses and output addresses.
§ 4 presents a new formal framework for bidirectional provenance queries over dependence graphs, formalising two operators over such graphs, \(\triangledown \) (demands) and \(\vartriangle \) (demanded by). We show these to be conjugate in the sense of Jonsson and Tarski [22], and give procedures to compute \(\triangledown \) and \(\vartriangle \). We show how a novel cognacy operator called linked inputs, relating inputs when they contribute to common parts of the output, can be obtained by composing \(\triangledown \) and \(\vartriangle \), and how a dual linked outputs operator, supporting the “linked visualisations” of prior work [30, 32], is obtained by transposing the two operators.
§ 5 shows that our implementation performs better than one based on traces and bidirectional interpreters, the primary alternative implementation technique. Using usability metrics from Nielsen [26], we find that \(81\%\) of the queries we tested execute at a speed that appears instantaneous to a user, compared to \(25\%\) using an implementation based on traces. We also compare the overhead of building a trace with that of building a dependence graph for a given program.
Although Psallidas and Wu [32] support fast queries for linked visualisations in a database setting, and Perera et al. [30] support linked outputs in a general-purpose language, ours is (to the best of our knowledge) the first implementation for a general-purpose language which is fast enough for interactive use. § 6 discusses other related work in database provenance, program slicing and data visualisation.
§ 7 wraps up with a discussion of some limitations. In particular, the sort of transparency considered in this paper is purely extensional, and falls short of providing full explanations of how output parts are related to input parts. We discuss more intensional forms of transparency in § 7 and propose some other ways in which the present system could be improved.
We implement our approach in a pure functional programming language called Fluid. The author of a visualisation expresses their chart as a pure function of the inputs, using a set of built-in data types for common visualisations; a d3.js front end automatically enriches the rendered outputs with support for interactive selection and the data transparency queries introduced in the next section. Our implementation is available at https://github.com/explorable-viz/fluid.
2 Overview: Fine-Grained Interactive Provenance
In this section, we introduce the main interactive data provenance features that we wish to support, using two Fluid examples to illustrate. The line chart in Figure 2 shows projected methane emissions from agricultural sources under a global warming scenario called RCP8.5, with source code in Figure 3; the scatter plot and stacked bar chart in Figure 4 show changing non-renewable energy outputs and capacities for various countries, with source code in Figure 5.
Fig. 2.
Data transparency: demanded by (\(\vartriangle \)) and demands (\(\triangledown \)) operators link inputs that share common output dependencies.
The key idea of a transparent visualisation is that the original data sources are kept around and can be viewed (when a user so requests) alongside the visualisation, and the user is then able to interact with both the data and the view to explore how they are related. Crucially the author of the visualisation does not have to implement any of these interactions themselves; they need only write the visualisation code, and the language runtime and rendering infrastructure provide the interactions.
1. Fine-grained linking of the data to the view. In Figure 2, the user has chosen to reveal the underlying data set, which is shown on the left; only rows relevant to the chart are visible, with the other rows hidden automatically. This clarifies that only agricultural data is relevant, confirming the (informal) claim in the caption. The emissions values shown do contribute to the chart in some way, and the user can investigate this by interacting with individual entries. For example, moving their mouse over the number 104.69 highlights one point in the projected emissions curve, and three points in the other curve, which plots the moving average (see Figure 3) of the projected emissions. The provenance analysis which underpins this we write as \(\vartriangle \) (“demanded by”), and here this tells us that 104.69 was needed to compute either the x or y coordinate (in this case just the y coordinate) of the four highlighted points.
Fig. 3.
Moving average source code
2. Linked inputs. We want to support fine-grained IO queries that run in the other direction too, via another provenance analysis written \(\triangledown \) (“demands”). In Figure 2 the \(\triangledown \) analysis is initiated automatically on the output of \(\vartriangle \); this reveals that calculating the y coordinates of the points highlighted on the right required not only the 104.69 we started with, but 4 additional emissions values (blue border on the left). These additional inputs we refer to as the related inputs of the original input selection; they are the other inputs demanded by any output that demands our starting input, and are computed using the composite operator \(\triangledown \circ \vartriangle \). Given that the initial input selection here contributes to 3 points of the moving average, the “window” of related inputs in this case is 5 wide, comprising all the data points needed to account for the selection on the right. Related inputs is a cognacy relation: two inputs are related if they have a common ancestor in a dependence graph that captures how inputs are demanded by outputs.
3. Fine-grained linking of the view to the data. We would also like to support cognacy queries that start from the output rather than the input; this is the basis of a feature called linked brushing (or brushing and linking) in data visualisation [4, 7]. Here we would like to provide linked brushing in a way that is transparent to the user. In Figure 4, the user has again revealed the underlying data set, this time opting to see only rows with active data selections, rather than only rows with data that are used by any part of the output. They then express interest in one of the points in the scatter plot. By clicking on it, rather than just mousing over it, they create what we call a persistent selection (shown in green). The demands analysis \(\triangledown \) reveals (also in green) the inputs needed to compute both the x and y coordinates of the selected point; this shows that the 2016 data for 4 countries was used, although it does not reveal how.
Fig. 4.
Data transparency: demands (\(\triangledown \)) and demanded by (\(\vartriangle \)) operators link outputs that share common input dependencies.
4. Linked outputs. Now that the inputs demanded by the output selection are determined, the \(\vartriangle \) analysis runs automatically on the output of \(\triangledown \); this reveals that those inputs were also needed for 4 of the bar segments (highlighted with cross-hatching) in the bar chart, namely those for 2016. This is standard linked brushing, and is implemented using the composite operator \(\vartriangle \circ \triangledown \), which we call related outputs. Related outputs is a “co-cognacy” relation, relating two outputs whenever they have a common descendant, rather than common ancestor, in the dependence graph. Here the user can take advantage of the fact that a persistent selection remains active even when the mouse moves off the original item of interest; this allows them to investigate further. Moving their mouse over the yellow IND segment of the 2016 bar (highlighted with blue border) initiates an orthogonal \(\triangledown \) query which shows (also with various blue borders) the data needed to compute the yellow segment. Overlaid on the persistent selection, this reveals that the 37.90 in the nuclearOut column is needed to compute both the IND bar segment and the selected scatter plot point, explaining why the two output selections are linked.
Fig. 5.
Non-renewable energy source code
3 A Core Calculus for Transparent Outputs
We start by introducing a calculus for transparent outputs, which is the core of Fluid. The core language supports deep pattern-matching and mutual recursion. Fluid also provides a surface syntax that desugars into the core, providing piecewise function definitions, list comprehensions and other conveniences, as shown in Figures 3 and 5, which would otherwise complicate the core.
The term syntax of the core language (§ 3.1) is uncontroversial: it is pure and untyped, and provides datatypes, records and matrices (although we omit a presentation of matrices here). The syntax of values is less standard: each part of a value has a unique address \(\alpha \) which serves to identify that part of the value in a dependence graph. We give a big-step operational semantics that evaluates a term to both a value and a dynamic dependence graph for that value (§ 3.2), capturing how parts of the value depend on parts of the input.
Some notation. We write \(\overline{x}\) to denote a finite sequence of elements \({x_1} ,\, ..,\, {x_n}\), with \(\epsilon \) as the empty sequence. Concatenation of sequences is written \(\overline{x} \cdot \overline{y}\); we also write \(x \cdot \overline{x}\) for cons (prepend) and \(\overline{x} \cdot x\) for snoc (append). We write \(\overline{k:x}\) to denote a finite map, i.e. a set of pairs \({{k_1}:{x_1}} ,\, ..,\, {{k_n}:{x_n}}\) where keys \(k_i\) are pairwise unique. If X and Y are sets, we write \(X \uplus Y\) to mean \(X \cup Y\) where X and Y are disjoint, and also \(x \cdot X\) or \(X \cdot x\) to mean \(X \uplus \{x\}\).
3.1 Syntax
Fig. 6.
Syntax of the core language, including values labeled with addresses
Terms Figure 6 defines expressions e of the language, which include variables x, integer constants n, let-bindings \({{\texttt {let}}}\;{x}\mathrel {{{\texttt {=}}}}{e}\;{{\texttt {in}}}\;{e'}\), record construction, record projection \(e.x\) and (saturated) constructor expressions \(c(\overline{e})\), where c ranges over data constructors. The language is parameterised by a finite map \(\varSigma \) from constructors c to arities \(\varSigma (c) \in \mathbb {N}\). Function application has two forms: the usual \({e}\,{e'}\) and (saturated) foreign function application (described below). Lastly, expressions include anonymous functions \(\lambda {\sigma }\), where \(\sigma \) is a pattern-matching construct called an eliminator (§ 3.1), and sets of mutually recursive functions \({{\texttt {let}}}\;{\rho }\;{{\texttt {in}}}\;{e}\), where \(\rho \) is a finite map from names to eliminators.
The language is also parameterised by a finite map \(\varPhi \) from variables f to arities \(\varPhi (f) \in \mathbb {N}\) of the foreign function they denote. For the graph semantics in § 3.2 below, every foreign function name f is required to provide an interpretation \(\hat{f}\) that, for a sequence of arguments \(\overline{v}\) and dependence graph G, returns the result of applying the foreign function f to \(\overline{v}\) plus a dependence graph \(G'\) which extends G with information about how the result depends on \(\overline{v}\).
Continuations and eliminators A continuation \(\kappa \) is a term e or an eliminator \(\sigma \), describing how an execution proceeds after a value is matched. Eliminators are a deep pattern-matching construct based on tries [21, 31]; for a language with rich structured values like records and data types, they make for a cleaner presentation than single shallow elimination forms for each type. Piecewise function definitions in the surface language desugar into eliminators.
An eliminator specifies how values of a particular shape are matched, and for a given value, determines how any pattern variables get bound and the continuation \(\kappa \) which will be executed under those bindings. A variable eliminator \({x}\mapsto {\kappa }\) says how to match any value (as variable x) and continue as \(\kappa \). A record eliminator says how to match a record with a given set of fields \(\overline{x}\), and provides a continuation \(\kappa \) for sequentially matching the values of those fields. Lastly, a constructor eliminator provides a branch \(c \mapsto \kappa \) for each constructor c in its domain, where each \(\kappa \) specifies how any arguments to c will be matched. (We assume that any constructors matched by an eliminator all belong to the same data type, but this is not enforced in the core language.)
Addresses, values and environments We define addressed values v mutually inductively with raw values \(\boldsymbol{v}\). A raw value is simply a value without an associated address; a value \({\boldsymbol{v}}_{\alpha }\) decorates a raw value \(\boldsymbol{v}\) with an address \(\alpha \). Addresses are allocated during evaluation so that new partial values have fresh addresses, which can then be used as vertices in a dependence graph.
Raw values \(\boldsymbol{v}\) include integers n, m; records; and constructor values \(c(\overline{v})\). Raw values also include closures \({{\texttt {cl}}}({\gamma },{\rho },{\sigma })\) where \(\sigma \) is the function body, \(\gamma \) the captured environment, and to support mutual recursion, \({\rho }\) is the (possibly empty) set of named functions with which \(\sigma \) was mutually defined. Environments \(\gamma \) are finite maps from variables to values. Because foreign functions in the core language are not first-class and calls are saturated, the surface language provides a top-level environment which maps every foreign function name f of arity n to a closure (with fresh address \(\alpha \)) whose body applies f to its n arguments, emulating first-class foreign functions.
3.2 Operational Semantics
We now give a big-step operational semantics for the core language, which evaluates a term to a value paired with a dynamic dependence graph for that value.
Dynamic Dependence Graphs A dynamic dependence graph [2] (hereafter dependence graph) is a directed acyclic graph \(G = (V,E)\) with a set V of vertices and a set \(E \subseteq V \times V\) of edges. When convenient we write \(\textsf {V} (G)\) for V and \(\textsf {E} (G)\) for E. In the dependence graph for a particular program, vertices \(\alpha , \beta \in V\) are addresses associated to values (either supplied to the program as inputs or produced during evaluation) and edges \((\alpha , \beta ) \in E\) indicate that, in the evaluation of that program, the value associated to \(\beta \) depends on (is demanded by, in the terminology of § 2) the value associated to \(\alpha \). Such values may be sub-terms of a larger value. This diverges somewhat from traditional approaches to dynamic dependence graphs in only considering one type of edge (data dependency) rather than separate data and control dependencies; moreover our edges point in the direction of dependent vertices, whereas in the literature the other direction is somewhat more common.
During evaluation, when a fresh vertex \(\alpha \) is allocated for a constructed value, the dependence graph G is extended by a graph fragment specifying that \(\alpha \) depends on a set V of preexisting vertices, given by the following notation:
Definition 1
(In-star notation). Write \(\{V \mapsto \alpha \}\) as shorthand for the star graph \((\{\alpha \} \uplus V, V \times \{\alpha \})\).
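To make the graph-building step concrete, here is a minimal sketch in Python (not the PureScript implementation) of a dependence graph represented as an adjacency map, together with the in-star extension of Definition 1. The representation and names are illustrative assumptions, not Fluid's actual data structures.

```python
# Illustrative sketch only: a dependence graph as an adjacency map from each
# vertex (address) to its out-neighbours, i.e. the values that depend on it.
Graph = dict[str, set[str]]

def in_star(g: Graph, vs: set[str], alpha: str) -> None:
    """Extend g in place with the star graph {vs -> alpha}: a fresh vertex
    alpha plus one edge from every vertex in vs to alpha."""
    assert alpha not in g  # alpha must be fresh
    g[alpha] = set()
    for v in vs:
        g.setdefault(v, set()).add(alpha)

# Example: a value at address "a3" computed from the inputs at "a1" and "a2".
g: Graph = {}
in_star(g, {"a1", "a2"}, "a3")
assert g == {"a3": set(), "a1": {"a3"}, "a2": {"a3"}}
```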
Fig. 7.
Pattern matching
Fig. 8.
Operational semantics with dependence graph
Pattern matching Evaluation relies on pattern matching, which is defined by a separate judgement form, computing a vertex set. Figure 7 defines the pattern-matching judgement. Rather than matching a single value v, the judgement matches a “stack” \(\overline{v}\) of values against a continuation \(\kappa \), returning the selected branch e, an environment \(\gamma \) providing bindings for the free variables of e, and the set V of addresses found in the matched portions of \(\overline{v}\).
A continuation which is just an expression e matches only the empty stack of values, in which case pattern-matching is complete and e is the selected branch. Other rules require an eliminator \(\sigma \) as the continuation and a non-empty stack \(v \cdot \overline{v}\); any relevant subvalues of v are unpacked and pushed onto the tail \(\overline{v}\) and then recursively matched using the continuation \(\kappa \) selected from \(\sigma \). A variable eliminator \({x}\mapsto {\kappa }\) pops v off the stack, binding it to x and using \(\kappa \) to recurse; no part of v is consumed so the addresses V consumed by the recursive match are returned unmodified. A record eliminator matches a record value with address \(\alpha \) as long as the variables it binds are also fields of the record, augmenting V with the address \(\alpha \) associated with the record; the premise projects the values of the matched fields out of the record so that they can be matched in turn. Lastly, a constructor eliminator \(({c}\mapsto {\kappa })\) matches any constructor value of the form \(c(\overline{v})_{\alpha }\), augmenting V with the address \(\alpha \) associated with the constructor.
Evaluation Figure 8 defines a big-step evaluation relation \(\gamma , e, V, G \Rightarrow v, G'\) stating that term e, under an environment \(\gamma \), vertex set V and dependence graph G, evaluates to a value v and extended dependence graph \(G'\). The vertex set V records the (partial) input values consumed by the current active function call providing the dynamic context in which e is being evaluated; V is initially empty and changes whenever a function application is evaluated.
The evaluation rule for variables is fairly standard; the dependence graph is returned unmodified, because no new addresses are allocated as a result of simply looking up a variable. The rule for record projections \(e.y\) is similar: if e evaluates to a record with address \(\alpha \) containing a field y with value u, then \(e.y\) evaluates to the value u of field y, discarding the address \(\alpha \) of the record.
Introduction rules, such as for integers, functions, records and constructors, follow the pattern of assigning a fresh address \(\alpha \) for the (partial) value being constructed, and then extending G with a set of dependency edges from the vertices in V to \(\alpha \). For example, the integer rule evaluates an expression n to its value form \(n_{\alpha }\); the \(\alpha \notin \textsf {V} (G)\) constraint ensures that \(\alpha \) is fresh. The dependence graph is then extended with \(\{V \mapsto \alpha \}\), indicating that \(n_\alpha \) depended on all matched partial inputs in V. Likewise, the rule for an anonymous function \(\lambda {\sigma }\) constructs the closure \({{\texttt {cl}}}({\gamma },{\varnothing },{\sigma })_\alpha \) with fresh address \(\alpha \), capturing the current environment \(\gamma \) and using \(\varnothing \) as the set of mutually recursive definitions associated with the function, and again establishing dependency edges from V to \(\alpha \).
Rules which involve recursively evaluating subterms thread the graph under construction through the evaluation of the subterms. For example, the auxiliary evaluation relation for sequences of terms (bottom of Figure 8) evaluates each \(e_i\) in a sequence of terms to a value \(v_i\) and dependence graph \(G_{i + 1}\), which is used as the input graph for evaluating \(e_{i + 1}\). We make use of this judgement in the rules for records, constructors and foreign applications; in the last rule, \(\hat{f}\) is the foreign implementation that evaluates an application of f to a sequence of values \(\overline{v}\) and dependence graph G to a result v and extended dependence graph \(G'\) (§ 3.1).
The rules for \({{\texttt {let}}}\;{x}\mathrel {{{\texttt {=}}}}{e}\;{{\texttt {in}}}\;{e'}\) and application \({e}\,{e'}\) may involve mutual recursion, and rely on the auxiliary relation \(\gamma , \rho , V, G \rightarrowtail \gamma ', G'\) defined at the bottom of Figure 8. This judgement takes a set \(\rho \) of recursive definitions, vertex set V and dependence graph G, and returns an environment \(\gamma '\) of closures derived from \(\rho \) and extended dependence graph \(G'\). Each function definition \(\rho (x_i)\) in \(\rho \) generates a new closure \({{\texttt {cl}}}({\gamma },{\rho },{\rho (x_i)})_{\alpha _i}\) capturing \(\gamma \) and \(\rho \), with fresh address \({\alpha _i}\), and extends the dependence graph with a set of edges from V to \(\alpha _i\).
The evaluation rule for \({{\texttt {let}}}\;{\rho }\;{{\texttt {in}}}\;{e}\) is similar to the rule for regular let-bindings, except for using \(\rightarrowtail \) to build a set of closures \(\gamma '\) which extends \(\gamma \). Finally, the rule for \({e}\,{e'}\) is notable because this is where the active function context changes and the ambient V is discarded. We compute the closure \({{\texttt {cl}}}({\gamma _1},{\rho },{\sigma })_{\alpha }\) from e and the argument \(v'\) from \(e'\), and then use \(\sigma \) to match \(v'\). If pattern-matching returns selected branch \(e^{\prime \prime }\), with vertex set \(V'\) representing the consumed part of \(v'\), then \(e^{\prime \prime }\) is evaluated and becomes the result of the application, with \(V' \cup \{\alpha \}\) serving as the set of (partial) inputs associated with the new active function context.
Fig. 9.
Portion of dependence graph computed for moving averages example
Example Figure 9 shows part of the dependence graph for the moving average example in Figures 2 and 3. The graph shows the calculation of the first 3 points in the moving average plot from entries in the emissions table. This is a simplified version of the graph, since in practice they get very large; for example, we omit closures, list cells and the calculation of the divisors. Also note that the node labels here are merely illustrative: the graph only stores value dependencies, and in particular the in-neighbours of a vertex are unordered.
4 Cognacy Queries Over Dependence Graphs
We now turn to cognacy and “co-cognacy” queries over dependence graphs, expressed in terms of the \(\triangledown \) and \(\vartriangle \) operators introduced informally in § 2. Boolean algebras (Definition 2) are used to represent selections; we then define \(\triangledown \) and \(\vartriangle \) and their De Morgan duals, first for an arbitrary relation (§ 4.1) and then for the IO relation of a dependence graph (§ 4.2). Then we give procedures for computing \(\triangledown \) and \(\vartriangle \) and their duals over a given dependence graph, via an intermediate graph slice (§ 4.3).
Definition 2
(Boolean algebra). A Boolean algebra (or Boolean lattice) A is a 6-tuple \((A, \wedge , \vee , \bot , \top , \lnot )\) with carrier A, distinguished elements \(\bot \) (bottom) and \(\top \) (top), commutative operations \(\wedge \) (meet) and \(\vee \) (join) with \(\top \) and \(\bot \) as respective units, and unary operation \(\lnot \) (negate) satisfying \(x \vee \lnot x = \top \) and \(x \wedge \lnot x = \bot \). The operations \(\wedge \) and \(\vee \) distribute over each other and induce a partial order \(\le \) over A.
An input (or output) selection, where X is the set of all addresses that occur in the input (or output) of the program, is represented by the powerset Boolean algebra \(\mathbb {P}(X)\) whose carrier is the set of subsets \(X' \subseteq X\) ordered by inclusion \(\subseteq \). In this case bottom is the empty set \(\varnothing \), top is the whole set X, meet and join are given by \(\cap \) and \(\cup \), and negation by relative complement \(\setminus \), so that \(\lnot X' = X \setminus X'\).
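As a small concrete illustration of this selection algebra, the following Python snippet (a toy example with made-up address names, not the Fluid representation) checks the laws for a selection over four input addresses.

```python
# Powerset Boolean algebra P(X) of selections over a set X of addresses.
X = {"a1", "a2", "a3", "a4"}      # all input (or output) addresses
sel = {"a1", "a3"}                # a selection X' of X

bot, top = set(), X
assert (sel | (X - sel)) == top                    # x join (not x) = top
assert (sel & (X - sel)) == bot                    # x meet (not x) = bot
assert (sel & {"a3", "a4"}) == {"a3"}              # meet is intersection
assert (sel | {"a3", "a4"}) == {"a1", "a3", "a4"}  # join is union
```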
4.1 \(\triangledown \) and \(\vartriangle \) and their duals
The \(\vartriangle \) and \(\triangledown \) operators sketched earlier are simply the image and preimage functions for a particular dependency relation R. A mnemonic device may help with reading the symbols; if R is a dependency relation oriented with outputs at the top and inputs at the bottom, then demanded by \(\vartriangle \) can be read as an arrow pointing from inputs to outputs, and demands \(\triangledown \) as an arrow pointing from outputs to inputs.
Definition 3
(Image and Preimage Functions for a Relation). For a relation \(R \subseteq X \times Y\), define \(\vartriangle _R : \mathbb {P}(X) \rightarrow \mathbb {P}(Y)\) and \(\triangledown _R : \mathbb {P}(Y) \rightarrow \mathbb {P}(X)\) as:
$$\begin{aligned} \vartriangle _R(X') :=\{y \in Y \mid \exists x \in X'.\, (x, y) \in R\} \qquad \triangledown _R(Y') :=\{x \in X \mid \exists y \in Y'.\, (x, y) \in R\} \end{aligned}$$
Trivially \(\vartriangle _{R^{-1}} = \triangledown _{R}\) and \(\triangledown _{R^{-1}} = \vartriangle _{R}\). The image and preimage functions form a conjugate pair, in the sense of Jonsson and Tarski [22]:
Definition 4
(Conjugate Functions). For Boolean algebras A, B, functions \(f: A \rightarrow B\) and \(g: B \rightarrow A\) form a conjugate pair iff:
$$ f(x) \wedge y = \bot \iff x \wedge g(y) = \bot $$
(Jonsson and Tarski [22] consider only endofunctions; here we extend the idea of conjugacy to maps between Boolean algebras.)
Lemma 1
\(\vartriangle _{R}\) and \(\triangledown _{R}\) are conjugate.
This should be intuitive enough: for any subsets \(X' \subseteq X\) and \(Y' \subseteq Y\), if the elements “on the right” to which \(X'\) is related are disjoint from \(Y'\), then there are no edges in R from \(X'\) to \(Y'\); and in virtue of that, the elements “on the left” to which \(Y'\) is related must also be disjoint from \(X'\).
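The following sketch illustrates Definition 3 and the conjugacy property on a small relation; it is a toy Python model with invented element names, not anything from the Fluid implementation.

```python
Rel = set[tuple[str, str]]

def image(R: Rel, xs: set[str]) -> set[str]:       # the "demanded by" direction
    return {y for (x, y) in R if x in xs}

def preimage(R: Rel, ys: set[str]) -> set[str]:    # the "demands" direction
    return {x for (x, y) in R if y in ys}

R: Rel = {("x1", "y1"), ("x2", "y1"), ("x2", "y2")}
assert image(R, {"x1"}) == {"y1"}
assert preimage(R, {"y1"}) == {"x1", "x2"}

# Conjugacy (Lemma 1): the image of xs misses ys exactly when xs misses the
# preimage of ys.
for xs in ({"x1"}, {"x2"}, {"x1", "x2"}, set()):
    for ys in ({"y1"}, {"y2"}, {"y1", "y2"}, set()):
        assert (not (image(R, xs) & ys)) == (not (xs & preimage(R, ys)))
```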
Conjugate functions are related to another class of near-reciprocals between Boolean algebras, namely Galois connections; in fact every conjugate pair induces a Galois connection [24]. This relates the present setting to previous work on program slicing with Galois connections [28, 29, 33].
Definition 5
(Galois connection). Suppose A, B are partial orders, then monotone functions \(f: A \rightarrow B\) and \(g: B \rightarrow A\) form a Galois connection iff
$$\begin{aligned} f(x) \le y \iff x \le g(y) \end{aligned}$$
Proposition 1
Suppose functions \(f: A \rightarrow B\) and \(g: B \rightarrow A\) between Boolean algebras. The following statements are equivalent:
1.
f and g form a conjugate pair
2.
f and \(g^{\circ }\) form a Galois connection
It is also useful (both for performance reasons and as a user feature) to consider the De Morgan duals of \(\triangledown \) and \(\vartriangle \), which we write as \(\blacktriangledown \) and \(\blacktriangle \); these also form a conjugate pair.
Definition 6
(De Morgan Dual). Suppose a function \(f: A \rightarrow B\) between Boolean algebras with \(\lnot _A\) and \(\lnot _B\) the negation operators of A and B. Define the De Morgan Dual \(f^{\circ }: A \rightarrow B\) of f as \(f^{\circ }(x) :=\lnot _B\,f(\lnot _A\,x)\).
Definition 7
(Dual Image and Preimage Functions for a Relation). For a relation \(R \subseteq X \times Y\), define \(\blacktriangle _R: \mathbb {P}(X) \rightarrow \mathbb {P}(Y)\) and \(\blacktriangledown _R: \mathbb {P}(Y) \rightarrow \mathbb {P}(X)\) as \(\blacktriangle _R :=(\vartriangle _R)^{\circ }\) and \(\blacktriangledown _R :=(\triangledown _R)^{\circ }\).
Bearing in mind that \(\lnot \) for \(\mathbb {P}(X)\) is just relative complement \(X \setminus \cdot \), it is easy to show that these are indeed the intended De Morgan duals.
If \(\vartriangle _R(X')\) picks out the outputs that the elements of \(X'\) are necessary for, \(\blacktriangle _R(X')\) picks out the outputs that the elements of \(X'\) are sufficient for. \(\blacktriangledown _R(Y')\) picks out the inputs that are needed only by elements of \(Y'\).
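A sketch of the dual construction over powersets, where negation is relative complement; the relation and element names are again invented, and the helper names are assumptions rather than the paper's notation.

```python
Rel = set[tuple[str, str]]

def image(R: Rel, xs: set[str]) -> set[str]:
    return {y for (x, y) in R if x in xs}

def preimage(R: Rel, ys: set[str]) -> set[str]:
    return {x for (x, y) in R if y in ys}

def suffices(R: Rel, X: set[str], Y: set[str], xs: set[str]) -> set[str]:
    """De Morgan dual of image: complement the selection, take the image,
    complement the result. Yields the outputs computable from xs alone."""
    return Y - image(R, X - xs)

def needed_only_by(R: Rel, X: set[str], Y: set[str], ys: set[str]) -> set[str]:
    """De Morgan dual of preimage: the inputs needed only by outputs in ys."""
    return X - preimage(R, Y - ys)

R: Rel = {("x1", "y1"), ("x2", "y1"), ("x2", "y2")}
X, Y = {"x1", "x2"}, {"y1", "y2"}
assert suffices(R, X, Y, {"x2"}) == {"y2"}        # y1 also needs x1
assert needed_only_by(R, X, Y, {"y2"}) == set()   # x2 is also needed by y1
```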
4.2 \(\triangledown _G\) and \(\vartriangle _G\) for Dependence Graphs
Reachability in G induces a relation just between sources and sinks, which we call the IO relation of G. We now extend \(\triangledown _{R}\) and \(\vartriangle _{R}\) and their duals to a dependence graph G, via its IO relation.
Definition 8
(Sources and sinks). For a graph \(G = (V,E)\), write \(\textsf{S}(G)\) for the sources of G (i.e. those vertices with no in-edges) and \(\textsf{T}(G)\) for the sinks of G (those with no out-edges).
Definition 9
(Reachability relation). Define the reachability relation for a graph \(G=(V,E)\) to be the reflexive transitive closure of E.
Definition 10
(IO relation). For any dependence graph G with reachability relation R, define the IO relation of G to be \(R \cap (\textsf{S}(G) \times \textsf{T}(G))\).
The IO relation specifies how specific inputs (sources) are demanded by specific outputs (sinks); thus we interpret \((x, y) \in R\) as “x is demanded by y”. Clearly the IO relation of G is the converse of the IO relation of \(G^{-1}\). The set of inputs that some outputs Y demand, \(\triangledown _{G}(Y)\), is simply the subset of the inputs of G reachable from Y (by traversing the graph in its opposite direction). Conversely, the set of outputs demanded by some inputs X, \(\vartriangle _{G}(X)\), is simply the subset of the outputs of G reachable from X.
Definition 11
(\(\triangledown _G\) and \(\vartriangle _G\) for a dependence graph). For a graph G with IO relation R, define \(\triangledown _G :=\triangledown _R\) and \(\vartriangle _G :=\vartriangle _R\).
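The following sketch computes both operators by plain reachability over the adjacency-map representation used in the earlier sketches; it is an illustrative Python model with invented vertex names, not the direct algorithms of § 4.3.

```python
from collections import deque

Graph = dict[str, set[str]]  # vertex -> out-neighbours (every vertex is a key)

def reachable(g: Graph, start: set[str]) -> set[str]:
    seen, work = set(start), deque(start)
    while work:
        for w in g[work.popleft()]:
            if w not in seen:
                seen.add(w)
                work.append(w)
    return seen

def sources(g: Graph) -> set[str]:
    return set(g) - {w for ws in g.values() for w in ws}

def sinks(g: Graph) -> set[str]:
    return {v for v, ws in g.items() if not ws}

def flip(g: Graph) -> Graph:
    inv: Graph = {v: set() for v in g}
    for v, ws in g.items():
        for w in ws:
            inv[w].add(v)
    return inv

def demanded_by(g: Graph, xs: set[str]) -> set[str]:
    # Sinks of G reachable from the selected inputs.
    return reachable(g, xs) & sinks(g)

def demands(g: Graph, ys: set[str]) -> set[str]:
    # Sources of G reachable from the selected outputs in the flipped graph.
    return reachable(flip(g), ys) & sources(g)

g: Graph = {"i1": {"m"}, "i2": {"m"}, "m": {"o1", "o2"}, "o1": set(), "o2": set()}
assert demanded_by(g, {"i1"}) == {"o1", "o2"}
assert demands(g, {"o1"}) == {"i1", "i2"}
```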
4.3 Computing \(\triangledown _G\), \(\vartriangle _G\), \(\blacktriangle _{G}\) and \(\blacktriangledown _{G}\) for Dependence Graphs
We now show how to compute \(\triangledown _{G}\), \(\vartriangle _{G}\), \(\blacktriangle _{G}\) and \(\blacktriangledown _{G}\) for a dependence graph G, via an intermediate graph slice which we then restrict to its IO relation. First some graph notation:
Definition 12
(In-edges and out-edges). For a graph \(G=(V,E)\) and vertex \(\alpha \in V\), write \(\textsf {inE} _{G}(\alpha )\) for the in-edges of \(\alpha \) in G and \(\textsf {outE} _{G}(\alpha )\) for its out-edges.
Definition 13
(Opposite graph). For a graph G define the opposite graph \(G^{-1} :=(V,E^{-1})\) where \(\cdot ^{-1}\) denotes relational converse.
With the trace-based approaches mentioned in § 1, one can use a given algorithm to implement its De Morgan dual; for example, given a procedure for \(\vartriangle _{G}\) we can compute \(\blacktriangle _{G}\) by pre- and post-composing with negation [30]. In the dependence graph setting, we can also use a given algorithm to compute its conjugate, e.g., given a procedure for \(\vartriangle _{G}\), we can compute \(\triangledown _{G}\) simply as \(\vartriangle _{G^{-1}}\).
For efficiency, however, we give direct procedures both for \(\vartriangle _{G}\) (§ 4.3) and \(\blacktriangle _{G}\) (§ 4.3), which can then be used to derive implementations of the other operators, as shown at the end of this section. Each procedure factors through an auxiliary operation that computes a “slice” of the original graph, i.e. a contiguous subgraph. While it is technically possible to compute the desired image/preimage of the IO relation without creating this intermediate graph, we anticipate use cases which will make use of the graph slice; these are discussed in more detail in § 7. Our implementation makes it easy to flip between G and \(G^{-1}\) so we freely make use of both in the algorithms to access out-neighbours and in-neighbours.
Direct algorithm for \(\vartriangle _{G}\) (demanded by)
Definition 14
(demBy\(_{G}\)). Figure 10a defines the family of relations \(\mathrel {\textsf {demBy} _{G}}\) for dependence graph G.
For a dependence graph G and inputs \(X \subseteq \textsf{S}(G)\), the judgement \(X\; \smash {\mathrel {\textsf {demBy} _{G}}} Y\) says that X is demanded by \(Y \subseteq \textsf{T}(G)\). The algorithm defers to an auxiliary operation \(\mathrel {\textsf {demByV} }\), which takes a (partial) graph slice \(H \subseteq G\), initially containing only the original input vertices X and no edges, and which proceeds as follows. If there is a sink of H that still has outgoing edges in G, then the targets of those edges are reachable from the vertices of H, and so we must add them to H and recurse. If \(\textsf{T}(H) \subseteq \textsf{T}(G)\), then there are no unexplored nodes in G that are reachable from H, and so we terminate with the current state of H. We note that it is enough to consider the sinks of H, since we move every outgoing edge of a vertex from G to H all at once. When \(\mathrel {\textsf {demByV} }\) is done, \(\mathrel {\textsf {demBy} _{G}}\) returns the sinks \(\textsf{T}(H)\) of the final graph slice H, representing the outputs that the original inputs are demanded by.
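A compact approximation of this procedure in Python, using the adjacency-map representation from the earlier sketches; the function names and representation are assumptions, and the slice is returned only to mirror the role of H.

```python
Graph = dict[str, set[str]]  # vertex -> out-neighbours (every vertex is a key)

def dem_by(g: Graph, xs: set[str]) -> tuple[set[str], Graph]:
    """Grow a slice h forwards from the inputs xs, moving all out-edges of a
    vertex at once; on termination every sink of h is also a sink of g, and
    the sinks of h are the outputs that xs is demanded by."""
    h: Graph = {x: set() for x in xs}
    frontier = set(xs)
    while frontier:
        v = frontier.pop()
        for w in g[v]:
            h[v].add(w)
            if w not in h:
                h[w] = set()
                frontier.add(w)
    return {v for v, ws in h.items() if not ws}, h

g: Graph = {"i1": {"m"}, "i2": {"m"}, "m": {"o1", "o2"}, "o1": set(), "o2": set()}
ys, slice_h = dem_by(g, {"i2"})
assert ys == {"o1", "o2"}
```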
Proposition 2
(demBy\(_{G}\) Computes \(\vartriangle _{G}\)). For any dependence graph G, any \(X \subseteq \textsf{S}(G)\) and any \(Y \subseteq \textsf{T}(G)\) we have:
$$\begin{aligned} X \mathrel {\textsf {demBy} _{G}} Y \iff \vartriangle _{G}(X) = Y \end{aligned}$$
Proof
See the supplementary material.
Fig. 10.
Dual analyses over a dependence graph G
Direct algorithm for \(\blacktriangle _{\boldsymbol{G}}\) (suffices)
Definition 15
(suff\(_{G}\)). Figure 10b defines the family of relations \(\mathrel {\textsf {suff} _{G}}\) for dependence graph G.
Like \(\mathrel {\textsf {demBy} _{G}}\), the algorithm \(\mathrel {\textsf {suff} _{G}}\) also delegates to an auxiliary operation \(\mathrel {\textsf {suffE} }\) which builds a slice of G. The operation \(\mathrel {\textsf {suffE} }\) takes the graph slice H under construction, a “pending” subgraph P of nodes for which we have discovered partial information, and the remaining unexplored graph G, and returns the final graph slice \(H'\). Whenever we have a vertex in H which still has outgoing edges in G, we add those edges and their endpoints to the pending graph P. If a vertex \(\alpha \) has no more incoming edges in G, we can then move \(\alpha \) and its incoming edges from P to H, as this only happens when we have already moved every ancestor of \(\alpha \) into H. If neither scenario is the case, then we terminate with the (potentially incomplete) graph H. Once \(\mathrel {\textsf {suffE} }\) has terminated with H, \(\mathrel {\textsf {suff} _{G}}\) returns just the vertices in H that are also sinks in G. If there are no such vertices, it simply means that the input data X was insufficient to compute any of the original program’s outputs.
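A simplified sketch of the same idea in Python, tracking vertex sets rather than explicit pending edges; representation and names are assumptions carried over from the earlier sketches.

```python
Graph = dict[str, set[str]]  # vertex -> out-neighbours (every vertex is a key)

def in_neighbours(g: Graph, v: str) -> set[str]:
    return {u for u, ws in g.items() if v in ws}

def suff(g: Graph, xs: set[str]) -> set[str]:
    """Grow h from the inputs xs, admitting a vertex only once every one of
    its in-neighbours is already in h; the sinks of g that end up in h are
    the outputs that xs suffices to compute."""
    h = set(xs)
    changed = True
    while changed:
        changed = False
        pending = {w for v in h for w in g[v]} - h
        for w in pending:
            if in_neighbours(g, w) <= h:
                h.add(w)
                changed = True
    return h & {v for v, ws in g.items() if not ws}

g: Graph = {"i1": {"m"}, "i2": {"m", "o3"}, "m": {"o1"}, "o1": set(), "o3": set()}
assert suff(g, {"i2"}) == {"o3"}              # i2 alone suffices only for o3
assert suff(g, {"i1", "i2"}) == {"o1", "o3"}
```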
We provide an example of a run of the algorithm in Figure 11. Here, the green vertices and thick edges are in H, the orange vertices and dashed arrows are in P, and the blue vertices and normal arrows are in G. In the figure, steps 2, 3, 5 and 6 correspond to applications of the rule that moves out-edges of vertices in H into the pending graph P, and steps 4 and 7 correspond to applications of the rule that moves completed vertices from P into H. After step 7, the run is complete.
Fig. 11.
Run of \(\mathrel {\textsf {suff} }\) with G and H superimposed on \(G_0\)
Proposition 3
(suff\(_{G}\)Computes\(\blacktriangle _{G}\)). For any dependence graph G, any \(X \subseteq \textsf{S}(G)\) and any \(Y \subseteq \textsf{T}(G)\) we have:
$$\begin{aligned} X \mathrel {\textsf {suff} _{G}} Y \iff \blacktriangle _{G}(X) = Y \end{aligned}$$
Proof
See the supplementary material.
Derived algorithms Propositions 2 and 3 justify treating \(\mathrel {\textsf {demBy} _{G}}\) and \(\mathrel {\textsf {suff} _{G}}\) as functions. Using Lemma 2 we can define a direct algorithm for \(\triangledown _{G}\) from \(\mathrel {\textsf {demBy} _{G}}\) by simply flipping the graph, and the same can be done to acquire an algorithm for \(\blacktriangledown _{G}\) from \(\mathrel {\textsf {suff} _{G}}\). Using Lemma 2 and Definition 6 we can also define alternative algorithms for all 4 operators using the De Morgan dual construction; for example, \(\vartriangle _{G}\) can be computed as the dual of \(\mathrel {\textsf {suff} _{G}}\), and \(\blacktriangle _{G}\) as the dual of \(\mathrel {\textsf {demBy} _{G}}\).
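A sketch of these derived algorithms in Python, reusing `dem_by`, `suff`, `flip`, `sources` and `sinks` from the sketches above (all assumed helper names, not the Fluid API):

```python
def demands(g, ys):
    # The demands operator, obtained by running dem_by over the flipped graph.
    outs, _ = dem_by(flip(g), ys)
    return outs

def needed_only_by(g, ys):
    # The dual operator in the other direction: suff over the flipped graph.
    return suff(flip(g), ys)

def demanded_by_via_suff(g, xs):
    # Demanded-by as the De Morgan dual of suff: complement the input
    # selection, run suff, complement the result (cf. G-DemBy-Suff in § 5).
    return sinks(g) - suff(g, sources(g) - xs)
```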
In § 5 we contrast the performance of the direct and De Morgan dual implementations of \(\vartriangle _{G}\), used to implement linked inputs/outputs as shown in § 1.
4.4 Linked Inputs and Linked Outputs for Dependence Graphs
Now we can provide a formal account of the notions of linked inputs and outputs.
Definition 16
(Linked inputs and outputs). For a dependence graph G, define the linked inputs operator as \(\triangledown _{G} \circ \vartriangle _{G}\) and the linked outputs operator as \(\vartriangle _{G} \circ \triangledown _{G}\).
Intuitively, linked outputs and linked inputs are relations of cognacy (common ancestry) in G and \(G^{-1}\) respectively. For a set of inputs \(X \subseteq \textsf{S}(G)\), linked inputs asks “what other inputs are demanded by the outputs which demand X?”; it first finds all outputs \(\vartriangle _{G}(X)\) that our inputs are demanded by, and then computes the inputs that those outputs demand, i.e. \(\triangledown _{G}(\vartriangle _{G}(X))\). Conversely for a set of outputs \(Y \subseteq \textsf{T}(G)\), linked outputs asks “what other outputs demand the inputs that Y demands?”; it first finds all inputs \(\triangledown _{G}(Y)\) that our outputs demand, and then computes the outputs that demand those inputs, i.e. \(\vartriangle _{G}(\triangledown _{G}(Y))\).
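Continuing the sketch, the two composites can be written directly in terms of the `demands` and `demanded_by` helpers defined earlier (assumed names from the previous sketches):

```python
def linked_inputs(g, xs):
    # Inputs demanded by the outputs that demand xs (includes xs itself).
    return demands(g, demanded_by(g, xs))

def linked_outputs(g, ys):
    # Outputs that demand the inputs demanded by ys (includes ys itself).
    return demanded_by(g, demands(g, ys))
```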
5 Evaluation
We now compare two implementations of the dependency-tracking runtime of Fluid, one based on the dependence graph design described in Sections 3 and 4 and one based on the main alternative style of implementation, namely a bidirectional interpreter with two components: a forward evaluator which performs a forwards analysis and produces an execution trace, and a backward evaluator which consumes the execution trace and performs a backwards analysis [30, 32, 33]. Other Fluid system components were shared by the two implementations, including the parser, desugaring and visualisation layers, and libraries.
For a language implementor, the main benefit of our graph approach is that it removes the need to implement a bidirectional interpreter. Otherwise, changing the language requires modifying each direction of the interpreter in a manner that maintains bidirectionality — a considerable effort. In our approach the graph algorithms for computing \(\triangledown _{G}\) and (Figure 10) are language-agnostic and only have to be defined once.
For a user, our hypothesis is that our approach has better performance than trace-based techniques; to test this, we evaluated a number of benchmark programs taking examples from data analysis and data visualisation (§ 5 below). Because interactive performance plays a critical role for the user, we measured the following:
Q1: Overhead of building a dependence graph versus building a trace;
Q2: Performance of computing demands over a graph vs. a trace;
Q3: Performance of various implementations of demanded by: trace-based, graph-based using \(\mathrel {\textsf {demBy} }\), and graph-based using the dual of \(\mathrel {\textsf {suff} }\).
We assessed performance according to the guidelines proposed by Nielsen [26]:
100ms: limit for an interaction feeling instantaneous;
1000ms: limit for a user’s train of thought to remain uninterrupted, although they may perceive a delay;
10000ms: limit to keep a user’s attention on the system.
Ideally, interactive queries (Q2 and Q3) would stay as close as possible to the instantaneous category, and the time taken to evaluate programs (Q1) would stay within the limit for keeping a user’s attention. Experiments were timed on an Intel Core i7-10850H 2.70GHz with 16 GB of RAM, using Chrome 121.0.6261.69 and JavaScript runtime V8 12.2.281.16. We used PureScript as the host language.
Choice of examples We collected some canonical examples from data visualization and analytics. First, we considered matrix convolution. Given that different kernels give rise to subtly different dependency structures, we implemented three different kernels (edge-detect, emboss, gaussian, 31 lines of code (LOC) each). Second, we developed a full-featured graphics library based on SVG that demonstrates the performance on bespoke visualisation code (grouped-bar-chart, 140 LOC, line-chart, 143 LOC, stacked-bar-chart, 136 LOC). Finally, we tested two examples that use a D3.js front end for visualisation (stacked-bar-scatter-plot, 26 LOC, and bar-chart-line-chart, 38 LOC), the same front end used in Figures 2 and 4. The code for our benchmarks is given in the supplementary material.
In the experiments, each benchmark is run 10 times, and we report the mean runtime and standard deviation (in parentheses, coloured grey). Runtimes are reported in milliseconds, to 1 decimal place. When we refer to speedup or slowdown in reference to a pair of implementations, we mean the ratio of average runtimes for that benchmark.
Q1: Overhead of graph construction vs. trace construction Table 1 compares the average time taken to evaluate a program in the graph approach (G-Eval) with the time taken in the trace-based approach (T-Eval). The ratio of graph-based to trace-based time is shown in the third column (Eval-Slowdown). Since building the graph involves maintaining a heap of allocated vertices, and we also build the inverse graph at the same time, we expect a larger overhead for the graph approach. As expected, we do see a higher overhead, with graph-based evaluation from 2.5x to 15x slower than its trace-based counterpart. For both approaches, program evaluation always stays within the limit to keep a user’s attention on the system (10000ms).
Table 1.
Average Evaluation Time for Traces vs. Graphs (ms, 1 decimal place)
Q2: Graph-based demands vs. trace-based demands In Table 2, columns T-Demands and G-Demands show the average times taken to compute demands with the trace and graph approaches, respectively. The ratio of graph-based to trace-based performance is in the last column, Bwd-Speedup. We compute demands over the graph by running the algorithm for demBy from Figure 10a over the opposite graph. In these examples, we keep our query selections small, generally only selecting one or two output nodes.
Overall, we observe a significant speedup in the performance of demands using the graph approach, from 33x speedup for the stacked-bar-chart to more than 80x for gaussian. In particular, every graph-based demands query is within the “instantaneous” category, whilst all but two of the examples using the trace approach are in the “noticeable delay” category. We attribute this speedup to a couple of factors. First, in the trace approach, each backwards query involves rewinding the entire execution, regardless of whether all of the trace is relevant; graph queries only traverse the relevant parts of the graph. Second, in the trace approach, backwards evaluation makes frequent use of lattice join (\(\vee \)) to combine slicing information from different branches of the computation; joining closure slices in particular has the potential to be expensive because closures contain environments (which in turn contain closures, and so on). The graph approach lends itself to a more imperative implementation style, where demand information accrues against vertices as the backwards analysis proceeds, with no separate join steps.
Table 2.
Trace-Based vs. Graph-Based Implementations of Demands (ms, 1 d.p.)
Q3: Demanded By Table 3 contrasts average times for the trace implementation of demanded by (T-DemBy) with the graph-based algorithm from Figure 10a (G-DemBy) and the graph-based approach using the De Morgan dual of the \(\mathrel {\textsf {suff} }\) algorithm from Figure 10b (G-DemBy-Suff). S is the speedup of G-DemBy compared to T-DemBy, and S’ is the speedup of G-DemBy-Suff compared to T-DemBy. Again, because our selections are small, the dual of \(\mathrel {\textsf {suff} }\), which traverses a “complement” subgraph, potentially explores a much larger subgraph than \(\mathrel {\textsf {demBy} }\). For all benchmarks, the graph \(\mathrel {\textsf {demBy} }\) algorithm in Figure 10a performs better than the other two approaches, with five benchmarks running instantaneously, as opposed to only two for the trace-based approach.
Table 3.
Trace-Based vs. Graph-Based Implementations of Demanded By (ms, 1 d.p.) where \(S= \textit{T-DemBy}/\textit{G-DemBy}\) and \(S' = \textit{T-DemBy}/\textit{G-DemBy-Suff}\)
Discussion When comparing the graph and trace approaches, the main shortcoming of the graph approach is the increased cost of evaluating a program to a graph versus a trace (Q1). We propose that this overhead is worth paying; the graph is only constructed once, and evaluation time still remains within the limit for keeping the user’s attention. More importantly, the fast response times for demands and demanded by compared to the trace approach (Q2 and Q3) are good enough for instantaneous queries. Also notable (for Q3) is the performance of G-DemBy-Suff, which uses the dual of the \(\mathrel {\textsf {suff} }\) algorithm, and which tends to perform significantly worse than G-DemBy. This can perhaps be explained by the fact that our selections are small: since the selection is complemented, the portion of the graph that the algorithm traverses is in general much larger. It is also plausible that some of this disparity can be explained by \(\mathrel {\textsf {suff} }\) requiring more bookkeeping than \(\mathrel {\textsf {demBy} }\).
6 Related work
6.1 Data Provenance
Provenance research has a long history, ranging from scientific workflow provenance [9, 12] to more fine-grained database techniques like where provenance [8], with comparatively less emphasis on general-purpose languages. Dietrich et al. [14] track both value and control dependencies in recursive database queries and user-defined functions by rewriting SQL queries to provenance-tracking counterparts; Fehrenbach and Cheney [16] explore provenance for language-integrated query, extending the SQL fragment of the multi-tier language Links [11] with provenance tracking. For structured outputs like visualisations, the most closely related lines of work are Psallidas and Wu [32] for relational languages, and Perera et al. [30] for general-purpose languages. These authors all emphasise the importance of cognacy, i.e. common ancestry, albeit not in the context of dependence graphs, showing how to “link” outputs by tracing back from an output selection to a data selection and then forward to another output selection. This seems to be largely unstudied in the provenance literature, despite its importance in data visualisation.
6.2 Dynamic Dependence Graphs
Dynamic dependence graphs have been used extensively for program slicing [18, 19], optimisation [17], incremental computation [1] and fault localisation [37]. Most of these applications are for imperative languages, where it is useful to distinguish between control and data edges; because Fluid is pure, we use somewhat simpler dependence graphs with a single kind of edge. However different kinds of edge are likely to be needed for future applications (§ 7). The main contribution of our work to the dependence graph paradigm is a pair of cognacy operators which allow the identification of minimally related sets of inputs (linked inputs) and minimally related sets of outputs (linked outputs), in a formal setting where we show these operators to be self-conjugate and related to Galois connections.
6.3 Galois Slicing
Although we do not consider program slices here, our formal setting is closely related to bidirectional program slicing techniques based on Galois connections [28, 29, 33]. In these works, the forward and backward directions of the interpreter correspond to our sufficiency (\(\blacktriangle \)) and demands (\(\triangledown \)) queries. Perera et al. [30] extended this approach to a setting where slices have complements, and showed how composing the backward analysis with the De Morgan dual of the forwards analysis can be used to implement linked brushing. As well as requiring a bidirectional interpreter, the main difficulty with this approach has been achieving interactive performance. Implementations have relied on a sequential execution trace, used in the forwards direction to “replay” the computation, propagating sufficiency information from inputs to outputs, and in the backwards direction to “rewind” the computation, propagating demand from outputs to inputs. As discussed in § 5, and in contrast to the graph approach presented here, this approach is overly sequential and unable to exploit the independence of some subcomputations that makes slicing useful in the first place.
6.4 Linked Visualisations
The connection between visual selection (such as clicking or lassoing) and data “selection” has been explored in the visualisation literature, leading to various techniques for inverting selections in visual space to obtain selections in the underlying data [20, 27]. There is also a literature on linked brushing [4, 23] as a way of connecting output selections via mediating data selections. Although this has long been recognised as an important visual tool, with Roberts and Wright [34] arguing it should be ubiquitous, there is relatively little work on general-purpose infrastructure to support it. For example, Reactive Vega [35] provides a rich set of interactivity features to support linked brushing, but no mechanism for automatically propagating selections through arbitrary intervening queries; Glue [3] is a powerful Python library for building linked visualisations for scientific applications, but the developer is responsible for specifying the relationships between data sets that underpin the linking. More infrastructural approaches to linked brushing [23, 27, 32], similar in spirit to our work, have been developed mainly for relational languages. Relational languages are a promising direction in data visualisation, but general-purpose languages continue to be widely used too, so it is important to support linked brushing in this context as well.
6.5 Transparent Research Outputs
Dragicevic et al. [15] also develop techniques for transparent research outputs, via explorable multiverse analyses, which expose analysis parameters: the methodological choices made in designing a data analysis. Since different methodological choices will impact the conclusions of the analysis, they expose these choices to users and allow them to switch between different choices and observe the impact on the results. Our work is potentially complementary: we are concerned with surfacing the dependencies that arise within a specific choice of analysis parameters, whereas multiverse analyses are about exploring how things change under different choices. It might be useful to package our dependency analysis as part of such a multiverse analysis, allowing the user to observe the different dependency structures that arise from different methodological decisions.
7 Conclusion and Future Work
Visualisations are an essential tool for communicating science and other data-driven claims, but can be hard to make sense of and trust. Visualisations and other outputs that are “transparent” – that can reveal to an interested reader how they are related to data – are more informative and trustworthy but are also more difficult to produce. In this paper, we introduced a novel bidirectional program analysis framework for a general-purpose programming language that makes it easier to create transparent visualisations by shifting much of the burden to the language runtime. Relative to prior work based on execution traces and bidirectional interpreters, our approach also makes life easier for the language implementor, by decomposing the system into a single evaluator that builds a dependence graph, and a pair of language-agnostic bidirectional graph algorithms. We also showed an overall performance improvement for provenance queries, at the cost of some overhead in building the dependence graph.
We identify two important directions for future work. First, the dependency relation captured by the dependence graph semantics in § 3 omits certain intuitively plausible edges, such as the projection edge that one might expect to exist between a record and the value of a contained field as a consequence of evaluating a field access expression \(e.x\). On the other hand, recording all structural dependencies of this nature would significantly bloat the dependence graph with information that is only relevant in certain contexts (for example, when one is specifically concerned with where a value came from, rather than how it was computed). We would like to develop a more semantically justified dependence graph, along with a proof that the graph is “complete” in some formal sense (cf. the dependency correctness notion of Cheney et al. [10]), but anticipate that making this richer graph practical will require distinguishing different kinds of query (e.g. “how” vs. “where from”) for use in different contexts.
Second, although the algorithms presented in § 4 compute graph slices as an interim structure, the internal nodes of the graph slices are discarded. This is adequate for the use cases in § 2, which only concern “extensional” (IO) transparency. However, intensional information would be useful too: for example, highlighting a point in the moving average in Figure 2 could show not only that 3 emissions inputs were relevant, but that those values were summed and divided by 3. This would connect to work on executable program slicing [18]. One challenge here is managing the size and complexity of intensional explanations; again Cheney et al. [10] offer inspiration. Their idea of expression provenance may be a way of presenting pared-back but still informative explanations (for example, showing the tree of primitive arithmetic operations that computed a value but omitting user-defined function calls and branches). And the user may only want intensional information for certain subcomputations, in which case graph transformations which elide internal nodes but preserve IO connectivity, such as the Y-\(\varDelta \) transform [38], may also have a role to play.
Acknowledgments
This research received support through Schmidt Sciences, LLC via the Institute of Computing for Climate Science (Perera and Orchard).
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.